Monthly Archives: November 2015

Exercise: Web services

The aim of this practical is to introduce you to the concept of web services as well as to a few useful standard command-line tools and how one can pipe data from one tool into another. Web services are, simply put, websites that are meant to be used by computers rather than humans.

Fetching a URL from the command line

The previous exercises used this article to illustrate named entity recognition. If you want to work with it outside the web browser, you will want to change two things: 1) you will probably not want to work with an HTML web page, but rather retrieve it in XML format, and 2) you will want to retrieve the article with something else than a web browser:

curl 'http://journals.plos.org/plosone/article/asset?id=10.1371/journal.pone.0132736.XML'

Submitting text to the tagger

In the NER practical, you used the a web service for NER; however, the complexity was hidden from you in the EXTRACT bookmarklet. The way the bookmarklet works, is that it sends text from your web browser to a remove tagging web service and subsequently displays the results.

Let us start by looking behind the curtain and see how an EXTRACT popup is produced. When selecting the the header of the article and clicking the bookmarklet, your browser retrieves the following page to show in the popup:

http://tagger.jensenlab.org/Extract?document=Novel%20ZEB2-BCL11B%20Fusion%20Gene%20Identified%20by%20RNA-Sequencing%20in%20Acute%20Myeloid%20Leukemia%20with%20t(2;14)(q22;q32)&entity_types=9606%20-26

As you can see, the URL contains data, namely the text to be tagged as well as information on which types of named entities we want to have recognized in the text.

You can retrieve the same information in a tab-delimited format, which is far more useful for computational purposes:

http://tagger.jensenlab.org/GetEntities?document=Novel%20ZEB2-BCL11B%20Fusion%20Gene%20Identified%20by%20RNA-Sequencing%20in%20Acute%20Myeloid%20Leukemia%20with%20t(2;14)(q22;q32)&entity_types=9606%20-26&format=tsv

If you want, you can use the curl command to retrieve the same data from the command line.

Retrieving a protein network

Bioinformatics web services are not limited to text mining. For example, the STRING database of protein interactions can also be accessed as a web service. The following URL gives you an interaction network for BCL11B as an image:

http://string-db.org/api/image/network?identifier=ENSP00000349723

Modifying it just slightly, allows you to retrieve the same interactions in PSI-MI-TAB format:

http://string-db.org/api/psi-mi-tab/interactions?identifier=ENSP00000349723
You obtain the exact same data in the command line by running this command:

curl 'http://string-db.org/api/psi-mi-tab/interactions?identifier=ENSP00000349723'

Putting it all together

Using pipes, it is possible to put together multiple different web services and local programs to accomplish complex tasks. Here is an example that puts together everything you have learned above:

curl 'http://journals.plos.org/plosone/article/asset?id=10.1371/journal.pone.0132736.XML' | curl --data-urlencode 'document@-' --data-urlencode 'entity_types=9606' --data-urlencode 'format=tsv' 'http://tagger.jensenlab.org/GetEntities' | cut -f3 | sort -u | grep '^ENSP' | curl --data-urlencode 'identifiers@-' --data-urlencode 'limit=0' 'http://string-db.org/api/psi-mi-tab/interactionsList' > string_network.tsv

Let us pick apart this monstrosity of a command and see what it does:

  • The first curl command fetches a full-text article from PLOS ONE in XML format
  • The second curl command submits this document to the tagger REST web service, to perform named entity recognition of human genes/proteins
  • The cut command pulls out only column three from the resulting output, which contains the identifiers of the recognized entities
  • The grep command find only the identifiers that start with “ENSP”, which is the proteins
  • The third curl command submits this list of protein identifiers to the STRING database to retrieve a protein interaction network of them in PSI-MI-TAB format
  • Finally, we put that network into a file called string_network.tsv on our server.

In other words, with a single pipe of commands that interacts with three different servers we manage to retrieve a full-text article, perform named entity recognition of human proteins and obtain protein interactions among them. Note that whereas this is possible, it will often be desirable to store some of the intermediate results in files instead of using pipes.

By slightly modifying the command, it is possible to instead retrieve this as an image:

curl 'http://journals.plos.org/plosone/article/asset?id=10.1371/journal.pone.0132736.XML' | curl --data-urlencode 'document@-' --data-urlencode 'entity_types=9606' --data-urlencode 'format=tsv' 'http://tagger.jensenlab.org/GetEntities' | cut -f3 | sort -u | grep '^ENSP' | curl --data-urlencode 'identifiers@-' --data-urlencode 'limit=0' --data-urlencode 'network_flavor=confidence' 'http://string-db.org/api/image/networkList' > string_network.png

STRING network

Advertisements

Exercise: The dictionary-based approach to named entity recognition

In this practical, you will be using a newly developed tool EXTRACT, which shares many parts with the better known tool Reflect. Unlike ABNER, these tools identify named entities by matching a large dictionary against the text.

For this exercise, please install the pre-release version of EXTRACT2 into your web browser’s bookmark bar. If you have problems, e.g. enabling the bookmark bar in your browser, please check the FAQ.

Open in your web browser both the abstract and the full-text version of the article “Novel ZEB2-BCL11B Fusion Gene Identified by RNA-Sequencing in Acute Myeloid Leukemia with t(2;14)(q22;q32)”. Run EXTRACT on both and inspect the results. You can find more details by either click on an annotation or by selection a text region and clicking the bookmarklet again.

Questions

  • Does EXTRACT distinguish between genes and proteins?
  • If so, how can it tell when a name to a gene and when it refers to its protein product?
  • Can EXTRACT identify which gene/protein a given name refers to?
  • Does it identify any other named entities than genes and proteins in this abstract?
  • Which, if any, additional types of named entities do EXTRACT find in the full-text article?
  • Where in the full-text article do you find most cases of wrong EXTRACT annotations?

Exercise: The machine-learning approach to named entity recognition

In this practical, you will be using the well-known tool ABNER, which relies on a statistical machine-learning method called conditional random fields (CRFs) to recognize entities in text.

To install ABNER, simply download abner.jar from the website. To run it, either double-click on the jar file or type java -jar abner.jar on the command line). As ABNER is a Java program, it requires that you have Java installed on your computer. If not, you need to also download the installer and install it. If you use a Mac, you likely need to go to System Preferences and then Security & Privacy to allow ABNER to be run.

Retrieve the title and abstract of the publication “Novel ZEB2-BCL11B Fusion Gene Identified by RNA-Sequencing in Acute Myeloid Leukemia with t(2;14)(q22;q32)” from PubMed. Use ABNER to annotate named entities according to both the NLPBA and BioCreative probabilistic models models.

Questions

  • Do the two models (NLPBA and BioCreative) annotate the same proteins in the text?
  • Does ABNER distinguish between genes and proteins?
  • If so, how can it tell when a name to a gene and when it refers to its protein product?
  • Can ABNER identify which gene/protein a given name refers to?
  • Does ABNER identify any other named entities than genes and proteins in this abstract?