Exercise: Web services

The aim of this practical is to introduce you to the concept of web services as well as to a few useful standard command-line tools and how one can pipe data from one tool into another. Web services are, simply put, websites that are meant to be used by computers rather than humans.

Fetching a URL from the command line

The previous exercises used this article to illustrate named entity recognition. If you want to work with it outside the web browser, you will want to change two things: 1) you will probably not want to work with an HTML web page, but rather retrieve the article in XML format, and 2) you will want to retrieve it with something other than a web browser:

curl 'http://journals.plos.org/plosone/article/asset?id=10.1371/journal.pone.0132736.XML'

Submitting text to the tagger

In the NER practical, you already used a web service for NER; however, the complexity was hidden from you by the EXTRACT bookmarklet. The way the bookmarklet works is that it sends text from your web browser to a remote tagging web service and subsequently displays the results.

Let us start by looking behind the curtain to see how an EXTRACT popup is produced. When selecting the header of the article and clicking the bookmarklet, your browser retrieves the following page to show in the popup:


As you can see, the URL contains data, namely the text to be tagged as well as information on which types of named entities we want to have recognized in the text.

You can retrieve the same information in a tab-delimited format, which is far more useful for computational purposes:


If you want, you can use the curl command to retrieve the same data from the command line.
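For instance, a command along these lines should return the tab-delimited results directly in your terminal. This is only a sketch: the endpoint and parameter names are the same ones used in the pipeline further down in this exercise, and the short text passed as the document is simply the article title.

curl --data-urlencode 'document=Novel ZEB2-BCL11B fusion gene identified by RNA-sequencing in acute myeloid leukemia' --data-urlencode 'entity_types=9606' --data-urlencode 'format=tsv' 'http://tagger.jensenlab.org/GetEntities'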

Retrieving a protein network

Bioinformatics web services are not limited to text mining. For example, the STRING database of protein interactions can also be accessed as a web service. The following URL gives you an interaction network for BCL11B as an image:


Modifying it just slightly allows you to retrieve the same interactions in PSI-MI-TAB format:

You can obtain the exact same data on the command line by running this command:

curl 'http://string-db.org/api/psi-mi-tab/interactions?identifier=ENSP00000349723'

Putting it all together

Using pipes, it is possible to put together multiple different web services and local programs to accomplish complex tasks. Here is an example that puts together everything you have learned above:

curl 'http://journals.plos.org/plosone/article/asset?id=10.1371/journal.pone.0132736.XML' | curl --data-urlencode 'document@-' --data-urlencode 'entity_types=9606' --data-urlencode 'format=tsv' 'http://tagger.jensenlab.org/GetEntities' | cut -f3 | sort -u | grep '^ENSP' | curl --data-urlencode 'identifiers@-' --data-urlencode 'limit=0' 'http://string-db.org/api/psi-mi-tab/interactionsList' > string_network.tsv

Let us pick apart this monstrosity of a command and see what it does:

  • The first curl command fetches a full-text article from PLOS ONE in XML format.
  • The second curl command submits this document to the tagger REST web service to perform named entity recognition of human genes/proteins.
  • The cut command pulls out only column three from the resulting output, which contains the identifiers of the recognized entities.
  • The sort -u command sorts these identifiers and removes duplicates.
  • The grep command keeps only the identifiers that start with “ENSP”, which are the proteins.
  • The third curl command submits this list of protein identifiers to the STRING database to retrieve a protein interaction network for them in PSI-MI-TAB format.
  • Finally, we put that network into a file called string_network.tsv on our server.

In other words, with a single pipeline of commands that interacts with three different servers, we manage to retrieve a full-text article, perform named entity recognition of human proteins, and obtain the protein interactions among them. Note that while this is possible, it will often be desirable to store some of the intermediate results in files instead of using pipes, as sketched below.
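As a rough sketch, the same analysis could be broken into separate steps with each intermediate result saved to a file. The endpoints and parameters are identical to the pipeline above; only the plumbing differs, and the file names article.xml, entities.tsv, and proteins.txt are simply chosen for illustration:

# Step 1: fetch the full-text article in XML format
curl 'http://journals.plos.org/plosone/article/asset?id=10.1371/journal.pone.0132736.XML' > article.xml
# Step 2: run named entity recognition on the saved document
curl --data-urlencode 'document@article.xml' --data-urlencode 'entity_types=9606' --data-urlencode 'format=tsv' 'http://tagger.jensenlab.org/GetEntities' > entities.tsv
# Step 3: extract the unique ENSP protein identifiers
cut -f3 entities.tsv | sort -u | grep '^ENSP' > proteins.txt
# Step 4: retrieve the STRING network for these proteins
curl --data-urlencode 'identifiers@proteins.txt' --data-urlencode 'limit=0' 'http://string-db.org/api/psi-mi-tab/interactionsList' > string_network.tsv

Keeping the intermediate files also makes it easier to inspect what each step produced and to rerun only the later steps if something goes wrong.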

By slightly modifying the command, it is possible to instead retrieve this as an image:

curl 'http://journals.plos.org/plosone/article/asset?id=10.1371/journal.pone.0132736.XML' | curl --data-urlencode 'document@-' --data-urlencode 'entity_types=9606' --data-urlencode 'format=tsv' 'http://tagger.jensenlab.org/GetEntities' | cut -f3 | sort -u | grep '^ENSP' | curl --data-urlencode 'identifiers@-' --data-urlencode 'limit=0' --data-urlencode 'network_flavor=confidence' 'http://string-db.org/api/image/networkList' > string_network.png

STRING network

Exercise: The dictionary-based approach to named entity recognition

In this practical, you will be using the newly developed tool EXTRACT, which shares many parts with the better-known tool Reflect. Unlike ABNER, these tools identify named entities by matching a large dictionary against the text.

For this exercise, please install the pre-release version of EXTRACT2 into your web browser’s bookmark bar. If you have problems, e.g. with enabling the bookmark bar in your browser, please check the FAQ.

Open both the abstract and the full-text version of the article “Novel ZEB2-BCL11B Fusion Gene Identified by RNA-Sequencing in Acute Myeloid Leukemia with t(2;14)(q22;q32)” in your web browser. Run EXTRACT on both and inspect the results. You can find more details by either clicking on an annotation or by selecting a text region and clicking the bookmarklet again.


  • Does EXTRACT distinguish between genes and proteins?
  • If so, how can it tell when a name refers to a gene and when it refers to its protein product?
  • Can EXTRACT identify which gene/protein a given name refers to?
  • Does it identify any named entities other than genes and proteins in this abstract?
  • Which, if any, additional types of named entities does EXTRACT find in the full-text article?
  • Where in the full-text article do you find most cases of wrong EXTRACT annotations?

Exercise: The machine-learning approach to named entity recognition

In this practical, you will be using the well-known tool ABNER, which relies on a statistical machine-learning method called conditional random fields (CRFs) to recognize entities in text.

To install ABNER, simply download abner.jar from the website. To run it, either double-click on the jar file or type java -jar abner.jar on the command line. As ABNER is a Java program, it requires that you have Java installed on your computer. If not, you will also need to download the Java installer and install it. If you use a Mac, you likely need to go to System Preferences and then Security & Privacy to allow ABNER to run.

Retrieve the title and abstract of the publication “Novel ZEB2-BCL11B Fusion Gene Identified by RNA-Sequencing in Acute Myeloid Leukemia with t(2;14)(q22;q32)” from PubMed. Use ABNER to annotate named entities according to both the NLPBA and BioCreative probabilistic models.


  • Do the two models (NLPBA and BioCreative) annotate the same proteins in the text?
  • Does ABNER distinguish between genes and proteins?
  • If so, how can it tell when a name refers to a gene and when it refers to its protein product?
  • Can ABNER identify which gene/protein a given name refers to?
  • Does ABNER identify any named entities other than genes and proteins in this abstract?

Analysis: Automatic recognition of Human Phenotype Ontology terms in text

This afternoon, an article entitled “Automatic concept recognition using the Human Phenotype Ontology reference and test suite corpora” showed up in my RSS reader. It describes a new gold-standard corpus for named entity recognition of Human Phenotype Ontology (HPO) terms. The article also presents results from evaluating three automatic HPO term recognizers, namely NCBO Annotator, OBO Annotator, and Bio-LarK CR.

I thought it would be a fun challenge to see how good an HPO tagger I could produce in one afternoon. Long story short, here is what I did in five hours:

  • Downloaded the HPO ontology file and converted it to dictionary files for the tagger.
  • Generated orthographic variants of terms by changing the order of subterms, converting between Arabic and Roman numerals, and constructing plural forms.
  • Used the tagger to match the resulting dictionary against all of Medline to identify frequently occurring matches.
  • Constructed a list of stop words by manually inspecting all matching strings with more than 25,000 occurrences in PubMed (see the sketch after this list).
  • Tagged the gold-standard corpus using the dictionary and stop-words list and compared the results to the manual reference annotations.
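To give a flavor of the stop-word step, here is a rough sketch of how one might rank matched strings by their Medline frequency and pull out candidates for manual inspection. The file name matches.tsv and the assumption that the matched string sits in column four are purely hypothetical, as the actual tagger output format is not shown here:

# Count how often each matched string occurs across Medline and keep those seen more than 25,000 times
cut -f4 matches.tsv | sort | uniq -c | sort -rn | awk '$1 > 25000' > stopword_candidates.txt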

My tagger produced 1183 annotations on the corpus, 786 of which matched one of the 1933 manual annotations (requiring exact coordinate matches and normalization to the same HPO term). This amounts to a precision of 66%, a recall of 41%, and an F1 score of 50%. This places my system right in the middle between NCBO Annotator (precision=54%, recall=39%, F1=45%) and the best-performing system, Bio-LarK CR (precision=65%, recall=49%, F1=56%).
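For reference, these numbers follow directly from the counts above (written out in LaTeX notation):

\[
\mathrm{precision} = \frac{786}{1183} \approx 0.66, \qquad
\mathrm{recall} = \frac{786}{1933} \approx 0.41, \qquad
F_1 = \frac{2 \cdot 0.66 \cdot 0.41}{0.66 + 0.41} \approx 0.50
\]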

Not too shabby for five hours of work — if I may say so myself — and a good reminder of how much can be achieved in very limited time by taking a simple, pragmatic approach.

Commentary: Does it even matter whether you use Microsoft Word or LaTeX?

Shortly before Christmas, PLOS ONE published a paper comparing the efficiency of using Microsoft Word and LaTeX for document preparation:

An Efficiency Comparison of Document Preparation Systems Used in Academic Research and Development

The choice of an efficient document preparation system is an important decision for any academic researcher. To assist the research community, we report a software usability study in which 40 researchers across different disciplines prepared scholarly texts with either Microsoft Word or LaTeX. The probe texts included simple continuous text, text with tables and subheadings, and complex text with several mathematical equations. We show that LaTeX users were slower than Word users, wrote less text in the same amount of time, and produced more typesetting, orthographical, grammatical, and formatting errors. On most measures, expert LaTeX users performed even worse than novice Word users. LaTeX users, however, more often report enjoying using their respective software. We conclude that even experienced LaTeX users may suffer a loss in productivity when LaTeX is used, relative to other document preparation systems. Individuals, institutions, and journals should carefully consider the ramifications of this finding when choosing document preparation strategies, or requiring them of authors.

This study has been criticized for being rigged in various ways to favor Word over LaTeX, which may or may not be the case. However, in my opinion, the much bigger question is this: does the efficiency of the document preparation system used by a researcher even matter?

Most readers of this blog are probably familiar with performance optimization of software. The crucial first step is to profile the program to identify the parts of the code in which most time is spent. The reason for profiling is that optimizing the other parts of the program will make hardly any difference to the overall runtime.

If we want to optimize the efficiency with which we publish research articles, I think it would be fruitful to adopt the same strategy. The first thing we need to do is thus to identify which parts of the process take the most time. In my experience, what takes by far the most time is the actual writing process, which includes reading related work that should be cited. The time spent on document preparation is insignificant compared to the time spent on authoring the text, and the efficiency of the software you use for this task is thus of little importance.

What, then, can you do to become more efficient at writing? My best advice is to start writing the manuscript as soon as you start on a project. Whenever you perform an analysis, document what you did in the Methods section. Whenever you read a paper that may be of relevance to the project, write a one- or two-sentence summary of it in the Introduction section and cite it. The text will look nothing like the final manuscript, but it will be an infinitely better starting point than that scary blank page.

Update: The BuzzCloud for 2014

It has been almost two years since the last BuzzCloud update, so this one is well overdue:

BuzzCloud 2014

As you can see, two of the biggest buzzwords in the cloud are computational proteomics and precision medicine. I am obviously quite happy to see that, considering that I currently have an open postdoc position in proteome bioinformatics and am involved in work on data mining of electronic health records.