The aim of this practical is to introduce you to the concept of web services as well as to a few useful standard command-line tools and how one can pipe data from one tool into another. Web services are, simply put, websites that are meant to be used by computers rather than humans.
Fetching a URL from the command line
The previous exercises used this article to illustrate named entity recognition. If you want to work with it outside the web browser, you will want to change two things: 1) you will probably want to retrieve it in XML format rather than as an HTML web page, and 2) you will want to retrieve it with something other than a web browser, such as the curl command:
curl 'http://journals.plos.org/plosone/article/asset?id=10.1371/journal.pone.0132736.XML'
Submitting text to the tagger
In the NER practical, you used a web service for NER; however, the complexity was hidden from you by the EXTRACT bookmarklet. The bookmarklet works by sending text from your web browser to a remote tagging web service and subsequently displaying the results.
Let us start by looking behind the curtain to see how an EXTRACT popup is produced. When you select the header of the article and click the bookmarklet, your browser retrieves the following page to show in the popup:
http://tagger.jensenlab.org/Extract?document=Novel%20ZEB2-BCL11B%20Fusion%20Gene%20Identified%20by%20RNA-Sequencing%20in%20Acute%20Myeloid%20Leukemia%20with%20t(2;14)(q22;q32)&entity_types=9606%20-26
As you can see, the URL contains data, namely the text to be tagged as well as information on which types of named entities we want to have recognized in the text.
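To make the data inside the URL easier to see, the query string can be split into its parameters with a few standard tools. This is only a minimal sketch: `show_query` is a hypothetical helper name, and it decodes only `%20` (the encoding of a space), leaving any other percent-escapes untouched.

```shell
# The EXTRACT popup URL from above, stored in a variable. Single quotes keep
# the shell from interpreting characters such as & ; ( and ).
url='http://tagger.jensenlab.org/Extract?document=Novel%20ZEB2-BCL11B%20Fusion%20Gene%20Identified%20by%20RNA-Sequencing%20in%20Acute%20Myeloid%20Leukemia%20with%20t(2;14)(q22;q32)&entity_types=9606%20-26'

# Split a URL's query string into one parameter per line.
# Only %20 (space) is decoded here; other percent-escapes are left as-is.
show_query() {
  printf '%s\n' "$1" | cut -d '?' -f 2 | tr '&' '\n' | sed 's/%20/ /g'
}

show_query "$url"
```

This prints two lines, one starting with `document=` (the text to be tagged) and one starting with `entity_types=` (the requested entity types).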
You can retrieve the same information in a tab-delimited format, which is far more useful for computational purposes:
http://tagger.jensenlab.org/GetEntities?document=Novel%20ZEB2-BCL11B%20Fusion%20Gene%20Identified%20by%20RNA-Sequencing%20in%20Acute%20Myeloid%20Leukemia%20with%20t(2;14)(q22;q32)&entity_types=9606%20-26&format=tsv
If you want, you can use the curl command to retrieve the same data from the command line.
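One possible way to do this is sketched below. Instead of pasting the percent-encoded URL, it lets curl build the query string: with `-G`, the values given to `--data-urlencode` are encoded and appended to the URL as a GET request. The function name `get_entities` is a hypothetical helper for this sketch, and the request is assumed to be equivalent to the URL above, although curl may encode a few more characters than the browser did.

```shell
# Ask the tagger for tab-delimited results for the text given as $1.
# With -G, curl appends the --data-urlencode values to the URL as a
# percent-encoded query string instead of sending them as a POST body.
get_entities() {
  curl -G \
    --data-urlencode "document=$1" \
    --data-urlencode 'entity_types=9606 -26' \
    --data-urlencode 'format=tsv' \
    'http://tagger.jensenlab.org/GetEntities'
}
```

Calling `get_entities 'Novel ZEB2-BCL11B Fusion Gene Identified by RNA-Sequencing in Acute Myeloid Leukemia with t(2;14)(q22;q32)'` should then send the request, provided the service is reachable.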
Retrieving a protein network
Bioinformatics web services are not limited to text mining. For example, the STRING database of protein interactions can also be accessed as a web service. The following URL gives you an interaction network for BCL11B as an image:
http://string-db.org/api/image/network?identifier=ENSP00000349723
Modifying it just slightly allows you to retrieve the same interactions in PSI-MI-TAB format:
http://string-db.org/api/psi-mi-tab/interactions?identifier=ENSP00000349723
You can obtain the exact same data from the command line by running this command:
curl 'http://string-db.org/api/psi-mi-tab/interactions?identifier=ENSP00000349723'
Putting it all together
Using pipes, it is possible to put together multiple different web services and local programs to accomplish complex tasks. Here is an example that puts together everything you have learned above:
curl 'http://journals.plos.org/plosone/article/asset?id=10.1371/journal.pone.0132736.XML' | curl --data-urlencode 'document@-' --data-urlencode 'entity_types=9606' --data-urlencode 'format=tsv' 'http://tagger.jensenlab.org/GetEntities' | cut -f3 | sort -u | grep '^ENSP' | curl --data-urlencode 'identifiers@-' --data-urlencode 'limit=0' 'http://string-db.org/api/psi-mi-tab/interactionsList' > string_network.tsv
Let us pick apart this monstrosity of a command and see what it does:
- The first curl command fetches a full-text article from PLOS ONE in XML format
- The second curl command submits this document to the tagger REST web service, to perform named entity recognition of human genes/proteins
- The cut command pulls out only column three from the resulting output, which contains the identifiers of the recognized entities
- The sort -u command sorts these identifiers and removes duplicates
- The grep command keeps only the identifiers that start with “ENSP”, which are the proteins
- The third curl command submits this list of protein identifiers to the STRING database to retrieve a protein interaction network for them in PSI-MI-TAB format
- Finally, we put that network into a file called string_network.tsv on our server.
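The local filtering steps in the middle of the pipe (cut, sort -u and grep) can be tried on their own without contacting any server. The sketch below runs them on a small made-up sample: the three-column layout (name, entity type, identifier) is an assumption based on the description above, the identifiers are placeholders rather than real tagger output, and the function names are hypothetical.

```shell
# Made-up sample in the assumed three-column, tab-delimited layout:
# name, entity type, identifier. The identifiers are placeholders.
sample_tagger_output() {
  printf 'ZEB2\t9606\tENSP00000000001\n'
  printf 'BCL11B\t9606\tENSP00000000002\n'
  printf 'ZEB2\t9606\tENSP00000000001\n'
  printf 'leukemia\t-26\tDOID:0000\n'
}

# Keep column three, sort and drop duplicates, keep only ENSP identifiers.
extract_proteins() {
  cut -f3 | sort -u | grep '^ENSP'
}

sample_tagger_output | extract_proteins
```

Of the four input lines, only the two distinct identifiers starting with ENSP remain: the duplicate is removed by sort -u and the non-protein identifier by grep.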
In other words, with a single pipeline of commands that interacts with three different servers, we manage to retrieve a full-text article, perform named entity recognition of human proteins and obtain protein interactions among them. Note that although this is possible, it is often preferable to store some of the intermediate results in files instead of using pipes.
By slightly modifying the command, it is possible to instead retrieve the network as an image:
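As a rough sketch of the file-based alternative, the same steps can be written so that each one saves its result before the next one starts. The function name `run_in_steps` and the intermediate file names (article.xml, entities.tsv, proteins.txt) are arbitrary choices for this illustration; the URLs and options are the ones used in the pipe above, with `@-` (read from standard input) replaced by `@` followed by a file name.

```shell
# Run the pipeline one step at a time, keeping every intermediate result.
# Each step stops the function if it fails, so later steps never run on
# missing input.
run_in_steps() {
  # Step 1: fetch the article XML and keep a local copy.
  curl 'http://journals.plos.org/plosone/article/asset?id=10.1371/journal.pone.0132736.XML' > article.xml || return 1

  # Step 2: send the saved article to the tagger (human genes/proteins only).
  curl --data-urlencode 'document@article.xml' --data-urlencode 'entity_types=9606' --data-urlencode 'format=tsv' 'http://tagger.jensenlab.org/GetEntities' > entities.tsv || return 1

  # Step 3: extract the unique protein identifiers from column three.
  cut -f3 entities.tsv | sort -u | grep '^ENSP' > proteins.txt || return 1

  # Step 4: ask STRING for the interactions among those proteins.
  curl --data-urlencode 'identifiers@proteins.txt' --data-urlencode 'limit=0' 'http://string-db.org/api/psi-mi-tab/interactionsList' > string_network.tsv
}
```

Running `run_in_steps` leaves four files behind, so you can inspect entities.tsv or proteins.txt, or repeat only the last step, without fetching and tagging the article again.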
curl 'http://journals.plos.org/plosone/article/asset?id=10.1371/journal.pone.0132736.XML' | curl --data-urlencode 'document@-' --data-urlencode 'entity_types=9606' --data-urlencode 'format=tsv' 'http://tagger.jensenlab.org/GetEntities' | cut -f3 | sort -u | grep '^ENSP' | curl --data-urlencode 'identifiers@-' --data-urlencode 'limit=0' --data-urlencode 'network_flavor=confidence' 'http://string-db.org/api/image/networkList' > string_network.png