Tag Archives: text mining

Announcement: Microbiome data interpretation workshop

Manimozhiyan Arumugam from the NNF Center for Basic Metabolic Research and I will be organizing a hands-on workshop on microbiome data interpretation in Copenhagen on October 25-26, 2017.

Together with Luis Pedro Coelho from the European Molecular Biology Laboratory and Evangelos Pafilis from the Hellenic Center for Marine Research, we will cover text mining and other useful methods for interpreting microbiome data in the context of the literature.

To apply, register here before August 1, 2017.

Job: Postdoc position on biomedical text mining

A postdoc position is available in the Cellular Network Biology group at the Novo Nordisk Foundation Center for Protein Research (CPR). The group focuses on network-based analysis of proteins and their posttranslational modifications in the context of cellular signaling and disease. The postdoc will work on development of advanced methods for protein-centric text mining of large corpora, including combining efficient dictionary-based named entity recognition with conditional random fields and classification based on word embeddings. In collaboration with other group members, the resulting improved methods will be integrated into STRING and related databases/tools, and the software will be made available under open licenses.

Candidates must hold a PhD degree (or equivalent) within a relevant discipline and have strong programming experience. The successful candidates will have strong qualifications or experience within several of the following areas:

  • Text mining
  • Bioinformatics / computational biology
  • Machine learning
  • Statistical data mining
  • Biomedical ontologies
  • Programming in Python and/or C++

For further details please see the official job advert.

Announcement: EMBO practical course on computational biology in Heidelberg

June 2016 will likely be a highly productive month for people in my group, since I will not be there much to disturb them. Specifically, I will be involved in running two week-long EMBO practical courses.

One was announced on this blog just two days ago. The other is the likewise long-running course “Computational biology: Genomes to systems”, which this year will take place on June 19–23 at the European Molecular Biology Laboratory in Heidelberg, Germany. The course will cover a wide range of advanced computational biology topics, including protein networks (taught by STRING collaborator Christian von Mering) and biomedical text mining (taught by me).

Please note that the application deadline is less than a month away, namely on January 31.

More details can be found on .

Job: Text mining position at Intomics A/S

At Intomics A/S, we are looking for a text-mining expert to perform contract research and develop tailor-made solutions. The job will primarily involve solving text-mining problems for clients in the pharmaceutical industry.

Note that the application deadline is January 15, just over two weeks from now. For further details, please read the job advert below the fold.

Resource: The DISEASES database on disease–gene associations

2015 has been an exceptionally busy year in my group in terms of publishing databases and other web resources; so busy that I have failed to write blog posts describing several of them.

One of them is the DISEASES database, which is described in detail in an article with the informative, if not very inventive, title “DISEASES: Text mining and data integration of disease–gene associations”.

The DISEASES database can be viewed as a sister resource to the subcellular localization database COMPARTMENTS, which you can read more about in this blog post. Indeed, the two resources share much of their infrastructure, including the web framework, the backend database, and the text-mining pipeline.

The big difference between the two resources is the scope: whereas COMPARTMENTS links proteins to their subcellular localizations, DISEASES links them to the diseases in which they are implicated. To this end we make use of the Disease Ontology, which turned out to be very well suited for text-mining purposes due to its many synonyms for terms. Text mining is the most important source of associations but is complemented by manually curated associations from Genetics Home Reference and UniProtKB as well as GWAS results imported from DistiLD.

To facilitate usage in large-scale analysis and integration into other databases, all data in DISEASES are available for download. Indeed, the text-mined associations from DISEASES are already included in both GeneCards and Pharos.

Exercise: Web services

The aim of this practical is to introduce you to the concept of web services as well as to a few useful standard command-line tools and how one can pipe data from one tool into another. Web services are, simply put, websites that are meant to be used by computers rather than humans.

Fetching a URL from the command line

The previous exercises used this article to illustrate named entity recognition. If you want to work with it outside the web browser, you will want to change two things: 1) you will probably not want to work with an HTML web page, but rather retrieve it in XML format, and 2) you will want to retrieve the article with something other than a web browser:

curl 'http://journals.plos.org/plosone/article/asset?id=10.1371/journal.pone.0132736.XML'

Submitting text to the tagger

In the NER practical, you used a web service for NER; however, the complexity was hidden from you in the EXTRACT bookmarklet. The bookmarklet works by sending text from your web browser to a remote tagging web service and subsequently displaying the results.

Let us start by looking behind the curtain to see how an EXTRACT popup is produced. When you select the header of the article and click the bookmarklet, your browser retrieves the following page to show in the popup:

http://tagger.jensenlab.org/Extract?document=Novel%20ZEB2-BCL11B%20Fusion%20Gene%20Identified%20by%20RNA-Sequencing%20in%20Acute%20Myeloid%20Leukemia%20with%20t(2;14)(q22;q32)&entity_types=9606%20-26

As you can see, the URL contains data, namely the text to be tagged as well as information on which types of named entities we want to have recognized in the text.
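
To see the two pieces of data that the URL carries, one can split its query string into parameters and undo the percent-encoding of the spaces. The following is a minimal sketch using only standard shell tools. The variable names are illustrative, and only %20 is decoded, since that is the only escape used in this particular URL.

```shell
# The EXTRACT popup URL from above, stored in a variable (name is illustrative).
url='http://tagger.jensenlab.org/Extract?document=Novel%20ZEB2-BCL11B%20Fusion%20Gene%20Identified%20by%20RNA-Sequencing%20in%20Acute%20Myeloid%20Leukemia%20with%20t(2;14)(q22;q32)&entity_types=9606%20-26'

# Everything after the first "?" is the query string.
query=${url#*\?}

# Parameters are separated by "&"; put each on its own line and
# turn the percent-encoded spaces (%20) back into real spaces.
params=$(printf '%s\n' "$query" | tr '&' '\n' | sed 's/%20/ /g')

printf '%s\n' "$params"
```

This prints one line per parameter: the document line holds the article title, and the entity_types line holds the requested entity types.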

You can retrieve the same information in a tab-delimited format, which is far more useful for computational purposes:

http://tagger.jensenlab.org/GetEntities?document=Novel%20ZEB2-BCL11B%20Fusion%20Gene%20Identified%20by%20RNA-Sequencing%20in%20Acute%20Myeloid%20Leukemia%20with%20t(2;14)(q22;q32)&entity_types=9606%20-26&format=tsv

If you want, you can use the curl command to retrieve the same data from the command line.
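
One way to do so without percent-encoding the text by hand is to let curl handle the encoding. The sketch below wraps this in a small shell function. The function name tag_text is made up for this example, the parameter values are simply the ones from the URL above, and it assumes the service accepts them in the query string exactly as in that URL.

```shell
# tag_text TEXT [ENTITY_TYPES]
# Sends TEXT to the tagger and prints the tab-delimited result.
# --get makes curl append the --data-urlencode values to the URL as a
# query string (instead of sending a POST), after percent-encoding them.
tag_text() {
  curl --silent --get \
    --data-urlencode "document=$1" \
    --data-urlencode "entity_types=${2:-9606 -26}" \
    --data-urlencode 'format=tsv' \
    'http://tagger.jensenlab.org/GetEntities'
}
```

Calling the function with the article title in single quotes as its only argument should then request the same data as the URL above.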

Retrieving a protein network

Bioinformatics web services are not limited to text mining. For example, the STRING database of protein interactions can also be accessed as a web service. The following URL gives you an interaction network for BCL11B as an image:

http://string-db.org/api/image/network?identifier=ENSP00000349723

Modifying it just slightly allows you to retrieve the same interactions in PSI-MI-TAB format:

http://string-db.org/api/psi-mi-tab/interactions?identifier=ENSP00000349723

You can obtain the exact same data from the command line by running this command:

curl 'http://string-db.org/api/psi-mi-tab/interactions?identifier=ENSP00000349723'
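
The two STRING URLs above differ only in the output format and the request name, so the pattern can be captured in a small helper. This is a sketch based solely on those two URLs. The function name string_url is invented here, and other format or request names may or may not be supported by the service.

```shell
# string_url FORMAT REQUEST IDENTIFIER
# Builds a STRING API URL following the pattern of the two URLs above:
#   http://string-db.org/api/FORMAT/REQUEST?identifier=IDENTIFIER
string_url() {
  printf 'http://string-db.org/api/%s/%s?identifier=%s\n' "$1" "$2" "$3"
}

# The two requests used in this exercise, for BCL11B:
string_url image network ENSP00000349723
string_url psi-mi-tab interactions ENSP00000349723
```

The printed URLs can be opened in a web browser or passed to curl in single quotes, as in the command above.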

Putting it all together

Using pipes, it is possible to put together multiple different web services and local programs to accomplish complex tasks. Here is an example that puts together everything you have learned above:

curl 'http://journals.plos.org/plosone/article/asset?id=10.1371/journal.pone.0132736.XML' | curl --data-urlencode 'document@-' --data-urlencode 'entity_types=9606' --data-urlencode 'format=tsv' 'http://tagger.jensenlab.org/GetEntities' | cut -f3 | sort -u | grep '^ENSP' | curl --data-urlencode 'identifiers@-' --data-urlencode 'limit=0' 'http://string-db.org/api/psi-mi-tab/interactionsList' > string_network.tsv

Let us pick apart this monstrosity of a command and see what it does:

  • The first curl command fetches a full-text article from PLOS ONE in XML format
  • The second curl command submits this document to the tagger REST web service, to perform named entity recognition of human genes/proteins
  • The cut command pulls out only column three from the resulting output, which contains the identifiers of the recognized entities
  • The sort command with the -u option sorts the identifiers and removes duplicates
  • The grep command keeps only the identifiers that start with “ENSP”, which are the proteins
  • The third curl command submits this list of protein identifiers to the STRING database to retrieve a protein interaction network of them in PSI-MI-TAB format
  • Finally, we put that network into a file called string_network.tsv on our server.
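
The three local steps in the middle of the pipeline (cut, sort and grep) can be tried out without contacting any server. The sketch below runs them on a few made-up lines in place of real tagger output. Only the third column matters here; the layout of the first two columns is an assumption, and EXAMPLE:0001 is a placeholder rather than a real identifier.

```shell
# Made-up stand-in for tagger output (tab-delimited, identifier in column 3).
# ENSP00000349723 is the BCL11B identifier used earlier; the last line is a
# placeholder for any identifier that is not a protein.
sample=$(printf '%s\t%s\t%s\n' \
  'BCL11B' '9606' 'ENSP00000349723' \
  'BCL11B' '9606' 'ENSP00000349723' \
  'placeholder' '-26' 'EXAMPLE:0001')

# cut keeps column 3, sort -u removes duplicates, grep keeps ENSP identifiers.
proteins=$(printf '%s\n' "$sample" | cut -f3 | sort -u | grep '^ENSP')

printf '%s\n' "$proteins"
```

Of the three input lines, only a single protein identifier remains: the duplicate is collapsed by sort and the placeholder is dropped by grep.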

In other words, with a single pipe of commands that interacts with three different servers we manage to retrieve a full-text article, perform named entity recognition of human proteins and obtain protein interactions among them. Note that whereas this is possible, it will often be desirable to store some of the intermediate results in files instead of using pipes.
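
As a sketch of that alternative, the same steps can be written as a shell function that saves each intermediate result in a file. The function name fetch_network and the file names are invented for this example. The URLs and parameters are the ones from the command above, with @- (read from standard input) swapped for @ followed by a file name.

```shell
# fetch_network DIR
# Runs the pipeline step by step, keeping every intermediate result in DIR.
fetch_network() {
  dir=$1

  # 1. Full-text article in XML format.
  curl --silent 'http://journals.plos.org/plosone/article/asset?id=10.1371/journal.pone.0132736.XML' \
    > "$dir/article.xml" || return 1

  # 2. Named entity recognition of human genes/proteins.
  curl --silent --data-urlencode "document@$dir/article.xml" \
    --data-urlencode 'entity_types=9606' --data-urlencode 'format=tsv' \
    'http://tagger.jensenlab.org/GetEntities' > "$dir/entities.tsv" || return 1

  # 3. Unique protein identifiers from column three.
  cut -f3 "$dir/entities.tsv" | sort -u | grep '^ENSP' > "$dir/proteins.txt" || return 1

  # 4. Interaction network among those proteins in PSI-MI-TAB format.
  curl --silent --data-urlencode "identifiers@$dir/proteins.txt" \
    --data-urlencode 'limit=0' \
    'http://string-db.org/api/psi-mi-tab/interactionsList' > "$dir/string_network.tsv"
}
```

After a run, the files article.xml, entities.tsv and proteins.txt remain available for inspection if a later step gives unexpected results.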

By slightly modifying the command, it is possible to instead retrieve the network as an image:

curl 'http://journals.plos.org/plosone/article/asset?id=10.1371/journal.pone.0132736.XML' | curl --data-urlencode 'document@-' --data-urlencode 'entity_types=9606' --data-urlencode 'format=tsv' 'http://tagger.jensenlab.org/GetEntities' | cut -f3 | sort -u | grep '^ENSP' | curl --data-urlencode 'identifiers@-' --data-urlencode 'limit=0' --data-urlencode 'network_flavor=confidence' 'http://string-db.org/api/image/networkList' > string_network.png

STRING network

Exercise: The dictionary-based approach to named entity recognition

In this practical, you will be using a newly developed tool, EXTRACT, which shares many parts with the better-known tool Reflect. Unlike ABNER, these tools identify named entities by matching a large dictionary against the text.

For this exercise, please install the pre-release version of EXTRACT2 into your web browser’s bookmark bar. If you have problems, e.g. enabling the bookmark bar in your browser, please check the FAQ.

Open in your web browser both the abstract and the full-text version of the article “Novel ZEB2-BCL11B Fusion Gene Identified by RNA-Sequencing in Acute Myeloid Leukemia with t(2;14)(q22;q32)”. Run EXTRACT on both and inspect the results. You can find more details by either clicking on an annotation or selecting a text region and clicking the bookmarklet again.

Questions

  • Does EXTRACT distinguish between genes and proteins?
  • If so, how can it tell when a name refers to a gene and when it refers to its protein product?
  • Can EXTRACT identify which gene/protein a given name refers to?
  • Does it identify any named entities other than genes and proteins in this abstract?
  • Which, if any, additional types of named entities does EXTRACT find in the full-text article?
  • Where in the full-text article do you find most cases of wrong EXTRACT annotations?