Tag Archives: text mining

Announcement: Microbiome data interpretation workshop


Manimozhiyan Arumugam from the NNF Center for Basic Metabolic Research and I will be organizing a hands-on workshop on microbiome data interpretation in Copenhagen on October 25-26, 2017.

Together with Luis Pedro Coelho from the European Molecular Biology Laboratory and Evangelos Pafilis from the Hellenic Center for Marine Research, we will cover text mining and other useful methods for interpreting microbiome data in the context of the literature.

To apply, register here before August 1, 2017.

Job: Postdoc position on biomedical text mining

A postdoc position is available in the Cellular Network Biology group at the Novo Nordisk Foundation Center for Protein Research (CPR). The group focuses on network-based analysis of proteins and their posttranslational modifications in the context of cellular signaling and disease. The postdoc will develop advanced methods for protein-centric text mining of large corpora, including combining efficient dictionary-based named entity recognition with conditional random fields and classification based on word embeddings. In collaboration with other group members, the resulting improved methods will be integrated into STRING and related databases/tools, and the software will be made available under open licenses.

Candidates must hold a PhD degree (or equivalent) within a relevant discipline and have strong programming experience. The successful candidate will have strong qualifications or experience within several of the following areas:

  • Text mining
  • Bioinformatics / computational biology
  • Machine learning
  • Statistical data mining
  • Biomedical ontologies
  • Programming in Python and/or C++

For further details please see the official job advert.

Announcement: EMBO practical course on computational biology in Heidelberg

June 2016 will likely be a highly productive month for people in my group, since I will not be there much to disturb them. Specifically, I will be involved in running two week-long EMBO practical courses.

One was announced on this blog just two days ago. The other is the similarly long-running course “Computational biology: Genomes to systems”, which this year will take place on June 19–23 at the European Molecular Biology Laboratory in Heidelberg, Germany. The course will cover a wide range of advanced computational biology topics, including protein networks (taught by STRING collaborator Christian von Mering) and biomedical text mining (taught by me).

Please note that the application deadline is less than a month away, namely on January 31.

More details can be found on .

Job: Text mining position at Intomics A/S

At Intomics A/S, we are looking for a text-mining expert to perform contract research and develop tailor-made solutions. The job will primarily involve solving text-mining problems for clients in the pharmaceutical industry.

Note that the application deadline is January 15, just over two weeks from now. For further details, please read the job advert below the fold.


Resource: The DISEASES database on disease–gene associations

2015 has been an exceptionally busy year in my group in terms of publishing databases and other web resources; so busy that I have failed to write blog posts describing several of them.

One of them is the DISEASES database, which is described in detail in an article with the informative, if not very inventive, title “DISEASES: Text mining and data integration of disease–gene associations”.

The DISEASES database can be viewed as a sister resource to the subcellular localization database COMPARTMENTS, which you can read more about in this blog post. Indeed, the two resources share much of their infrastructure, including the web framework, the backend database, and the text-mining pipeline.

The big difference between the two resources is the scope: whereas COMPARTMENTS links proteins to their subcellular localizations, DISEASES links them to the diseases in which they are implicated. To this end we make use of the Disease Ontology, which turned out to be very well suited for text-mining purposes due to its many synonyms for terms. Text mining is the most important source of associations but is complemented by manually curated associations from Genetics Home Reference and UniProtKB as well as GWAS results imported from DistiLD.

To facilitate usage in large-scale analysis and integration into other databases, all data in DISEASES are available for download. Indeed, the text-mined associations from DISEASES are already included in both GeneCards and Pharos.

Exercise: Web services

The aim of this practical is to introduce you to the concept of web services as well as to a few useful standard command-line tools and how one can pipe data from one tool into another. Web services are, simply put, websites that are meant to be used by computers rather than humans.

Fetching a URL from the command line

The previous exercises used this article to illustrate named entity recognition. If you want to work with it outside the web browser, you will want to change two things: 1) you will probably not want to work with an HTML web page, but rather retrieve it in XML format, and 2) you will want to retrieve the article with something other than a web browser:

curl 'http://journals.plos.org/plosone/article/asset?id=10.1371/journal.pone.0132736.XML'

Submitting text to the tagger

In the NER practical, you used a web service for NER; however, the complexity was hidden from you in the EXTRACT bookmarklet. The bookmarklet works by sending text from your web browser to a remote tagging web service and subsequently displaying the results.

Let us start by looking behind the curtain to see how an EXTRACT popup is produced. When you select the header of the article and click the bookmarklet, your browser retrieves the following page to show in the popup:


As you can see, the URL contains data, namely the text to be tagged as well as information on which types of named entities we want to have recognized in the text.

You can retrieve the same information in a tab-delimited format, which is far more useful for computational purposes:


If you want, you can use the curl command to retrieve the same data from the command line.

Retrieving a protein network

Bioinformatics web services are not limited to text mining. For example, the STRING database of protein interactions can also be accessed as a web service. The following URL gives you an interaction network for BCL11B as an image:


Modifying it just slightly allows you to retrieve the same interactions in PSI-MI-TAB format:

You can obtain exactly the same data on the command line by running this command:

curl 'http://string-db.org/api/psi-mi-tab/interactions?identifier=ENSP00000349723'

Putting it all together

Using pipes, it is possible to put together multiple different web services and local programs to accomplish complex tasks. Here is an example that puts together everything you have learned above:

curl 'http://journals.plos.org/plosone/article/asset?id=10.1371/journal.pone.0132736.XML' | curl --data-urlencode 'document@-' --data-urlencode 'entity_types=9606' --data-urlencode 'format=tsv' 'http://tagger.jensenlab.org/GetEntities' | cut -f3 | sort -u | grep '^ENSP' | curl --data-urlencode 'identifiers@-' --data-urlencode 'limit=0' 'http://string-db.org/api/psi-mi-tab/interactionsList' > string_network.tsv

Let us pick apart this monstrosity of a command and see what it does:

  • The first curl command fetches a full-text article from PLOS ONE in XML format
  • The second curl command submits this document to the tagger REST web service, to perform named entity recognition of human genes/proteins
  • The cut command pulls out only column three from the resulting output, which contains the identifiers of the recognized entities
  • The sort -u command sorts these identifiers and removes duplicates
  • The grep command keeps only the identifiers that start with “ENSP”, which are the proteins
  • The third curl command submits this list of protein identifiers to the STRING database to retrieve a protein interaction network of them in PSI-MI-TAB format
  • Finally, we put that network into a file called string_network.tsv on our server.

In other words, with a single pipe of commands that interacts with three different servers we manage to retrieve a full-text article, perform named entity recognition of human proteins and obtain protein interactions among them. Note that whereas this is possible, it will often be desirable to store some of the intermediate results in files instead of using pipes.
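As a hedged sketch of the intermediate-file variant, the same steps can be split into small shell functions that each write a file. The function names and the file names article.xml, entities.tsv, and proteins.txt are illustrative assumptions; the URLs and curl options are copied from the piped command above. Nothing contacts a server until run_pipeline is called.

```shell
# Sketch: the piped command split into steps that keep intermediate files.
# Function and file names are made up for illustration; URLs and curl
# options are the ones used in the piped command in the text.

fetch_article() {      # step 1: download the article in XML format
  curl 'http://journals.plos.org/plosone/article/asset?id=10.1371/journal.pone.0132736.XML' > article.xml
}

tag_article() {        # step 2: NER of human genes/proteins via the tagger
  curl --data-urlencode 'document@-' \
       --data-urlencode 'entity_types=9606' \
       --data-urlencode 'format=tsv' \
       'http://tagger.jensenlab.org/GetEntities' < article.xml > entities.tsv
}

extract_proteins() {   # step 3: unique ENSP identifiers from column 3 of a file
  cut -f3 "$1" | sort -u | grep '^ENSP'
}

fetch_network() {      # step 4: protein network in PSI-MI-TAB format from STRING
  curl --data-urlencode 'identifiers@-' \
       --data-urlencode 'limit=0' \
       'http://string-db.org/api/psi-mi-tab/interactionsList' < proteins.txt > string_network.tsv
}

run_pipeline() {       # run all steps, stopping at the first failure
  fetch_article &&
  tag_article &&
  extract_proteins entities.tsv > proteins.txt &&
  fetch_network
}
```

Calling run_pipeline would leave article.xml, entities.tsv, and proteins.txt on disk, so a failed or surprising step can be inspected and rerun without fetching everything again.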

By slightly modifying the pipeline, it is possible to retrieve the network as an image instead:

curl 'http://journals.plos.org/plosone/article/asset?id=10.1371/journal.pone.0132736.XML' | curl --data-urlencode 'document@-' --data-urlencode 'entity_types=9606' --data-urlencode 'format=tsv' 'http://tagger.jensenlab.org/GetEntities' | cut -f3 | sort -u | grep '^ENSP' | curl --data-urlencode 'identifiers@-' --data-urlencode 'limit=0' --data-urlencode 'network_flavor=confidence' 'http://string-db.org/api/image/networkList' > string_network.png

STRING network

Exercise: The dictionary-based approach to named entity recognition

In this practical, you will be using a newly developed tool EXTRACT, which shares many parts with the better known tool Reflect. Unlike ABNER, these tools identify named entities by matching a large dictionary against the text.

For this exercise, please install the pre-release version of EXTRACT2 into your web browser’s bookmark bar. If you have problems, e.g. enabling the bookmark bar in your browser, please check the FAQ.

Open in your web browser both the abstract and the full-text version of the article “Novel ZEB2-BCL11B Fusion Gene Identified by RNA-Sequencing in Acute Myeloid Leukemia with t(2;14)(q22;q32)”. Run EXTRACT on both and inspect the results. You can find more details by either clicking on an annotation or selecting a text region and clicking the bookmarklet again.


  • Does EXTRACT distinguish between genes and proteins?
  • If so, how can it tell when a name refers to a gene and when it refers to its protein product?
  • Can EXTRACT identify which gene/protein a given name refers to?
  • Does it identify any other named entities than genes and proteins in this abstract?
  • Which, if any, additional types of named entities does EXTRACT find in the full-text article?
  • Where in the full-text article do you find most cases of wrong EXTRACT annotations?

Exercise: The machine-learning approach to named entity recognition

In this practical, you will be using the well-known tool ABNER, which relies on a statistical machine-learning method called conditional random fields (CRFs) to recognize entities in text.

To install ABNER, simply download abner.jar from the website. To run it, either double-click on the jar file or type java -jar abner.jar on the command line. As ABNER is a Java program, it requires that you have Java installed on your computer. If you do not, you also need to download the Java installer and run it. If you use a Mac, you likely need to go to System Preferences and then Security & Privacy to allow ABNER to be run.

Retrieve the title and abstract of the publication “Novel ZEB2-BCL11B Fusion Gene Identified by RNA-Sequencing in Acute Myeloid Leukemia with t(2;14)(q22;q32)” from PubMed. Use ABNER to annotate named entities according to both the NLPBA and BioCreative probabilistic models.


  • Do the two models (NLPBA and BioCreative) annotate the same proteins in the text?
  • Does ABNER distinguish between genes and proteins?
  • If so, how can it tell when a name refers to a gene and when it refers to its protein product?
  • Can ABNER identify which gene/protein a given name refers to?
  • Does ABNER identify any other named entities than genes and proteins in this abstract?

Analysis: Automatic recognition of Human Phenotype Ontology terms in text

This afternoon, an article entitled “Automatic concept recognition using the Human Phenotype Ontology reference and test suite corpora” showed up in my RSS reader. It describes a new gold-standard corpus for named entity recognition of Human Phenotype Ontology (HPO) terms. The article also presents results from evaluating three automatic HPO term recognizers, namely NCBO Annotator, OBO Annotator and Bio-LarK CR.

I thought it would be a fun challenge to see how good an HPO tagger I could produce in one afternoon. Long story short, here is what I did in five hours:

  • Downloaded the HPO ontology file and converted it to dictionary files for tagger.
  • Generated orthographic variants of terms by changing the order of subterms, converting between Arabic and Roman numerals, and constructing plural forms.
  • Used the tagger to match the resulting dictionary against the entire Medline to identify frequently occurring matches.
  • Constructed a list of stop words by manually inspecting all matching strings with more than 25,000 occurrences in PubMed.
  • Tagged the gold-standard corpus making use of the dictionary and stop-words list and compared the results to the manual reference annotations.

My tagger produced 1183 annotations on the corpus, 786 of which correspond to the 1933 human annotations (requiring exact coordinate matches and HPO term normalization). This amounts to a precision of 66%, a recall of 41%, and an F1 score of 50%. This places my system right in the middle between NCBO Annotator (precision=54%, recall=39%, F1=45%) and the best performing system, Bio-LarK CR (precision=65%, recall=49%, F1=56%).
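As a quick check, the three scores can be recomputed from nothing but the counts given above. The variable names are chosen here for illustration.

```shell
# Counts taken from the paragraph above.
tp=786      # tagger annotations that match a human annotation
pred=1183   # all annotations produced by the tagger
gold=1933   # all human annotations in the corpus

scores=$(awk -v tp="$tp" -v pred="$pred" -v gold="$gold" 'BEGIN {
  p  = tp / pred                 # precision
  r  = tp / gold                 # recall
  f1 = 2 * p * r / (p + r)       # harmonic mean of precision and recall
  printf "precision=%.0f%% recall=%.0f%% F1=%.0f%%", 100 * p, 100 * r, 100 * f1
}')
echo "$scores"   # prints: precision=66% recall=41% F1=50%
```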

Not too shabby for five hours of work — if I may say so myself — and a good reminder of how much can be achieved in very limited time by taking a simple, pragmatic approach.

Exercise: Hands-on text mining of protein-disease associations


In this exercise you will be processing two sets of documents on prostate cancer and schizophrenia, respectively. You will use a program called tagdir along with a pre-made dictionary of human protein names to identify proteins relevant for each disease and to find a protein that links the two diseases to each other.

The software and data files required for this exercise are available as a tarball. You will need to know how to compile a C++ program to run this exercise; also, the C++ program requires the Boost C++ Libraries.

How to run the tagger

To execute the tagger program, you need to run a command with the following format:

tagdir dictionary_file blacklist_file documents_directory > matches_file

The argument dictionary_file should be the name of a tab-delimited file containing the dictionary of names to be tagged in the text. For this exercise you will always want to use the file human_proteins.tsv.

The argument blacklist_file should be the name of a tab-delimited file containing the exact variants of names that should not be tagged despite the name being in the dictionary. To tag every variant of every name, you can simply use an empty file (empty.tsv), but you will also be making your own file later as part of the exercise.

The argument documents_directory should be the name of a directory containing text files with the documents to be processed. For this exercise two such directories will be used, namely prostate_cancer_documents and schizophrenia_documents.

Finally you give a name for where to put the tab-delimited output from the tagger (matches_file). You can call these files whatever you like, but descriptive names will make your life easier, for example, prostate_cancer_matches.tsv.

When running the tagger on one of the two directories of documents, you should expect the run to take approximately 2 minutes.
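A minimal sketch of running the tagger on both document sets in one go. It assumes that tagdir has been compiled and can be found on your PATH, and that you are in the directory holding the files named above; the helper matches_file_for is a made-up name.

```shell
# Derive an output file name from a documents directory name,
# e.g. prostate_cancer_documents -> prostate_cancer_matches.tsv
matches_file_for() {
  printf '%s_matches.tsv\n' "${1%_documents}"
}

for dir in prostate_cancer_documents schizophrenia_documents; do
  out=$(matches_file_for "$dir")
  if command -v tagdir >/dev/null 2>&1 && [ -d "$dir" ]; then
    # Same argument order as in the text: dictionary, blacklist, documents.
    tagdir human_proteins.tsv empty.tsv "$dir" > "$out"
  else
    echo "skipping $dir: tagdir or the directory was not found" >&2
  fi
done
```

If the compiled program sits in the current directory rather than on your PATH, you would call it as ./tagdir instead.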

File formats

The tagger program requires the input to be in very specific formats. The dictionary_file must be a tab-delimited file that looks as follows:

263094 ABCA7
285238 ABCC3
354995 ABCG1
268129 ABHD2
233710 ACADL

The first column must be a number that uniquely identifies the protein in question. In human_proteins.tsv the numbers used are the ENSP identifiers from Ensembl, with the letters ENSP and any leading zeros removed. The second column is a name for the protein. If a protein has multiple names, there will be several lines, each listing the same number (i.e. the same protein) but a different name for it.
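Two small helpers can make the dictionary easier to work with; both are sketches with made-up function names. The first restores a full identifier from a dictionary number, assuming that ENSP identifiers consist of the letters ENSP followed by 11 digits (as in ENSP00000262120). The second lists every name the dictionary holds for one protein.

```shell
# Turn a dictionary number back into a full Ensembl protein identifier:
# "ENSP" followed by the number zero-padded to 11 digits.
to_ensp() {
  printf 'ENSP%011d\n' "$1"
}

# List all names that a dictionary file holds for one protein number.
names_for() {   # usage: names_for NUMBER DICTIONARY_FILE
  awk -F '\t' -v id="$1" '$1 == id { print $2 }' "$2"
}

to_ensp 262120   # prints ENSP00000262120
```

For example, names_for 263094 human_proteins.tsv would print one name per line for that protein.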

The blacklist_file must similarly be a tab-delimited file in which each line specifies a specific variant of a name and whether it should be blocked or not:

ACP t
ACS t
act t
ACTH f
activin f

The letter “t” in the second column means TRUE (i.e. that the variant should be blocked) and “f” means FALSE (i.e. that it should not be blocked). The file above would thus block ACP, ACS, and act but not ACTH and activin. Because variants are by default not blocked, you in principle only need to add lines with “t”; however, the lines with “f” can be very useful for keeping track of which variants you have already looked at and actively decided not to block, as opposed to variants that are not blocked because you have not looked at them.
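To see at a glance which variants a blacklist actually blocks, you can filter on the second column. This is a small sketch; the function name blocked_variants is made up, and the demonstration file simply contains the five variants named in the text.

```shell
# Print only the variants marked "t" (blocked) in a blacklist file.
blocked_variants() {   # usage: blocked_variants BLACKLIST_FILE
  awk -F '\t' '$2 == "t" { print $1 }' "$1"
}

# Demonstration on a temporary file with the variants named in the text.
demo=$(mktemp)
printf 'ACP\tt\nACS\tt\nact\tt\nACTH\tf\nactivin\tf\n' > "$demo"
blocked_variants "$demo"   # prints ACP, ACS and act, one per line
rm -f "$demo"
```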

The output of the tagger is also tab-delimited, each line specifying a possible meaning of a match in a document:

7478532.txt 254 256 TSG 262120
7478532.txt 254 256 TSG 416324
7478532.txt 595 597 LPL 309757
7478532.txt 595 597 LPL 315757
7478532.txt 658 661 NEFL 221169
7478532.txt 736 738 LPL 309757
7478532.txt 736 738 LPL 315757

The first column gives the name of the file (the numbers are in our case the PubMed IDs of the abstracts). The next two columns specify from which character to which character there is a match. The fourth column shows which string appeared in that place in the document. The fifth column specifies which protein this could be (i.e. the Ensembl protein number). The first line in the output above thus means that the document 7478532.txt from character 254 to character 256 says TSG, which could mean the protein ENSP00000262120. The second line shows that it alternatively could be ENSP00000416324.
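Because an ambiguous name produces one output line per candidate protein, it can be useful to list the matches that have more than one possible meaning. The following is only a sketch with a made-up function name; it assumes that each combination of match and protein appears on exactly one line.

```shell
# Report matches whose text span maps to more than one protein.
# Output columns: file, start, end, matched string, number of candidates.
ambiguous_matches() {   # usage: ambiguous_matches MATCHES_FILE
  awk -F '\t' -v OFS='\t' '
    { key = $1 OFS $2 OFS $3 OFS $4; count[key]++ }
    END { for (key in count) if (count[key] > 1) print key, count[key] }
  ' "$1" | sort
}

# Demonstration on the seven example lines shown above.
demo_matches=$(mktemp)
printf '%s\t%s\t%s\t%s\t%s\n' \
  7478532.txt 254 256 TSG  262120 \
  7478532.txt 254 256 TSG  416324 \
  7478532.txt 595 597 LPL  309757 \
  7478532.txt 595 597 LPL  315757 \
  7478532.txt 658 661 NEFL 221169 \
  7478532.txt 736 738 LPL  309757 \
  7478532.txt 736 738 LPL  315757 > "$demo_matches"
ambiguous_matches "$demo_matches"   # lists the TSG match and both LPL matches
rm -f "$demo_matches"
```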

Making a blacklist

As you will see if you run tagdir using an empty blacklist_file, a simple dictionary matching approach leads to very many wrong matches unless it is complemented by a good list of name variants to be blocked.

To identify potential name variants that you might want to put on the black list, you will want to count the number of occurrences of each and every name variant in an entire corpus of text. A very helpful UNIX command for doing this based on a matches_file (produced by tagdir) is:

cut -f 4 matches_file | sort | uniq -c | sort -nr | less

It first cuts out the exact name variants in column 4 of your matches file (cut -f 4), sorts them so that multiple copies of the same variant end up right after each other (sort), counts identical adjacent lines (uniq -c), sorts that list in reverse numerical order so that the highest counts come first (sort -nr), and shows the result in a pager so that you can scroll through it (less).
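Here is the same pipeline applied to the seven example lines of tagger output from the section on file formats, with less left out so that the result is simply printed. The temporary file and the variable names are only for illustration.

```shell
# Recreate the seven example output lines in a temporary matches file.
matches_file=$(mktemp)
printf '%s\t%s\t%s\t%s\t%s\n' \
  7478532.txt 254 256 TSG  262120 \
  7478532.txt 254 256 TSG  416324 \
  7478532.txt 595 597 LPL  309757 \
  7478532.txt 595 597 LPL  315757 \
  7478532.txt 658 661 NEFL 221169 \
  7478532.txt 736 738 LPL  309757 \
  7478532.txt 736 738 LPL  315757 > "$matches_file"

# Count how often each name variant (column 4) occurs, highest count first.
counts=$(cut -f 4 "$matches_file" | sort | uniq -c | sort -nr)
printf '%s\n' "$counts"   # LPL is counted 4 times, TSG twice, NEFL once
rm -f "$matches_file"
```

Note that a name with several candidate proteins is counted once per candidate: LPL occurs at only two places in the document but gets a count of 4.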

The next step is to simply go through the most frequently occurring name variants and manually decide which ones are indeed correct protein names that just occur very frequently, and which ones occur very frequently because they mean something entirely different than the protein. The latter should be put on your blacklist.

Once you have produced a blacklist, you can rerun the tagdir command and you will get much less output because vast numbers of false positive matches have been filtered away. You will find that adding just the worst few offenders to the blacklist helps a lot; however, as you go further down the list the effort spent on inspecting more names gives diminishing returns.

Finding proteins associated to a disease

Making a very good blacklist would take too long for this exercise. For the following exercise it is thus recommended that you use the provided file blacklist_10.tsv.

The goal is now to produce a top-20 list of proteins that are most often mentioned in papers about prostate cancer. To achieve this, you need to rerun the tagging of human proteins in the prostate cancer documents, making use of the above-mentioned blacklist. To extract a top-20, you can use a command similar to the one used to create the list of most commonly appearing name variants.
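One possible way to do this, sketched under the assumption that counting output lines per protein (column 5) is an acceptable measure of how often a protein is mentioned. The function name top_proteins and the file name in the usage comment are illustrative.

```shell
# Rank proteins (column 5 of the tagger output) by their number of matches
# and keep the N most frequent ones (20 by default).
top_proteins() {   # usage: top_proteins MATCHES_FILE [N]
  cut -f 5 "$1" | sort | uniq -c | sort -nr | head -n "${2:-20}"
}

# Possible use once tagdir has been rerun with blacklist_10.tsv:
#   top_proteins prostate_cancer_matches.tsv
```

Each output line holds a count followed by a protein number. Ambiguous names contribute to the counts of all their candidate proteins, so the ranking is only a rough guide.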

Find a protein linked to both diseases

Starting from the top-20 list of prostate cancer proteins, find one or more proteins that are also frequently mentioned in papers on schizophrenia. To do so, you will also need to tag the set of documents on schizophrenia, count the number of mentions of every protein, and check the count of every protein on the top-20 list for prostate cancer in the schizophrenia set.
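The lookup described here could be sketched as follows. The function name shared_proteins and the file names in the usage comment are assumptions, and the approach rereads the second file once per protein, which is slow but easy to follow.

```shell
# For each of the N most-matched proteins in the first matches file, print
# "protein<TAB>count in first file<TAB>count in second file".
shared_proteins() {   # usage: shared_proteins MATCHES_A MATCHES_B [N]
  cut -f 5 "$1" | sort | uniq -c | sort -nr | head -n "${3:-20}" |
  while read -r count protein; do
    # Count lines in the second file whose protein column is exactly this one.
    other=$(cut -f 5 "$2" | grep -c -x -F "$protein" || true)
    printf '%s\t%s\t%s\n' "$protein" "$count" "$other"
  done
}

# Possible use with the two matches files produced by tagdir:
#   shared_proteins prostate_cancer_matches.tsv schizophrenia_matches.tsv
```

Proteins with a high number in the third column are candidates for linking the two diseases, although you should still check the underlying matches by hand.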