Analysis: Automatic recognition of Human Phenotype Ontology terms in text

This afternoon, an article entitled “Automatic concept recognition using the Human Phenotype Ontology reference and test suite corpora” showed up in my RSS reader. It describes a new gold-standard corpus for named entity recognition of Human Phenotype Ontology (HPO) terms. The article also presents results from evaluating three automatic HPO term recognizers, namely NCBO Annotator, OBO Annotator, and Bio-LarK CR.

I thought it would be a fun challenge to see how good an HPO tagger I could produce in one afternoon. Long story short, here is what I did in five hours:

  • Downloaded the HPO ontology file and converted it to dictionary files for the tagger.
  • Generated orthographic variants of terms by changing the order of subterms, converting between Arabic and Roman numerals, and constructing plural forms (see the sketch after this list).
  • Used the tagger to match the resulting dictionary against the entire Medline to identify frequently occurring matches.
  • Constructed a list of stop words by manually inspecting all matching strings with more than 25,000 occurrences in PubMed.
  • Tagged the gold-standard corpus making use of the dictionary and stop-words list and compared the results to the manual reference annotations.
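
To give a flavor of the variant generation step, here is a minimal sketch in Python. It is not the actual dictionary-building code, and the term names in the examples are made up; it only illustrates the numeral conversion and naive pluralization described above.

    ROMAN = {1: "i", 2: "ii", 3: "iii", 4: "iv", 5: "v",
             6: "vi", 7: "vii", 8: "viii", 9: "ix", 10: "x"}
    ARABIC = {v: str(k) for k, v in ROMAN.items()}

    def numeral_variants(name):
        """Swap Arabic and Roman numerals appearing as separate tokens."""
        variants = set()
        tokens = name.split()
        for i, tok in enumerate(tokens):
            if tok.isdigit() and int(tok) in ROMAN:
                variants.add(" ".join(tokens[:i] + [ROMAN[int(tok)]] + tokens[i + 1:]))
            elif tok in ARABIC:
                variants.add(" ".join(tokens[:i] + [ARABIC[tok]] + tokens[i + 1:]))
        return variants

    def plural_variants(name):
        """Naively pluralize the last word; skip Arabic and Roman numerals."""
        last = name.split()[-1]
        if not last.isalpha() or last in ARABIC:
            return set()
        if last.endswith("y"):
            return {name[:-1] + "ies"}
        if not last.endswith("s"):
            return {name + "s"}
        return set()

    def expand(name):
        """Return the lowercased term name plus its orthographic variants."""
        variants = {name.lower()}
        variants |= numeral_variants(name.lower())
        plurals = set()
        for v in variants:
            plurals |= plural_variants(v)
        return variants | plurals

    print(expand("Hypoplastic toenail"))          # adds 'hypoplastic toenails'
    print(expand("Atrial septal defect type 2"))  # adds '... type ii'

Reordering of subterms works the same way: generate every variant you can think of, and let the Medline frequency counts and the stop-word list weed out the ones that turn out to be too unspecific.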

My tagger produced 1183 annotations on the corpus, 786 of which matched the 1933 manual annotations (requiring exact coordinate matches and correct HPO term normalization). This amounts to a precision of 66%, a recall of 41%, and an F1 score of 50%, placing my system squarely between NCBO Annotator (precision=54%, recall=39%, F1=45%) and the best-performing system, Bio-LarK CR (precision=65%, recall=49%, F1=56%).
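
For reference, the scores follow directly from those three counts; a few lines of Python reproduce them:

    # Sanity check of the reported scores (standard definitions, not code
    # from the tagger itself).
    tagged = 1183    # annotations produced by the tagger
    correct = 786    # of those, matching a manual annotation exactly
    gold = 1933      # manual annotations in the corpus

    precision = correct / tagged                         # 0.66
    recall = correct / gold                              # 0.41
    f1 = 2 * precision * recall / (precision + recall)   # 0.50
    print(f"P={precision:.0%}  R={recall:.0%}  F1={f1:.0%}")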

Not too shabby for five hours of work — if I may say so myself — and a good reminder of how much can be achieved in very limited time by taking a simple, pragmatic approach.

Commentary: Does it even matter whether you use Microsoft Word or LaTeX?

Shortly before Christmas, PLOS ONE published a paper comparing the efficiency of using Microsoft Word and LaTeX for document preparation:

An Efficiency Comparison of Document Preparation Systems Used in Academic Research and Development

The choice of an efficient document preparation system is an important decision for any academic researcher. To assist the research community, we report a software usability study in which 40 researchers across different disciplines prepared scholarly texts with either Microsoft Word or LaTeX. The probe texts included simple continuous text, text with tables and subheadings, and complex text with several mathematical equations. We show that LaTeX users were slower than Word users, wrote less text in the same amount of time, and produced more typesetting, orthographical, grammatical, and formatting errors. On most measures, expert LaTeX users performed even worse than novice Word users. LaTeX users, however, more often report enjoying using their respective software. We conclude that even experienced LaTeX users may suffer a loss in productivity when LaTeX is used, relative to other document preparation systems. Individuals, institutions, and journals should carefully consider the ramifications of this finding when choosing document preparation strategies, or requiring them of authors.

This study has been criticized for being rigged in various ways to favor Word over LaTeX, which may or may not be the case. However, in my opinion, the much bigger question is this: does the efficiency of the document preparation system used by a researcher even matter?

Most readers of this blog are probably familiar with performance optimization of software. The crucial first step is to profile the program to identify the parts of the code in which most of the time is spent. The reason for profiling is that optimizing any other part of the program will make hardly any difference to the overall runtime.
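
To make the analogy concrete, this is roughly what profiling looks like in Python with the standard-library profiler; the two functions are placeholders standing in for an expensive part and a cheap part of a program.

    import cProfile
    import pstats

    def writing():
        # Placeholder for the part that dominates the runtime.
        return sum(i * i for i in range(5_000_000))

    def typesetting():
        # Placeholder for the part that costs next to nothing.
        return sum(range(1_000))

    def publish():
        writing()
        typesetting()

    # Run under the profiler and list the most expensive calls first;
    # the report makes it obvious where optimization effort should go.
    cProfile.run("publish()", "publish.prof")
    pstats.Stats("publish.prof").sort_stats("cumulative").print_stats(5)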

If we want to optimize the efficiency with which we publish research articles, I think it would be fruitful to adopt the same strategy. The first thing we need to do is thus to identify which parts of the process take the most time. In my experience, what takes by far the most time is the actual writing process, which includes reading related work that should be cited. The time spent on document preparation is insignificant compared to the time spent on authoring the text, and the efficiency of the software you use for this task is thus of little importance.

What, then, can you do to become more efficient at writing? My best advice is to start writing the manuscript as soon as you start on a project. Whenever you perform an analysis, document what you did in the Methods section. Whenever you read a paper that may be of relevance to the project, write a one- or two-sentence summary of it in the Introduction section and cite it. The text will look nothing like the final manuscript, but it will be an infinitely better starting point than that scary blank page.

Update: The BuzzCloud for 2014

It has been almost two years since the last BuzzCloud update. So this update is well overdue:

BuzzCloud 2014

As you can see, two of the biggest buzzwords in the cloud are computational proteomics and precision medicine. I am obviously quite happy to see that, considering that I currently have an open postdoc position in proteome bioinformatics and am involved in work on data mining of electronic health records.

Job: Postdoc position in proteome bioinformatics

I currently have an opening for a two-year postdoc position in my group, Cellular Network Biology, at the Novo Nordisk Foundation Center for Protein Research.

The project primarily concerns computational analysis of mass-spectrometry-based proteomics data. This includes developing new, improved methods for analyzing spectra, optimizing analysis protocols, and applying these to specific datasets. The focus will be on improving the identification of post-translationally modified peptides, which arise in cellular signaling processes and through chemical modification, in particular in ancient samples, which may also consist of mixtures of species. The work will involve close collaboration with the Proteomics Program.

The closing date for applications is December 31, 2014. For further details refer to the job advert.

Analysis: Does a publication matter?

This may seem a strange question for someone working in academia to ask – of course a publication matters, especially if it is cited a lot. However, when it comes to publications about web resources, publications and citations in my opinion mainly serve as somewhat odd proxies on my CV for what really matters: the web resources themselves and how much they are used.

Still, one could hope that a publication about a new web resource would make people aware of its existence and thus attract more users. To analyze this, I took a look at the user statistics of our recently developed resource COMPARTMENTS:

COMPARTMENTS user statistics

Before we published a paper about it, the web resource had fewer than 5 unique users per day. Our paper about the resource was accepted in the journal Database on January 26, which increased usage to about 10 unique users on a typical weekday. The spike of 41 unique users in a single day was due to me teaching a course.

So what happened at the end of June that gave a more than 10-fold increase in the number of users from one day to the next? A new version of GeneCards was released with links to COMPARTMENTS. It seems safe to conclude that the peer-reviewed literature is not where most researchers discover new tools.

Commentary: The 99% of scientific publishing

Last week, John P. A. Ioannidis from Stanford University and Kevin W. Boyack and Richard Klavans from SciTech Strategies, Inc. published an interesting analysis of scientific authorships. In the PLOS ONE paper “Estimates of the Continuously Publishing Core in the Scientific Workforce”, they describe a small, influential core of <1% of researchers who publish each and every year. This analysis appears to have caught the attention of many, including Erik Stokstad from Science Magazine, who wrote the short news story “The 1% of scientific publishing”.

You would be excused for thinking that I belong to the 1%. I published my first paper in 1998 and have published at least one paper every single year since then. However, it turns out that the 1% was defined as the researchers who had published at least one paper every year in the period 1996-2011. Since I published my first paper in 1998, I belong to the other 99%, together with everyone else who started their publishing career after 1996 or ended it before 2011.

Although the number 1% is making the headlines, the authors seem to be aware of the issue. Of the 15,153,100 researchers with publications in the period 1996-2011, only 150,608 published in all 16 years; however, the authors estimate that an additional 16,877 scientists published every year in the period 1997-2012. A similar number of continuously publishing scientists will have started their careers in each of the other years from 1998 to 2011. Similarly, an estimated 9,673 researchers ended a long, continuous publishing career in 2010, and presumably similar numbers did so in each of the years 1996-2009. In my opinion, a better estimate is thus that 150,608 + 15*16,877 + 15*9,673 = 548,858 of the 15,153,100 authors have had or will have a 16-year unbroken chain of publications. That amounts to something in the 3-4% range.
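
The arithmetic is easy to check; the per-year counts below are simply the authors' estimates quoted above:

    core = 150_608             # published in all 16 years, 1996-2011
    starts_per_year = 16_877   # estimated chains starting in each later year
    stops_per_year = 9_673     # estimated chains ending in each earlier year
    authors = 15_153_100

    unbroken = core + 15 * starts_per_year + 15 * stops_per_year
    print(unbroken, round(100 * unbroken / authors, 1))   # 548858 3.6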

That number may still not sound impressive; however, this in no way implies that most researchers do not publish on a regular basis. To have a 16-year unbroken chain of publications, one almost has to stay in academia and become a principal investigator. Most people who publish at least one article and subsequently pursue a career in industry or teaching will count towards the 96-97%. And that is no matter how good a job they do, mind you.