Category Archives: Analysis

Analysis: Automatic recognition of Human Phenotype Ontology terms in text

This afternoon, an article entitled “Automatic concept recognition using the Human Phenotype Ontology reference and test suite corpora” showed up in my RSS reader. It describes a new gold-standard corpus for named entity recognition of Human Phenotype Ontology (HPO) terms. The article also presents results from evaluating three automatic HPO term recognizers, namely NCBO Annotator, OBO Annotator and Bio-LarK CR.

I thought it would be a fun challenge to see how good an HPO tagger I could produce in one afternoon. Long story short, here is what I did in five hours:

  • Downloaded the HPO ontology file and converted it to dictionary files for the tagger.
  • Generated orthographic variants of each term by changing the order of subterms, converting between Arabic and Roman numerals, and constructing plural forms.
  • Used the tagger to match the resulting dictionary against all of Medline to identify frequently occurring matches.
  • Constructed a list of stop words by manually inspecting all matching strings with more than 25,000 occurrences in PubMed.
  • Tagged the gold-standard corpus using the dictionary and stop-word list and compared the results to the manual reference annotations.
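To give a feel for the variant-generation step, here is a minimal Python sketch. It is not the code I actually ran: the function names, the tiny numeral table, and the very naive plural rule are all assumptions made for illustration, and a real dictionary would need far more care.

```python
# Tiny illustrative numeral table (an assumption); a real one needs more entries.
ARABIC_TO_ROMAN = {"1": "I", "2": "II", "3": "III", "4": "IV", "5": "V"}
ROMAN_TO_ARABIC = {roman: arabic for arabic, roman in ARABIC_TO_ROMAN.items()}


def reorder_subterms(name):
    """Turn 'A, B' into 'B A'; names without exactly one comma yield nothing."""
    parts = [part.strip() for part in name.split(",")]
    if len(parts) == 2 and all(parts):
        return {f"{parts[1]} {parts[0]}"}
    return set()


def swap_numerals(name):
    """Convert each standalone Arabic or Roman numeral token to the other form."""
    tokens = name.split()
    variants = set()
    for i, token in enumerate(tokens):
        swapped = ARABIC_TO_ROMAN.get(token) or ROMAN_TO_ARABIC.get(token)
        if swapped:
            variants.add(" ".join(tokens[:i] + [swapped] + tokens[i + 1:]))
    return variants


def naive_plural(name):
    """Very rough English plural of the last word (hypothetical rule)."""
    tokens = name.split()
    if not tokens:
        return set()
    last = tokens[-1]
    if not last.isalpha() or last in ROMAN_TO_ARABIC or last.endswith("s"):
        return set()
    if last.endswith("y") and len(last) > 1 and last[-2] not in "aeiou":
        plural = last[:-1] + "ies"
    else:
        plural = last + "s"
    return {" ".join(tokens[:-1] + [plural])}


def expand(name, stop_words=frozenset()):
    """Return the name plus its variants, minus anything on the stop-word list."""
    variants = {name} | reorder_subterms(name)
    for variant in list(variants):
        variants |= swap_numerals(variant)
    for variant in list(variants):
        variants |= naive_plural(variant)
    blocked = {word.lower() for word in stop_words}
    return {variant for variant in variants if variant.lower() not in blocked}


if __name__ == "__main__":
    for example in ("Renal cyst", "Type 2 diabetes", "Hypoplasia, renal"):
        print(example, "->", sorted(expand(example)))
```

The stop-word filter is applied last so that a single blocked string removes that variant no matter which rule produced it.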

My tagger produced 1183 annotations on the corpus, 786 of which correspond to the 1933 human annotations (requiring exact coordinate matches and HPO term normalization). This amounts to a precision of 66%, a recall of 41%, and an F1 score of 50%. This places my system right in the middle between NCBO Annotator (precision=54%, recall=39%, F1=45%) and the best performing system Bio-LarK CR (65% precision, 49% recall, F1=56%).
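For anyone who wants to check the arithmetic, the scores follow directly from the three counts above. This is a small sketch; the function name is my own choice for illustration.

```python
def precision_recall_f1(n_predicted, n_correct, n_reference):
    """Compute precision, recall and F1 from raw annotation counts."""
    precision = n_correct / n_predicted
    recall = n_correct / n_reference
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # 1183 tagger annotations, 786 of them correct, 1933 human annotations.
    p, r, f = precision_recall_f1(1183, 786, 1933)
    print(f"precision={p:.0%} recall={r:.0%} F1={f:.0%}")
    # prints: precision=66% recall=41% F1=50%
```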

Not too shabby for five hours of work — if I may say so myself — and a good reminder of how much can be achieved in very limited time by taking a simple, pragmatic approach.

Analysis: Does a publication matter?

This may seem a strange question for someone working in academia to ask – of course a publication matters, especially if it is cited a lot. However, when it comes to publications about web resources, publications and citations in my opinion mainly serve as somewhat odd proxies on my CV for what really matters: the web resources themselves and how much they are used.

Still, one could hope that a publication about a new web resource would make people aware of its existence and thus attract more users. To analyze this, I took a look at the user statistics of our recently developed resource COMPARTMENTS:

COMPARTMENTS user statistics

Before we published a paper about it, the web resource had fewer than 5 unique users per day. Our paper about the resource was accepted by the journal Database on January 26, which increased the usage to about 10 unique users on a typical weekday. The spike of 41 unique users in a single day was due to me teaching a course.

So what happened at the end of June that gave a more than 10-fold increase in the number of users from one day to the next? A new version of GeneCards was released with links to COMPARTMENTS. It seems safe to conclude that the peer-reviewed literature is not where most researchers discover new tools.

Analysis: Science used to be simpler

I guess most people have a feeling that life used to be simpler in the past. The other day it occurred to me that we researchers very often talk about how advanced our methods are, although simple methods are in many cases preferable.

So this morning I resorted to my usual strategy for analyzing such things, namely counting in Medline. More specifically, I calculated for each year the percentage of publication titles that contain the words “simple” and “advanced”, respectively. In the plot below, the dots show the values for each year and the lines show five-year running averages thereof:


As can be clearly seen, life as a researcher was indeed simpler in the 50s and 60s.
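The counting itself is simple enough to sketch in a few lines of Python. The input format (a list of year and title pairs) and the function names are assumptions for illustration, and I assume here that the five-year window is centered on each year and simply shrinks at the ends of the series.

```python
import re
from collections import defaultdict


def yearly_percentage(records, word):
    """Percentage of titles per year that contain `word` as a whole word."""
    pattern = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
    total = defaultdict(int)
    hits = defaultdict(int)
    for year, title in records:
        total[year] += 1
        if pattern.search(title):
            hits[year] += 1
    return {year: 100.0 * hits[year] / total[year] for year in sorted(total)}


def running_average(series, window=5):
    """Centered running average over `window` years; missing years are skipped."""
    half = window // 2
    smoothed = {}
    for year in series:
        values = [series[y] for y in range(year - half, year + half + 1) if y in series]
        smoothed[year] = sum(values) / len(values)
    return smoothed


if __name__ == "__main__":
    # Made-up titles, only to show the call pattern.
    records = [
        (1950, "A simple method for staining tissue"),
        (1950, "Studies on renal function"),
        (1951, "Advanced techniques in microscopy"),
    ]
    simple = yearly_percentage(records, "simple")
    print(simple)
    print(running_average(simple))
```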

Analysis: When will your BMC paper be typeset?

One month ago, people from Jan Gorodkin’s group and my own group published a paper in BMC Systems Biology. This happened after a very long process during which we were very close to withdrawing the manuscript and sending it elsewhere due to inaction by the editor. In the end it got accepted, but even now there is only the provisional PDF available. The paper has still not been typeset.

Typesetting is one of very few things an online-only journal does to add value. Publishers often claim to add value by organizing peer review, but if you think about it, they pass the manuscript to an unpaid editor who subsequently recruits unpaid referees to review it. Careful copyediting and typesetting of the final, accepted manuscript is thus in my view the only hands-on work that most journals do for their considerable article-processing charge. Neil Saunders’ recent blog post “We really don’t care what statistical method you used” illustrates well the care with which copy editing is done. We are thus down to only one service actually done by the publishers: typesetting the manuscript to produce XML, HTML, and PDF versions of it.

You would thus hope that typesetting at least happens promptly once a manuscript is accepted and the authors have paid. However, I have been frustrated to find that both my own manuscript in BMC Systems Biology and many manuscripts that I have downloaded from BMC journals exist only as provisional PDFs even months after publication. I thus decided to quantify the extent to which typesetting of papers is delayed. To this end, I considered all papers published in each journal during the months May-July this year and calculated what percentage of them had been typeset by now.

Starting with BMC Systems Biology, here are the numbers: 7 of 26 papers from May, 3 of 24 papers from June, and 1 of 15 papers from July have been typeset to date. The numbers for BMC Bioinformatics turned out to be just as disappointing: 6 of 52, 7 of 36 and 1 of 32 papers from May, June, and July have been typeset so far. And BMC Genomics confirmed the trend: 17 of 56, 14 of 74, and 11 of 67 are the numbers for May, June, and July. This adds up to only 16.9%, 11.7%, and 21.3% of papers from May-July having been typeset by BMC Systems Biology, BMC Bioinformatics, and BMC Genomics, respectively.
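The pooled percentages are just the summed typeset counts divided by the summed paper counts for the three months. A short sketch that recomputes them from the monthly numbers quoted above (the variable and function names are mine, chosen for illustration):

```python
# (typeset, published) pairs for May, June and July, as quoted in the text.
MONTHLY_COUNTS = {
    "BMC Systems Biology": [(7, 26), (3, 24), (1, 15)],
    "BMC Bioinformatics": [(6, 52), (7, 36), (1, 32)],
    "BMC Genomics": [(17, 56), (14, 74), (11, 67)],
}


def pooled_percentage(monthly):
    """Percentage typeset across all months, pooling counts rather than averaging rates."""
    typeset = sum(done for done, _ in monthly)
    published = sum(total for _, total in monthly)
    return 100.0 * typeset / published


if __name__ == "__main__":
    for journal, monthly in MONTHLY_COUNTS.items():
        print(f"{journal}: {pooled_percentage(monthly):.1f}% typeset")
```

Pooling the counts weights each paper equally, which avoids letting a month with few papers dominate the figure.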

I went on to check other BioMed Central, Chemistry Central, and SpringerOpen journals, all of which are open access journals owned by Springer. The results were the same. The percentages of papers from May-July that had been typeset were 6.2%, 20.0%, and 9.0% for Proteome Science, Chemistry Central Journal, and Critical Ultrasound Journal, respectively.

To make a long, depressing story short, I should expect to wait for at least another three months before I see a typeset version of my paper. Can someone please remind me why we, the researchers, pay for this?

Full disclosure: I am an associate editor of PLoS Computational Biology.

Analysis: Is PeerJ cheaper than other Open Access journals?

The newly announced Open Access journal PeerJ has caused quite a fuss, not least because of their catch phrase: “If we can set a goal to sequence the human genome for $99, then why not $99 for scholarly publishing?”

This at first sounds very cheap; however, the $99 is not what you pay per accepted paper. PeerJ operates under a different scheme than traditional Open Access journals: instead of paying per publication, you pay a one-time fee to be able to publish in PeerJ for life. This sounds almost too good to be true.

There are a few catches, however. Firstly, $99 only entitles you to submit one manuscript per year to PeerJ. If you want to be able to submit two manuscripts per year or unlimited manuscripts, the price rises to $169 and $259 respectively.
Secondly, all authors on a manuscript must be paying PeerJ members at the time of submission (except if there are more than twelve authors, in which case it is enough that 12 of them are members). This suddenly makes the comparison to other Open Access journals much more complex, as the actual average price per manuscript depends on the number of authors, the number of other PeerJ manuscripts submitted by the same authors in their lifetime, and the acceptance rate of PeerJ. In this post I try to do the math and compare PeerJ to traditional Open Access journals, where you pay per accepted publication.

PeerJ compares itself to PLoS ONE, so I base all comparisons on that. From 2006, when PLoS ONE was launched, up to and including 2011, a total of 29,042 publications appeared with a total of 150,020 authorships. This amounts to an average of 5.2 authors per publication (5.166 before rounding). When PeerJ is initially launched, no authors will have the benefit of already being members, so at first the authors of a manuscript will together have to pay an average of $99*5.166 = $511 per submitted manuscript (ignoring the discount on manuscripts with 12+ authors). What matters for the comparison, however, is the cost per accepted paper, which depends on the acceptance rate. According to the PeerJ FAQ, the acceptance rate is expected to be approximately 70%. Assuming that this holds true, the average cost incurred by the authors per accepted paper will be $511/0.7 = $730. This is already considerably less than PLoS ONE, which has a publication fee of $1350 per accepted paper. From a pure cost point-of-view, PeerJ thus looks to be about half the price of PLoS ONE.
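The calculation can be written out as a small Python sketch. The constants are the figures quoted above; the function name and the way the 12-member cap is modeled are my own assumptions, and the script works with unrounded intermediate values, so its output can differ by a dollar from the rounded figures in the text.

```python
PLOS_ONE_PUBLICATIONS = 29_042   # PLoS ONE papers, 2006-2011
PLOS_ONE_AUTHORSHIPS = 150_020   # authorships on those papers
MEMBERSHIP_FEE = 99.0            # cheapest PeerJ plan, in dollars
ACCEPTANCE_RATE = 0.7            # expected rate according to the PeerJ FAQ
PLOS_ONE_FEE = 1350.0            # PLoS ONE fee per accepted paper


def cost_per_accepted_paper(n_authors, fee=MEMBERSHIP_FEE,
                            acceptance_rate=ACCEPTANCE_RATE, max_paying=12):
    """Expected membership cost per accepted paper when no author is yet a member.

    At most `max_paying` authors need to be paying members, and the cost of
    rejected submissions is spread over the accepted ones.
    """
    paying_authors = min(n_authors, max_paying)
    return fee * paying_authors / acceptance_rate


if __name__ == "__main__":
    mean_authors = PLOS_ONE_AUTHORSHIPS / PLOS_ONE_PUBLICATIONS
    cost = cost_per_accepted_paper(mean_authors)
    print(f"mean authors per paper: {mean_authors:.3f}")
    print(f"cost per submission:    ${MEMBERSHIP_FEE * mean_authors:.2f}")
    print(f"cost per accepted:      ${cost:.2f}")
    print(f"relative to PLoS ONE:   {cost / PLOS_ONE_FEE:.2f}")
```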

I do have some concerns related to the model of charging per author. First, I find it to be illogical, since the actual costs related to handling a manuscript are independent of the number of authors. Second, the average number of authors per paper varies between research fields, which implies that the average fee per manuscript will in some fields be higher than $730. For a manuscript with 12 authors, none of whom are already PeerJ members, the fee per accepted manuscript is $99*12/0.7 = $1697, which is more expensive than PLoS ONE. Third, the new model gives a direct financial incentive to not include authors who made minor contributions.

In summary, I think PeerJ is a refreshing new idea – I can only applaud efforts to lower the price of scientific publishing. However, although $99 for scientific publishing sounds revolutionarily cheap, PeerJ will at first only be ~2x cheaper than PLoS ONE. Also, the new payment model, which effectively boils down to a per-author charge, is in my opinion not without its own problems.

Full disclosure: I am an associate editor of PLoS Computational Biology.

Analysis: Christmas no longer in vogue!

I have just made an alarming discovery: judging from the biomedical literature, researchers appear to increasingly ignore Christmas.

My plan was to make a funny Christmas post looking at trivialities such as when during the year Christmas-related papers are published. To this end, I did a trivial text-mining analysis that pulled out all papers mentioning “Christmas”, “Xmas”, or “X-mas” in the title or abstract. As a first check of the data, I looked at how many papers were published each year and was surprised to find only 20-30 in a typical year. To eliminate random fluctuations due to the low counts, I thus binned the data into decades before plotting the temporal trend (black dots are actual data points, red curve is a quadratic trendline):

The shocking result is that the frequency of Christmas-related papers has steadily dropped to less than half of what it was in the 1950s!

How can this be? I can think of several possibilities, and you are welcome to suggest more in the comments:

  • We are running out of new funny things to say about Christmas.
  • An increasing proportion of researchers come from countries in which Christmas is not widely celebrated.
  • Researchers have collectively stopped believing in Santa, as funding has dried up.

Merry Christmas Everyone!

Analysis: Toward doing science

Yesterday, Rangarajan and coworkers published a paper in BMC Bioinformatics entitled “Toward an interactive article: integrating journals and biological databases”. Not many hours later Neil Saunders posted the following tweet commenting on it:

Can we ban use of "toward(s)" in article titles?

This reminded me of a draft blog post that I wrote in 2008 on the use of the word “toward(s)” in article titles, and I decided that it was time to update the plot and finally publish it. The background was that I had the gut feeling that there was a somewhat disturbing trend, namely that more and more papers use these words in the title. I thus went to Medline and counted the fraction of papers from each year having a title starting with “toward” or “towards” (I also included titles in which “toward(s)” appeared directly after a colon, semicolon, or dash):

The plot shows that the fraction of articles with “toward(s)” in the title is rapidly rising; it has more than tripled over the past two decades. There is thus no doubt that the use of “toward(s)” in article titles is a trend in biomedical publishing.
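For the curious, the title filter can be approximated with a single regular expression. This is a sketch rather than the exact rule I used: the pattern, the names, and the decision to treat a hyphen as a dash only when it has whitespace on both sides are assumptions.

```python
import re
from collections import defaultdict

# "toward(s)" at the start of the title, or directly after a colon, semicolon,
# en/em dash, or a hyphen surrounded by whitespace (assumed to act as a dash).
TOWARD_TITLE = re.compile(
    r"(?:^\s*|[:;\u2013\u2014]\s*|\s-\s+)towards?\b",
    re.IGNORECASE,
)


def is_toward_title(title):
    """True if the title, or a subtitle after :, ; or a dash, starts with toward(s)."""
    return TOWARD_TITLE.search(title) is not None


def fraction_by_year(records):
    """Fraction of titles per year that match; `records` holds (year, title) pairs."""
    total = defaultdict(int)
    hits = defaultdict(int)
    for year, title in records:
        total[year] += 1
        if is_toward_title(title):
            hits[year] += 1
    return {year: hits[year] / total[year] for year in sorted(total)}


if __name__ == "__main__":
    examples = [
        "Toward an interactive article: integrating journals and biological databases",
        "Systems biology: towards a predictive model",
        "A step toward better annotation",
    ]
    for title in examples:
        print(is_toward_title(title), "-", title)
```

Titles that merely contain the word somewhere in the middle, like the last example, are deliberately not counted.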

As is often the case with statistics, though, this analysis answers only one question but leads to several new ones. Are we increasingly selling our papers on what we hope to do soon rather than on what we have actually done? Or have we just become more honest by now adding the word “toward(s)” where we might have left it out in the past?