Analysis: Automatic recognition of Human Phenotype Ontology terms in text

This afternoon, an article entitled “Automatic concept recognition using the Human Phenotype Ontology reference and test suite corpora” showed up in my RSS reader. It describes a new gold-standard corpus for named entity recognition of Human Phenotype Ontology (HPO) terms. The article also presents results from evaluating three automatic HPO term recognizers, namely NCBO Annotator, OBO Annotator and Bio-LarK CR.

I thought it would be a fun challenge to see how good an HPO tagger I could produce in one afternoon. Long story short, here is what I did in five hours:

  • Downloaded the HPO ontology file and converted it to dictionary files for the tagger.
  • Generated orthographic variants of terms by changing the order of sub terms, converting between Arabic and Roman numerals, and constructing plural forms.
  • Used the tagger to match the resulting dictionary against all of Medline to identify frequently occurring matches.
  • Constructed a list of stop words by manually inspecting all matching strings with more than 25,000 occurrences in PubMed.
  • Tagged the gold-standard corpus making use of the dictionary and stop-word list and compared the results to the manual reference annotations.

My tagger produced 1183 annotations on the corpus, 786 of which correspond to the 1933 human annotations (requiring exact coordinate matches and HPO term normalization). This amounts to a precision of 66%, a recall of 41%, and an F1 score of 50%. This places my system right in the middle between NCBO Annotator (precision=54%, recall=39%, F1=45%) and the best performing system Bio-LarK CR (65% precision, 49% recall, F1=56%).
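As a quick check of the arithmetic, the three scores can be recomputed from the raw counts quoted above. This is only a minimal sketch; the function name is illustrative and not part of any of the evaluated tools.

```python
def precision_recall_f1(n_predicted, n_reference, n_correct):
    """Return (precision, recall, F1) from raw annotation counts."""
    precision = n_correct / n_predicted   # fraction of tagger annotations that are correct
    recall = n_correct / n_reference      # fraction of human annotations that were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return precision, recall, f1


# Counts from the text: 1183 tagger annotations, 1933 human annotations, 786 in agreement.
p, r, f1 = precision_recall_f1(1183, 1933, 786)
print(f"precision={p:.0%} recall={r:.0%} F1={f1:.0%}")
```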

Not too shabby for five hours of work, if I may say so myself, and a good reminder of how much can be achieved in very limited time by taking a simple, pragmatic approach.

Analysis: Does a publication matter?

This may seem a strange question to ask for someone working in academia – of course a publication matters, especially if it is cited a lot. However, when it comes to publications about web resources, publications and citations in my opinion mainly serve as somewhat odd proxies on my CV for what really matters: the web resources themselves and how much they are used.

Still, one could hope that a publication about a new web resource would make people aware of its existence and thus attract more users. To analyze this, I took a look at the user statistics of our recently developed resource COMPARTMENTS:

COMPARTMENTS user statistics

Before we published a paper about it, the web resource had fewer than 5 unique users per day. Our paper about the resource was accepted on January 26 in the journal Database, after which usage increased to about 10 unique users on a typical weekday. The spike of 41 unique users in a single day was due to me teaching a course.

So what happened at the end of June that gave a more than 10-fold increase in the number of users from one day to the next? A new version of GeneCards was released with links to COMPARTMENTS. It seems safe to conclude that the peer-reviewed literature is not where most researchers discover new tools.

Analysis: Science used to be simpler

I guess most people have a feeling that life used to be simpler in the past. The other day it occurred to me that we researchers very often talk about how advanced our methods are, although simple methods are in many cases preferable.

So this morning I resorted to my usual strategy for analyzing such things, namely counting in Medline. More specifically I calculated for each year the percentage of publication titles that contain the words “simple” and “advanced”, respectively. In the plot below, the dots show the values for each year and the lines show five-year running averages thereof (click for PDF version):


As can be clearly seen, life as a researcher was indeed simpler in the 50s and 60s.
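The counting described above could be sketched roughly as follows. This is not the original script: the input format (an iterable of `(year, title)` pairs), whole-word case-insensitive matching, and a centered five-year window that is truncated at the ends are all assumptions made for illustration.

```python
import re
from collections import Counter


def yearly_percentage(records, word):
    """Percentage of titles per year that contain `word` as a whole word."""
    pattern = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
    total, hits = Counter(), Counter()
    for year, title in records:
        total[year] += 1
        if pattern.search(title):
            hits[year] += 1
    return {year: 100.0 * hits[year] / total[year] for year in sorted(total)}


def running_average(series, window=5):
    """Centered running average over years (assumed windowing; truncated at the ends)."""
    half = window // 2
    smoothed = {}
    for year in series:
        # Average only over years inside the window that actually have data.
        values = [series[y] for y in range(year - half, year + half + 1) if y in series]
        smoothed[year] = sum(values) / len(values)
    return smoothed


# Toy records for illustration only; real input would be all Medline titles.
toy_records = [
    (1950, "A simple method for staining"),
    (1950, "Advanced tuberculosis in adults"),
    (1951, "Simple and rapid assay"),
    (1952, "Observations on renal function"),
]
simple_by_year = yearly_percentage(toy_records, "simple")
simple_smoothed = running_average(simple_by_year, window=3)
```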

Analysis: When will your BMC paper be typeset?

One month ago, people from Jan Gorodkin’s group and my own group published a paper in BMC Systems Biology. This happened after a very long process during which we were very close to withdrawing the manuscript and sending it elsewhere due to inaction by the editor. In the end it got accepted, but even now there is only the provisional PDF available. The paper has still not been typeset.

Typesetting is one of very few things an online-only journal does to add value. Publishers often claim to add value by organizing peer review, but if you think about it, they pass the manuscript to an unpaid editor who subsequently recruits unpaid referees to review it. Careful copyediting and typesetting of the final, accepted manuscript is thus in my view the only hands-on work that most journals do for their considerable article-processing charge. Neil Saunders’ recent blog post “We really don’t care what statistical method you used” illustrates well the care with which copy editing is done. We are thus down to only one service actually done by the publishers: typesetting the manuscript to produce XML, HTML, and PDF versions of it.

You would thus hope that typesetting at least happens promptly once a manuscript is accepted and the authors have paid. However, I have been frustrated to find that both my own manuscript in BMC Systems Biology and many manuscripts that I have downloaded from BMC journals exist only as provisional PDFs even months after publication. I thus decided to quantify the extent to which typesetting of papers is delayed. To this end, I considered all papers published in each journal during the months May-July this year and calculated what percentage of them had been typeset by now.

Starting with BMC Systems Biology, here are the numbers: 7 of 26 papers from May, 3 of 24 papers from June, and 1 of 15 papers from July have been typeset to date. The numbers for BMC Bioinformatics turned out to be just as disappointing: 6 of 52, 7 of 36 and 1 of 32 papers from May, June, and July have been typeset so far. And BMC Genomics confirmed the trend: 17 of 56, 14 of 74, and 11 of 67 are the numbers for May, June, and July. This adds up to only 16.9%, 11.7%, and 21.3% of papers from May-July having been typeset by BMC Systems Biology, BMC Bioinformatics, and BMC Genomics, respectively.
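For anyone who wants to repeat the tally, the pooled percentages can be recomputed from the per-month counts as printed above. This is a minimal sketch; the dictionary layout and function name are illustrative choices.

```python
# (typeset, published) per month for May, June, July, as given in the text.
counts = {
    "BMC Systems Biology": [(7, 26), (3, 24), (1, 15)],
    "BMC Bioinformatics": [(6, 52), (7, 36), (1, 32)],
    "BMC Genomics": [(17, 56), (14, 74), (11, 67)],
}


def pooled_percentage(monthly):
    """Percentage typeset over all months combined (not the mean of monthly percentages)."""
    typeset = sum(t for t, _ in monthly)
    published = sum(p for _, p in monthly)
    return 100.0 * typeset / published


for journal, monthly in counts.items():
    print(f"{journal}: {pooled_percentage(monthly):.1f}% typeset")
```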

I went on to check other journals from BioMed Central, Chemistry Central, and SpringerOpen, all of which are open access imprints owned by Springer. The results were the same. The percentages of papers from May-July that had been typeset were 6.2%, 20.0%, and 9.0% for Proteome Science, Chemistry Central Journal, and Critical Ultrasound Journal, respectively.

To make a long, depressing story short, I should expect to wait for at least another three months before I see a typeset version of my paper. Can someone please remind me why we, the researchers, pay for this?

Full disclosure: I am an associate editor of PLoS Computational Biology.

Analysis: Is PeerJ cheaper than other Open Access journals?

The newly announced Open Access journal PeerJ has caused quite a fuss, not least because of their catch phrase: “If we can set a goal to sequence the human genome for $99, then why not $99 for scholarly publishing?”

This at first sounds very cheap; however, the $99 is not what you pay per accepted paper. PeerJ operates under a different scheme than traditional Open Access journals: instead of paying per publication, you pay a one-time fee to be able to publish in PeerJ for life. This sounds almost too good to be true.

There are a few catches, however. Firstly, $99 only entitles you to submit one manuscript per year to PeerJ. If you want to be able to submit two manuscripts per year or unlimited manuscripts, the price rises to $169 and $259 respectively.

Secondly, all authors on a manuscript must be paying PeerJ members at the time of submission (except if there are more than 12 authors, in which case it is enough that 12 of them are members). This suddenly makes the comparison to other Open Access journals much more complex, as the actual average price per manuscript depends on the number of authors, the number of other PeerJ manuscripts submitted by the same authors in their lifetime, and the acceptance rate of PeerJ. In this post I try to do the math and compare PeerJ to traditional Open Access journals, where you pay per accepted publication.

PeerJ compares itself to PLoS ONE, so I base all comparisons on that. From 2006, when PLoS ONE was launched, up to and including 2011, a total of 29,042 publications appeared with a total of 150,020 authorships. This amounts to an average of 5.2 authors per publication (150,020/29,042 ≈ 5.166). When PeerJ is initially launched, no authors will have the benefit of already being members, so at first the authors of a manuscript will together have to pay an average of $99*5.166 = $511 per submitted manuscript (ignoring the discount on manuscripts with 12+ authors). Not every submitted manuscript will be accepted, however, so the cost per accepted paper also depends on the acceptance rate. According to the PeerJ FAQ, this is expected to be approximately 70%. Assuming that this holds true, the average cost incurred by the authors per accepted paper will be $511/0.7 = $730. This is already considerably less than PLoS ONE, which has a publication fee of $1350 per accepted paper. From a pure cost point-of-view, PeerJ thus looks to be about half the price of PLoS ONE.
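The estimate above can be written as a small function, which makes it easy to try other author counts or acceptance rates. This is a sketch under the same simplifying assumptions as the text: every author buys the $99 plan, nobody is already a member, at most 12 authors need to pay, and 70% of submissions are accepted. The function and parameter names are made up for illustration.

```python
def cost_per_accepted_paper(n_authors, fee_per_author=99.0,
                            acceptance_rate=0.7, max_paying_authors=12):
    """Expected membership fees paid per accepted paper when no author is a member yet."""
    paying = min(n_authors, max_paying_authors)  # only 12 authors must be members
    cost_per_submission = fee_per_author * paying
    # On average, 1 / acceptance_rate submissions are needed per accepted paper.
    return cost_per_submission / acceptance_rate


# Average number of authors per PLoS ONE paper, 2006-2011 (figures from the text).
avg_authors = 150_020 / 29_042

print(f"average paper:   ${cost_per_accepted_paper(avg_authors):.0f}")
print(f"12-author paper: ${cost_per_accepted_paper(12):.0f}")
print(f"ratio to $1350:  {cost_per_accepted_paper(avg_authors) / 1350:.2f}")
```

Because the function carries the unrounded average through the whole calculation, its result for the average paper can differ by a dollar or so from the figure obtained by rounding the intermediate $511.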

I do have some concerns related to the model of charging per author. First, I find it to be illogical, since the actual costs related to handling a manuscript are independent of the number of authors. Second, the average number of authors per paper varies between research fields, which implies that the average fee per manuscript will in some fields be higher than $730. For a manuscript with 12 authors, none of whom are already PeerJ members, the fee per accepted manuscript is $99*12/0.7 = $1697, which is more expensive than PLoS ONE. Third, the new model gives a direct financial incentive to not include authors who made minor contributions.

In summary, I think PeerJ is a refreshing new idea – I can only applaud efforts to lower the price of scientific publishing. However, although $99 for scientific publishing sounds revolutionarily cheap, PeerJ will at first only be ~2x cheaper than PLoS ONE. Also, the new payment model, which effectively boils down to a per-author charge, is in my opinion not without its own problems.

Full disclosure: I am an associate editor of PLoS Computational Biology.

Analysis: Christmas no longer in vogue!

I have just made an alarming discovery: judging from the biomedical literature, researchers appear to increasingly ignore Christmas.

My plan was to make a funny Christmas post looking at trivialities such as when during the year Christmas-related papers are published. To this end, I did a trivial text-mining analysis that pulled out all papers mentioning “Christmas”, “Xmas”, or “X-mas” in the title or abstract. As a first check of the data, I looked at how many papers were published each year and was surprised to find only 20-30 in a typical year. To reduce random fluctuations due to the low counts, I thus binned the data into decades before plotting the temporal trend (black dots are actual data points, red curve is a quadratic trendline):

The shocking result is that the frequency of Christmas-related papers has steadily dropped to less than half of what it was in the 1950s!

How can this be? I can think of several possibilities, and you are welcome to suggest more in the comments:

  • We are running out of new funny things to say about Christmas.
  • An increasing proportion of researchers come from countries in which Christmas is not widely celebrated.
  • Researchers have collectively stopped believing in Santa, as funding has dried up.

Merry Christmas Everyone!

Analysis: Toward doing science

Yesterday, Rangarajan and coworkers published a paper in BMC Bioinformatics entitled “Toward an interactive article: integrating journals and biological databases”. Not many hours later Neil Saunders made the following tweet commenting on it:

Can we ban use of "toward(s)" in article titles?

This reminded me of a draft blog post that I wrote in 2008 on the use of the word “toward(s)” in article titles, and I decided that it was time to update the plot and finally publish it. The background was that I had the gut feeling that there was a somewhat disturbing trend, namely that more and more papers use these words in the title. I thus went to Medline and counted the fraction of papers from each year having a title starting with “toward” or “towards” (I also included them if towards appeared inside the title following a colon, semicolon, or dash):

The plot shows that the fraction of articles with “toward(s)” in the title is rapidly rising; it has more than tripled over the past two decades. There is thus no doubt that the use of “toward(s)” in article titles is a trend in biomedical publishing.
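The title filter described above could be approximated with a regular expression along these lines. This is a sketch rather than the exact rule used for the plot: in particular, requiring whitespace before a dash (so that hyphenated words are not counted) and matching case-insensitively are assumptions.

```python
import re

# "toward" or "towards" at the very start of the title, or directly after a colon,
# a semicolon, or a dash that is preceded by whitespace (hyphen, en dash, em dash).
TOWARD_TITLE = re.compile(
    r"(?:^\s*|[:;]\s*|\s[-\u2013\u2014]\s*)towards?\b",
    re.IGNORECASE,
)


def is_toward_title(title):
    """True if the title (or its subtitle) begins with 'toward' or 'towards'."""
    return TOWARD_TITLE.search(title) is not None


def toward_fraction(titles):
    """Fraction of titles flagged by is_toward_title; 0.0 for an empty list."""
    titles = list(titles)
    if not titles:
        return 0.0
    return sum(is_toward_title(t) for t in titles) / len(titles)


# Toy examples; only the first title is taken from the post itself.
examples = [
    "Toward an interactive article: integrating journals and biological databases",
    "Protein folding: towards a predictive model",
    "Moving toward a cure for disease X",
    "A survey of methods",
]
print(toward_fraction(examples))
```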

As is often the case with statistics, though, this analysis answers only one question but leads to several new ones. Are we increasingly selling our papers on what we hope to do soon rather than on what we have actually done? Or have we just become more honest by now adding the word “toward(s)” where we might have left it out in the past?

Analysis: 10butnotMe

About five years ago George Church announced the Personal Genome Project (PGP). A very interesting aspect of this project is that all data are released under the Creative Commons Zero waiver. This includes not only the genetic data, but also some medical information and even the identity of each individual.

Although PGP has enrolled more than a thousand individuals, it is presently only possible to download data on ten individuals. It is obviously pointless to attempt to link genotype to phenotype based on such a small number of individuals. However, I wondered if any meaningful structure would emerge if I calculated the Hamming distances for all pairs of individuals, that is, the number of SNPs by which they differ (download).

No sooner said than done. I downloaded all available SNP data from PGP (including array and exome sequencing data), calculated all pairwise SNP distances, and visualized the results as a heatmap along with the faces of the individuals (click for a larger version of the figure):

Number of SNP differences between PGP10 individuals

Individual #10 stands out as being genetically most dissimilar from everyone else, which is unsurprising as he is the only African American in the study. I next tried to similarly define the genetically most average individual, that is, the individual that is most similar to everyone else. If one defines this as the individual with the lowest sum of differences, the answer is individual #7. However, because the origins of his grandparents are unknown, it is difficult to conclude anything interesting based on this.
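The two calculations above, pairwise SNP distances and the individual with the lowest sum of differences, can be sketched as follows. The data layout (one dictionary of SNP identifier to genotype call per individual) and the decision to compare only SNPs called in both individuals are assumptions, since the handling of missing calls is not described here. The toy SNP identifiers and genotypes are invented.

```python
from itertools import combinations


def snp_distance(a, b):
    """Number of SNPs, among those called in both individuals, with different genotypes."""
    shared = a.keys() & b.keys()
    return sum(1 for snp in shared if a[snp] != b[snp])


def pairwise_distances(genotypes):
    """Distance for every unordered pair of individuals."""
    return {
        (i, j): snp_distance(genotypes[i], genotypes[j])
        for i, j in combinations(sorted(genotypes), 2)
    }


def most_average(genotypes):
    """Individual with the lowest sum of distances to all other individuals."""
    totals = {name: 0 for name in genotypes}
    for (i, j), d in pairwise_distances(genotypes).items():
        totals[i] += d
        totals[j] += d
    return min(totals, key=totals.get)


# Invented toy data: three individuals, three SNPs.
toy = {
    "A": {"snp1": "AA", "snp2": "AG", "snp3": "GG"},
    "B": {"snp1": "AA", "snp2": "GG", "snp3": "GG"},
    "C": {"snp1": "AG", "snp2": "GG", "snp3": "TT"},
}
toy_distances = pairwise_distances(toy)
toy_average = most_average(toy)
```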

Analysis: Three-dimensional DNA structure

A few months ago Bill Noble’s lab at the University of Washington published a letter in Nature on a three-dimensional model of the complete nuclear genome of budding yeast:

A three-dimensional model of the yeast genome

Layered on top of information conveyed by DNA sequence and chromatin are higher order structures that encompass portions of chromosomes, entire chromosomes, and even whole genomes. Interphase chromosomes are not positioned randomly within the nucleus, but instead adopt preferred conformations. Disparate DNA elements co-localize into functionally defined aggregates or ‘factories’ for transcription and DNA replication. In budding yeast, Drosophila and many other eukaryotes, chromosomes adopt a Rabl configuration, with arms extending from centromeres adjacent to the spindle pole body to telomeres that abut the nuclear envelope. Nonetheless, the topologies and spatial relationships of chromosomes remain poorly understood. Here we developed a method to globally capture intra- and inter-chromosomal interactions, and applied it to generate a map at kilobase resolution of the haploid genome of Saccharomyces cerevisiae. The map recapitulates known features of genome organization, thereby validating the method, and identifies new features. Extensive regional and higher order folding of individual chromosomes is observed. Chromosome XII exhibits a striking conformation that implicates the nucleolus as a formidable barrier to interaction between DNA sequences at either end. Inter-chromosomal contacts are anchored by centromeres and include interactions among transfer RNA genes, among origins of early DNA replication and among sites where chromosomal breakpoints occur. Finally, we constructed a three-dimensional model of the yeast genome. Our findings provide a glimpse of the interface between the form and function of a eukaryotic genome.

Having previously worked with predicted 3D structure of DNA, such as intrinsic curvature, I was intrigued by the availability of a 3D structure of a complete eukaryotic genome. Based on past analyses of 1D distances in DNA, I expected that the 3D distance between two genes in the genome would correlate with expression, protein interactions, and metabolic pathways.

To test if 3D neighborhood correlates with function and/or regulation, I collected three large sets of protein pairs, namely pairs of co-expressed genes from the STRING database (Pearson correlation coefficient >0.7), interacting protein pairs from the BioGRID database, and pairs of genes assigned to the same pathway by the KEGG database. I subsequently mapped these onto the set of 3D neighbors listed in the supplementary information of the paper, including only 3D neighbors on different chromosomes (in order to eliminate correlations caused by 1D rather than 3D distance). I also mapped the three sets of gene pairs onto a shuffled version of the 3D neighbors, in order to estimate the overlaps that can be expected at random. The results are summarized in the table below:

                        3D neighbors   Shuffled neighbors
Coexpressed (STRING)              58                   61
Interacting (BioGRID)           2151                 2122
Same pathway (KEGG)              357                  344

To make a long story short, the numbers show that 3D genomic neighbors appear to be no more likely to be coexpressed, to interact, or to be involved in the same pathway than random pairs. It could be that the way I performed the analysis is too simplistic or that the data are too noisy to show a signal. However, it is also possible that the 3D structural organization of the genome simply doesn’t have much impact on gene regulation and function.
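For readers who want to try a comparison of this kind, here is one way to count overlaps against real and shuffled neighbor lists. The shuffling scheme is an assumption: it permutes gene labels among the genes that occur in the neighbor list, which keeps the number of pairs and the network shape while randomizing which genes are neighbors. The procedure actually used for the table may have differed, and the gene names below are placeholders.

```python
import random


def as_pair_set(pairs):
    """Unordered, de-duplicated gene pairs; self-pairs are dropped."""
    return {frozenset(p) for p in pairs if p[0] != p[1]}


def overlap(functional_pairs, neighbor_pairs):
    """Number of functional pairs that are also 3D neighbors."""
    return len(as_pair_set(functional_pairs) & as_pair_set(neighbor_pairs))


def shuffle_labels(neighbor_pairs, rng):
    """Randomly permute gene labels across the neighbor list (assumed shuffling scheme)."""
    genes = sorted({g for pair in neighbor_pairs for g in pair})
    permuted = genes[:]
    rng.shuffle(permuted)
    mapping = dict(zip(genes, permuted))
    return [(mapping[a], mapping[b]) for a, b in neighbor_pairs]


# Placeholder data for illustration only.
neighbors = [("g1", "g2"), ("g2", "g3"), ("g4", "g5"), ("g5", "g6")]
coexpressed = [("g2", "g1"), ("g3", "g6"), ("g4", "g5")]

rng = random.Random(0)  # fixed seed so the shuffle is reproducible
shuffled = shuffle_labels(neighbors, rng)
observed = overlap(coexpressed, neighbors)
expected_at_random = overlap(coexpressed, shuffled)
```

Repeating the shuffle many times and looking at the spread of the random overlaps would give a rough sense of whether an observed difference, such as 58 versus 61, is within noise.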

Analysis: Half of published URLs are dysfunctional a decade later

As a small aside when setting up a local mirror of Medline, I extracted 15,915 URLs that were mentioned in the abstracts. Checking them revealed that 12,354 of them (78%) were functional, which may not seem that bad. However, plotting the percentage of dysfunctional URLs as a function of publication year reveals a less pleasant trend:

Dysfunctional URLs

After just 10 years, half of all published URLs are no longer functional, and do not redirect to the new location of the service (if one exists). The fairly high success rate overall is merely a consequence of most URLs having been published within the last few years. Unless the persistence of URLs is improving (which I see no sign of in the plot), we can thus expect to have thousands of URLs in the published literature that are no longer valid.

Edit: Andrew Lang pointed out a similar study of URLs cited in communications journals.

Edit: Duncan Hull pointed out a paper on URL decay in Medline by Jonathan Wren, which reminded me of an even earlier paper on the topic.