Tag Archives: publishing

Commentary: The sad tale of MutaDATABASE

The problem of bioinformatics web resources dying or moving is well known. It has been quantified in two interesting papers by Jonathan Wren entitled “404 not found: the stability and persistence of URLs published in MEDLINE” and “URL decay in MEDLINE — a 4-year follow-up study”. There is also a discussion on the topic at Biostar.

The resources discussed in these papers at least existed in an operational form at the time of publication, even if they have since perished. The same cannot be said about MutaDATABASE, which in 2011 was published in Nature Biotechnology as a correspondence entitled “MutaDATABASE: a centralized and standardized DNA variation database”. Fellow blogger Neil Saunders was quick to pick up on the fact that this database was an empty shell, but generously gave the authors the benefit of the doubt in his closing statement:

Who knows, MutaDatabase may turn out to be terrific. Right now though, it’s rather hard to tell. The database and web server issues of Nucleic Acids Research require that the tools described be functional for review and publication. Apparently, Nature Biotechnology does not.

Now, almost five years after the original publication, I think it is fair to follow up. Unfortunately, MutaDATABASE did not turn out to be terrific. Instead, it turned out just not to be. In March 2014, about three years after the publication, www.mutadatabase.org looked like this:
MutaDATABASE in 2014

By the end of 2015, the website had mutated into this:
MutaDATABASE in 2015

To quote Joel Spolsky: “Shipping is a feature. A really important feature. Your product must have it.” This also applies to biological databases and other bioinformatics resources, which is why journals would be wise never to publish any resource without this crucial feature.

Analysis: Does a publication matter?

This may seem a strange question coming from someone working in academia – of course a publication matters, especially if it is cited a lot. However, when it comes to web resources, the publications and citations in my opinion mainly serve as somewhat odd proxies on my CV for what really matters: the resources themselves and how much they are used.

Still, one could hope that a publication about a new web resource would make people aware of its existence and thus attract more users. To analyze this, I took a look at the user statistics of our recently developed resource COMPARTMENTS:

COMPARTMENTS user statistics

Before publishing a paper about it, the web resource had fewer than 5 unique users per day. Our paper about the resource was accepted on January 26 in the journal Database, which increased the usage to about 10 unique users on a typical weekday. The spike of 41 unique users in a single day was due to my teaching a course.

So what happened at the end of June that gave a more than 10-fold increase in the number of users from one day to the next? A new version of GeneCards was released with links to COMPARTMENTS. It seems safe to conclude that the peer-reviewed literature is not where most researchers discover new tools.

Commentary: The 99% of scientific publishing

Last week, John P. A. Ioannidis from Stanford University and Kevin W. Boyack and Richard Klavans from SciTech Strategies, Inc. published an interesting analysis of scientific authorships. In the PLOS ONE paper “Estimates of the Continuously Publishing Core in the Scientific Workforce” they describe a small influential core of <1% of researchers who publish each and every year. This analysis appears to have caught the attention of many, including Erik Stokstad from Science Magazine, who wrote the short news story “The 1% of scientific publishing”.

You would be excused for thinking that I belong to the 1%. I published my first paper in 1998 and have published at least one paper every single year since then. However, it turns out that the 1% was defined as the researchers who had published at least one paper every year in the period 1996-2011. Since I published my first paper in 1998, I belong to the other 99%, together with everyone else who started their publishing career after 1996 or stopped it before 2011.

Although the number 1% is making the headlines, the authors seem to be aware of the issue. Of the 15,153,100 researchers with publications in the period 1996-2011, only 150,608 published in all 16 years; however, the authors estimate that an additional 16,877 scientists published every year in the period 1997-2012. A similar number of continuously publishing scientists will presumably have started their careers in each of the other years from 1998 to 2011. Likewise, an estimated 9,673 researchers ended a long, continuous publishing career in 2010, and presumably a comparable number did so in each of the other years in the period 1996-2009. In my opinion, a better estimate is thus that 150,608 + 15*16,877 + 15*9,673 = 548,858 of the 15,153,100 authors have had or will have a 16-year unbroken chain of publications. That amounts to something in the 3-4% range.
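The back-of-the-envelope arithmetic is easy to redo; here is a minimal sketch using only the numbers quoted above (the assumption that every later start year and every earlier end year contributes a cohort of roughly the same size is the simplification made above):

```python
# Rough re-estimate of how many authors have, or will have, a 16-year
# unbroken publication streak, using the figures quoted from the paper.
total_authors = 15_153_100   # authors with >=1 paper in 1996-2011
core_1996_2011 = 150_608     # published in every year 1996-2011
started_1997 = 16_877        # estimated streaks covering 1997-2012
ended_2010 = 9_673           # estimated streaks ending in 2010

# Assume a similarly sized cohort for each of the 15 possible later start
# years (1997-2011) and each of the 15 possible earlier end years (1996-2010).
streaks = core_1996_2011 + 15 * started_1997 + 15 * ended_2010

print(streaks)                            # 548858
print(f"{streaks / total_authors:.1%}")   # ~3.6%
```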

That number may still not sound impressive; however, this in no way implies that most researchers do not publish on a regular basis. To have a 16-year unbroken chain of publications, one almost has to stay in academia and become a principal investigator. Most people who publish at least one article and subsequently pursue a career in industry or teaching will count towards the 96-97%. And that is no matter how good a job they do, mind you.

Analysis: When will your BMC paper be typeset?

One month ago, people from Jan Gorodkin’s group and my own group published a paper in BMC Systems Biology. This happened after a very long process during which we were very close to withdrawing the manuscript due to inaction by the editor and sending it elsewhere. In the end it was accepted, but even now only a provisional PDF is available. The paper has still not been typeset.

Typesetting is one of very few things an online-only journal does to add value. Publishers often claim to add value by organizing peer review, but if you think about it, they pass the manuscript to an unpaid editor, who subsequently recruits unpaid referees to review it. Careful copyediting and typesetting of the final, accepted manuscript is thus in my view the only hands-on work that most journals do for their considerable article-processing charge. Neil Saunders’ recent blog post “We really don’t care what statistical method you used” illustrates well the care with which copyediting is done. We are thus down to only one service actually performed by the publishers: typesetting the manuscript to produce XML, HTML, and PDF versions of it.

You would thus hope that typesetting at least happens promptly once a manuscript has been accepted and the authors have paid. However, I have been frustrated to find that both my own manuscript in BMC Systems Biology and many manuscripts that I have downloaded from BMC journals exist only as provisional PDFs even months after publication. I thus decided to quantify to what extent typesetting of papers is delayed. To this end, I considered all papers published in each journal during the months May-July this year and calculated what percentage of them had been typeset by now.

Starting with BMC Systems Biology, here are the numbers: 7 of 26 papers from May, 3 of 24 papers from June, and 1 of 15 papers from July have been typeset to date. The numbers for BMC Bioinformatics turned out to be just as disappointing: 6 of 52, 7 of 36, and 1 of 32 papers from May, June, and July have been typeset so far. And BMC Genomics confirmed the trend: 17 of 56, 14 of 74, and 11 of 67 are the numbers for May, June, and July. This adds up to only 16.9%, 11.7%, and 21.3% of papers from May-July having been typeset by BMC Systems Biology, BMC Bioinformatics, and BMC Genomics, respectively.
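For transparency, the bookkeeping behind these percentages is trivial; here is a minimal sketch using the counts quoted above (gathered by hand from the journal websites):

```python
# Typeset vs. total papers per journal for May, June, and July.
counts = {
    "BMC Systems Biology": [(7, 26), (3, 24), (1, 15)],
    "BMC Bioinformatics":  [(6, 52), (7, 36), (1, 32)],
    "BMC Genomics":        [(17, 56), (14, 74), (11, 67)],
}

for journal, months in counts.items():
    typeset = sum(t for t, _ in months)
    total = sum(n for _, n in months)
    print(f"{journal}: {typeset}/{total} typeset = {typeset / total:.1%}")
```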

I went on to check other journals from BioMed Central, Chemistry Central, and SpringerOpen, which are all open access publishers owned by Springer. The results were the same. The percentages of papers from May-July that had been typeset were 6.2%, 20.0%, and 9.0% for Proteome Science, Chemistry Central Journal, and Critical Ultrasound Journal, respectively.

To make a long, depressing story short, I should expect to wait for at least another three months before I see a typeset version of my paper. Can someone please remind me why we, the researchers, pay for this?

Full disclosure: I am an associate editor of PLoS Computational Biology.

Analysis: Is PeerJ cheaper than other Open Access journals?

The newly announced Open Access journal PeerJ has caused quite a buzz, not least because of its catchphrase: “If we can set a goal to sequence the human genome for $99, then why not $99 for scholarly publishing?”

This at first sounds very cheap; however, the $99 is not what you pay per accepted paper. PeerJ operates under a different scheme than traditional Open Access journals: instead of paying per publication, you pay a one-time membership fee that entitles you to publish in PeerJ for life. This sounds almost too good to be true.

There are a few catches, however. Firstly, $99 only entitles you to submit one manuscript per year to PeerJ. If you want to be able to submit two manuscripts per year or unlimited manuscripts, the price rises to $169 and $259 respectively.
Secondly, all authors on a manuscript must be paying PeerJ members at the time of submission (unless there are more than twelve authors, in which case it is enough that twelve of them are members). This suddenly makes the comparison to other Open Access journals much more complex, as the actual average price per manuscript depends on the number of authors, the number of other PeerJ manuscripts submitted by the same authors in their lifetime, and the acceptance rate of PeerJ. In this post I try to do the math and compare PeerJ to traditional Open Access journals, where you pay per accepted publication.

PeerJ compares itself to PLoS ONE, so I base all comparisons on that journal. From 2006, when PLoS ONE was launched, up to and including 2011, a total of 29,042 publications appeared with a total of 150,020 authorships. This amounts to an average of 5.17 authors per publication. When PeerJ is initially launched, no authors will have the benefit of already being members, so at first all authors will have to pay, amounting to an average cost of $99*5.17 ≈ $511 per submitted manuscript (ignoring the discount on manuscripts with 12+ authors). The cost per accepted paper further depends on the acceptance rate, which according to the PeerJ FAQ is expected to be approximately 70%. Assuming that this holds true, the average cost incurred by the authors per accepted paper will be $511/0.7 = $730. This is already considerably less than PLoS ONE, which has a publication fee of $1350 per accepted paper. From a pure cost point of view, PeerJ thus looks to be about half the price of PLoS ONE.
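The comparison can be summarized in a few lines of code; this is only a sketch under the stated assumptions (all authors are first-time members on the cheapest $99 plan, no cap for 12+ authors, and a 70% acceptance rate):

```python
# Back-of-the-envelope PeerJ vs. PLoS ONE cost per accepted paper.
publications = 29_042     # PLoS ONE papers, 2006-2011
authorships = 150_020     # authorships on those papers
membership_fee = 99       # cheapest PeerJ plan (one submission per year)
acceptance_rate = 0.70    # acceptance rate expected in the PeerJ FAQ
plos_one_fee = 1350       # PLoS ONE publication fee per accepted paper

authors_per_paper = authorships / publications              # ~5.17
cost_per_submission = membership_fee * authors_per_paper    # roughly $511
cost_per_accepted = cost_per_submission / acceptance_rate   # roughly $730

print(f"{authors_per_paper:.2f} authors per paper")
print(f"${cost_per_submission:.0f} per submission, ${cost_per_accepted:.0f} per accepted paper")
print(f"versus ${plos_one_fee} per accepted paper at PLoS ONE")
```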

I do have some concerns related to the model of charging per author. First, I find it illogical, since the actual costs of handling a manuscript are independent of the number of authors. Second, the average number of authors per paper varies between research fields, which implies that the average fee per manuscript will in some fields be higher than $730. For a manuscript with 12 authors, none of whom are already PeerJ members, the fee per accepted manuscript is $99*12/0.7 = $1697, which is more expensive than PLoS ONE. Third, the new model gives a direct financial incentive not to include authors who made minor contributions.

In summary, I think PeerJ is a refreshing new idea – I can only applaud efforts to lower the price of scientific publishing. However, although $99 for scientific publishing sounds revolutionarily cheap, PeerJ will at first only be about half the price of PLoS ONE. Also, the new payment model, which effectively boils down to a per-author charge, is in my opinion not without its own problems.

Full disclosure: I am an associate editor of PLoS Computational Biology.

Analysis: Toward doing science

Yesterday, Rangarajan and coworkers published a paper in BMC Bioinformatics entitled “Toward an interactive article: integrating journals and biological databases”. Not many hours later, Neil Saunders made the following tweet commenting on it:

Can we ban use of "toward(s)" in article titles?

This reminded me of a draft blog post that I wrote in 2008 on the use of the word “toward(s)” in article titles, and I decided that it was time to update the plot and finally publish it. The background was my gut feeling that there was a somewhat disturbing trend, namely that more and more papers use these words in the title. I thus went to Medline and counted the fraction of papers from each year having a title starting with “toward” or “towards” (I also included titles in which “toward(s)” appeared after a colon, semicolon, or dash):

The plot shows that the fraction of articles with “toward(s)” in the title is rising rapidly; it has more than tripled over the past two decades. There is thus no doubt that the use of “toward(s)” in article titles is a trend in biomedical publishing.
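For the record, the counting itself is straightforward once the titles have been pulled from Medline; here is a minimal sketch, assuming the records are already available as (year, title) pairs (the matching rule mirrors the description above, and the example titles are just for illustration):

```python
import re
from collections import Counter

# Match "toward(s)" at the start of a title or right after a colon,
# semicolon, or dash.
PATTERN = re.compile(r"(?:^|[:;–-])\s*towards?\b", re.IGNORECASE)

def toward_fraction(records):
    """Fraction of titles per year that match the pattern."""
    totals, hits = Counter(), Counter()
    for year, title in records:
        totals[year] += 1
        if PATTERN.search(title):
            hits[year] += 1
    return {year: hits[year] / totals[year] for year in sorted(totals)}

# Toy example: one real title and one made-up title.
print(toward_fraction([
    (2011, "Toward an interactive article: integrating journals and biological databases"),
    (2011, "A method that actually does something"),
]))  # {2011: 0.5}
```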

As is often the case with statistics, though, this analysis answers only one question but leads to several new ones. Are we increasingly selling our papers on what we hope to do soon rather than on what we have actually done? Or have we just become more honest by now adding the word “toward(s)” where we might have left it out in the past?

Analysis: Correlating the PLoS article level metrics

A few months ago, the Public Library of Science (PLoS) made available a spreadsheet with article level metrics. Although others have already analyzed these data (see posts by Mike Chelen), I decided to take a closer look myself.

The data set consists of 20 different article level metrics. However, some of these are very sparse and some are partially redundant. I thus decided to filter and merge them to create a reduced set of only 6 metrics (a sketch of the merging follows the list):

  1. Blog posts. This value is the sum of Blog Coverage – Postgenomic, Blog Coverage – Nature Blogs, and Blog Coverage – Bloglines. A single blog post may obviously be picked up by more than one of these resources and hence be counted more than once. Being unable to count unique blog posts referring to a publication, I decided to aim for maximal coverage by using the sum rather than relying on data from only a single resource.
  2. Bookmarks. This value is the sum of Social Bookmarking – CiteULike and Social Bookmarking – Connotea. One cannot rule out that a single user bookmarks the same publication in both CiteULike and Connotea, but I would assume that most people use one or the other for bookmarking.
  3. Citations. This value is the sum of Citations – CrossRef, Citations – PubMed Central, and Citations – Scopus. I decided to use the sum to be consistent with the other metrics, but a single citation may obviously be picked up by more than one of these resources.
  4. Downloads. This value is called Combined Usage (HTML + PDF + XML) in the original data set and is the sum of Total HTML Page Views, Total PDF Downloads, and Total XML Downloads. Again the sum is used to be consistent.
  5. Ratings. This value is called Number of Ratings in the original data set. Because of the small number of articles with ratings, notes, and comments, I decided to discard the related values Average Rating, Number of Note threads, Number of replies to Notes, Number of Comment threads, Number of replies to Comments, and Number of ‘Star Ratings’ that also include a text comment.
  6. Trackbacks. This value is called Number of Trackbacks in the original data set. I was greatly in doubt whether to merge this into the blog post metric, but in the end decided against doing so because trackbacks do not necessarily originate from blog posts.
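A rough sketch of how this merging could look in pandas; the column labels follow the names quoted in the list above, but the exact labels in the actual spreadsheet may differ, and the file name is only a placeholder:

```python
import pandas as pd

df = pd.read_csv("plos_article_level_metrics.csv")  # placeholder file name

# Collapse the original columns into the six reduced metrics described above.
reduced = pd.DataFrame({
    "blog_posts": df["Blog Coverage - Postgenomic"]
                + df["Blog Coverage - Nature Blogs"]
                + df["Blog Coverage - Bloglines"],
    "bookmarks":  df["Social Bookmarking - CiteULike"]
                + df["Social Bookmarking - Connotea"],
    "citations":  df["Citations - CrossRef"]
                + df["Citations - PubMed Central"]
                + df["Citations - Scopus"],
    "downloads":  df["Combined Usage (HTML + PDF + XML)"],
    "ratings":    df["Number of Ratings"],
    "trackbacks": df["Number of Trackbacks"],
})
```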

Calculating all pairwise correlations among these metrics is obviously trivial. However, one has to be careful when interpreting the correlations, as there are at least two major confounding factors. First, it is important to keep in mind that the PLoS article level metrics have been collected across several journals. Some of these are high-impact journals such as PLoS Biology and PLoS Medicine, whereas others are lower-impact journals such as PLoS ONE. One would expect that papers published in the former two journals will on average have higher values for most metrics than papers published in the latter. Similarly, papers published in journals with a web-savvy readership, e.g. PLoS Computational Biology, are more likely to receive blog posts and social bookmarks. Second, the age of a paper matters: both downloads and in particular citations accumulate over time. To correct for these confounding factors, I constructed a normalized set of article level metrics, in which each metric for a given article was divided by the average for articles published the same year in the same journal.

I next calculated all pairwise Pearson correlation coefficients among the reduced set of article level metrics. To see the effect of the normalization, I did this for both the raw and the normalized metrics. I visualized the correlation coefficients as a heat map, showing the results for the raw metrics above the diagonal and the results for the normalized metrics below the diagonal.
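A minimal sketch of both the normalization and the combined heat map, assuming a table like the `reduced` one above with additional "journal" and "year" columns; seaborn and matplotlib are assumed for plotting, and the file name is again a placeholder:

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

metrics = ["blog_posts", "bookmarks", "citations", "downloads", "ratings", "trackbacks"]
df = pd.read_csv("reduced_metrics.csv")  # placeholder: six metrics plus journal and year

# Normalize each metric by the mean for papers from the same journal and year.
group_means = df.groupby(["journal", "year"])[metrics].transform("mean")
normalized = df[metrics] / group_means

# Raw correlations go above the diagonal, normalized ones below it.
raw_corr = df[metrics].corr(method="pearson")
norm_corr = normalized.corr(method="pearson")
combined = np.triu(raw_corr.values, k=1) + np.tril(norm_corr.values, k=-1)
np.fill_diagonal(combined, 1.0)

sns.heatmap(pd.DataFrame(combined, index=metrics, columns=metrics),
            annot=True, vmin=-1, vmax=1, cmap="coolwarm")
plt.show()
```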

There are several interesting observations to be made from this figure:

  • Downloads correlate strongly with all the other metrics. This is hardly surprising, but it is reassuring to see that these correlations are not trivially explained by age and journal effects.
  • Apart from Downloads, Bookmarks is the metric that correlates most strongly with Citations. This makes good sense, since CiteULike and Connotea are commonly used as reference managers: if you add a paper to your bibliography database, you will likely cite it at some point.
  • Blog posts and Trackbacks correlate well with Downloads but poorly with Citations. This may reflect that blog posts about research papers are often targeted towards a broad audience; if most of the readers of the blog posts are laymen or researchers from other fields, they will be unlikely to cite the papers covered in the blog posts.
  • Ratings correlates fairly poorly with every other metric. Combined with the low number of ratings, this makes me wonder if the option to rate papers on the journal web sites is all that useful.

Finally, I will point out one additional metric that I would very much like to see added in future versions of this data set, namely microblogging. I personally discover many papers through others mentioning them on Twitter or FriendFeed. Because of the much smaller effort involved in microblogging a paper as opposed to writing a full blog post about it, I suspect that the number of tweets that link to a paper would be a very informative metric.

Edit: I made a mistake in the normalization program, which I have now corrected. I have updated the figure and the conclusions to reflect the changes. It should be noted that some comments to this post were made prior to this correction.