The law of diminishing returns is a well-known concept in economics. Highly simplified, it states that as you invest more, the overall return on investment increases at a declining rate. I wondered if this principle applies to biomedical research.
I thus wrote a small script to parse the Medline database and count for each year 1) the number of new papers published, 2) the number of authors that published at least one paper, and 3) the total number of (co-)authorships. The plot below shows the number of new papers and the number of active authors for each year since 1970:
Few scientists – if any – will be surprised to see that the rate of publication and the number of active publishing scientists have increased exponentially. However, it is slightly disconcerting that the number of active authors doubles every 17 years, whereas the number of papers per year doubles only every 22 years.
To look deeper into this, I plotted as a function of time the average number of coauthors per paper and the average number of papers coauthored by each active author:
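For readers who want to try something similar, here is a minimal sketch of the two steps involved: tallying the three yearly counts and estimating a doubling time. It is not the script used for this post. It assumes the Medline records have already been reduced to hypothetical `(year, authors)` tuples, treats identical name strings as the same person, and uses a least-squares fit of log2(count) against year, which is only one of several ways to estimate a doubling time.

```python
import math
from collections import defaultdict


def yearly_counts(records):
    """Tally papers, distinct active authors and authorships per year.

    records: iterable of (year, list_of_author_names), one tuple per paper.
    Assumption: identical name strings refer to the same person.
    """
    papers = defaultdict(int)
    authorships = defaultdict(int)
    active = defaultdict(set)
    for year, authors in records:
        papers[year] += 1
        authorships[year] += len(authors)
        active[year].update(authors)
    return {
        year: {
            "papers": papers[year],
            "authors": len(active[year]),
            "authorships": authorships[year],
        }
        for year in sorted(papers)
    }


def doubling_time(years, values):
    """Fit log2(value) = a + year / T by least squares and return T."""
    logs = [math.log2(v) for v in values]
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(logs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, logs))
    sxx = sum((x - mean_x) ** 2 for x in years)
    return sxx / sxy  # reciprocal of the fitted slope


# Toy records, invented purely for illustration.
records = [
    (1970, ["Smith J", "Jones A"]),
    (1970, ["Smith J"]),
    (1971, ["Jones A", "Lee K", "Smith J"]),
]
counts = yearly_counts(records)

# Synthetic series that doubles every 22 years, to check the fit.
years = list(range(1970, 2009))
synthetic_papers = [1000 * 2 ** ((y - 1970) / 22) for y in years]
estimate = doubling_time(years, synthetic_papers)
```

On the synthetic series the fit recovers a doubling time of 22 years. Real yearly counts are noisier, so the estimate depends on the range of years included.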
These two measures also appear to increase exponentially. However, the number of coauthors per paper is increasing considerably faster than the number of papers coauthored by each author per year. The estimated doubling times are 33 years for the number of coauthors per paper and 63 years for the number of papers coauthored. This suggests that the productivity of biomedical scientists, measured in terms of publications, has decreased.
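The two pairs of doubling times can be checked against each other with a little arithmetic. The symbols are introduced here for illustration only: P(t) is the number of papers in year t, N(t) the number of active authors, A(t) the number of authorships, c = A/P the coauthors per paper, and p = A/N the papers per active author. Assuming each quantity follows a clean exponential:

```latex
\begin{align*}
c(t) &= \frac{A(t)}{P(t)} \propto 2^{t/33}, \qquad
p(t) = \frac{A(t)}{N(t)} \propto 2^{t/63}, \\
\frac{P(t)}{N(t)} &= \frac{p(t)}{c(t)}
  \propto 2^{\,t\left(\frac{1}{63} - \frac{1}{33}\right)}
  \approx 2^{-t/69.3}, \\
\frac{P(t)}{N(t)} &\propto \frac{2^{t/22}}{2^{t/17}}
  = 2^{\,t\left(\frac{1}{22} - \frac{1}{17}\right)}
  \approx 2^{-t/74.8}.
\end{align*}
```

Both routes give a halving time of roughly 70 to 75 years for the number of papers per active author, which corresponds to a factor of about 0.7 over four decades.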
A more direct way to show this is to plot the ratio between the number of papers published each year and the number of authors on them (note that the y-axis does not start at zero):

The fact is that the number of papers produced per researcher per year has dropped by roughly one third since 1970. However, there could be many reasons for this:
- Have we simply become lazy?
- Has the bar been raised for what is considered the Least Publishable Unit?
- Are large collaborations less efficient than smaller projects?
- Do we spend more time on bureaucracy and less time on science?
- Or are we left with the hard questions because the easy ones have all been answered?
My guess is that the last three reasons all play important roles. What do you think?
Commentary: Summarizing papers as word clouds
June 27, 2008
For use in presentations on literature mining, I did a back-of-the-envelope calculation of how much time I would be able to spend on each new biomedical paper that is published. Assuming that all papers were indexed in PubMed (which they are not) and that I could read papers 24 hours per day all year round (which I cannot), the result is that I could allocate approximately 50 seconds per paper. This nicely illustrates the point that no one can keep up with the complete biomedical literature.
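The arithmetic is easy to reproduce. The sketch below works backwards from the 50-second figure to the number of new papers per year that it implies. The post does not state the paper count that was actually used, so the result is an inference rather than a quoted number, and the function names are made up for this example.

```python
# Reading around the clock for a 365-day year.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 seconds


def seconds_per_paper(papers_per_year):
    """Time available per paper if every second of the year is spent reading."""
    return SECONDS_PER_YEAR / papers_per_year


def implied_papers_per_year(seconds_each):
    """Number of papers per year that leaves `seconds_each` seconds for each one."""
    return SECONDS_PER_YEAR / seconds_each


print(implied_papers_per_year(50))  # 630720.0
```

In other words, 50 seconds per paper corresponds to somewhat more than 600,000 new papers per year.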
When I discovered Wordle, which can turn any text into a beautiful word cloud, I thus wondered if this visualization method would be useful for summarizing a complete paper as a single figure. To test this, I extracted the complete text of three papers that I coauthored in the NAR database issue 2008. Submitting these to Wordle resulted in the three figures below (click for larger versions):



All in all, I think that Wordle does a pretty good job at capturing the essence of each paper: the first cloud shows that STITCH is a database of interactions between proteins and chemicals, the second cloud shows that NetworKIN is a database of predictions related to kinases and phosphorylation, and the third cloud shows that Cyclebase.org is a database of experiments on gene expression during the cell cycle. However, a paper describing a database might be easier to summarize than a typical research paper.
As a final test, I therefore submitted to Wordle the complete text of my paper “Evolution of Cell Cycle Control – Same molecular machines, different regulation”, which describes the somewhat complex concept of just-in-time assembly (click for larger version):

The result is rather less impressive than for the papers from the NAR database issue. Although the word cloud does contain a good selection of words, it fails to convey the main message. I think a large part of the problem is the splitting of multiwords; for example, “cell cycle” becomes two separate terms “cell” and “cycle”. Another problem is that words from different sections of the paper are mixed, which blurs the messages. These two issues could be solved by 1) detecting multiwords and considering them as single tokens, and 2) sorting the terms according to where in the paper they are mainly used.
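A rough sketch of what these two fixes could look like is shown below. It is not something Wordle does, and every name in it is hypothetical. Multiwords are detected with a deliberately crude rule (adjacent word pairs that occur at least `min_count` times and contain no stop word), and each term is assigned to the section that uses it most. The stop-word list is a tiny placeholder.

```python
import re
from collections import Counter

# Placeholder stop-word list; a real one would be much longer.
STOPWORDS = {"the", "is", "of", "a", "an", "and", "in", "to", "are"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def merge_multiwords(tokens, frequent_pairs):
    """Join adjacent words that form a frequent pair into a single token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in frequent_pairs:
            merged.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def terms_by_section(sections, min_count=2):
    """Group terms under the section in which they occur most often.

    sections: dict mapping section name -> section text.
    Returns a dict mapping section name -> list of (term, total_count).
    """
    tokens = {name: tokenize(text) for name, text in sections.items()}

    # 1) Detect multiwords from pair counts pooled over the whole paper.
    pair_counts = Counter()
    for toks in tokens.values():
        pair_counts.update(zip(toks, toks[1:]))
    frequent_pairs = {
        pair for pair, n in pair_counts.items()
        if n >= min_count and not (set(pair) & STOPWORDS)
    }

    # Count terms per section after merging, skipping stop words.
    counts = {
        name: Counter(
            t for t in merge_multiwords(toks, frequent_pairs)
            if t not in STOPWORDS
        )
        for name, toks in tokens.items()
    }

    # 2) Assign each term to the section where it is used most.
    totals = Counter()
    for section_counts in counts.values():
        totals.update(section_counts)
    grouped = {name: [] for name in sections}
    for term, total in totals.most_common():
        home = max(counts, key=lambda name: counts[name][term])
        grouped[home].append((term, total))
    return grouped


# Toy input, invented for illustration.
sections = {
    "introduction": "The cell cycle is regulated. The cell cycle is conserved.",
    "results": "Kinases drive assembly. Kinases regulate assembly of complexes.",
}
grouped = terms_by_section(sections)
```

With this toy input, “cell cycle” is kept as the single token `cell_cycle` under the introduction, while “kinases” and “assembly” end up under the results. A frequency threshold is a crude stand-in for proper multiword detection, but it illustrates the idea.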
Tags: text mining, visualization