Posts Tagged ‘text mining’

Update: The BuzzCloud for 2008

January 19, 2009

Yes, it is that time of the year again – we are now almost three weeks into 2009, most papers published in 2008 have hopefully made it into Medline, and it is time to reveal the words of 2008. In other words, I have updated the BuzzCloud resource and here is the result for 2008 (click on the image to go to the web resource):

BuzzCloud 2008

I am thrilled to see the outcome. Without any cheating or tweaking, several buzzwords related to proteomics make it on the list with “phosphoproteomics” and “quantitative phosphoproteomics” being the two most prominent of them. Nice for me to see considering that my new research group at the Novo Nordisk Foundation Center for Protein Research will focus heavily on improving and applying the NetworKIN and NetPhorest resources for analysis of phosphoproteomics data.

Commentary: Summarizing papers as word clouds

June 27, 2008

For use in presentations on literature mining, I did a back-of-the-envelope calculation of how much time I would be able to spend on each new biomedical paper that is published. Assuming that all papers were indexed in PubMed (which they are not) and that I could read papers 24 hours per day all year round (which I cannot), the result is that I could allocate approximately 50 seconds per paper. This nicely illustrates the point that no one can keep up with the complete biomedical literature.
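The arithmetic is easy to check. The sketch below works backwards from the 50-second figure to the yearly number of papers it implies. The post does not state the paper count that went into the estimate, so the implied value is derived here and is not the original input.

```python
# Back-of-the-envelope check of the reading-time estimate.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # reading around the clock, non-leap year


def seconds_per_paper(papers_per_year):
    """Time available per paper if every second of the year is spent reading."""
    return SECONDS_PER_YEAR / papers_per_year


# About 50 seconds per paper implies this many papers per year:
implied_papers_per_year = SECONDS_PER_YEAR / 50

print(SECONDS_PER_YEAR)         # 31536000
print(implied_papers_per_year)  # 630720.0
```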

When I discovered Wordle, which can turn any text into a beautiful word cloud, I thus wondered if this visualization method would be useful for summarizing a complete paper as a single figure. To test this, I extracted the complete text of three papers that I coauthored in the NAR database issue 2008. Submitting these to Wordle resulted in the three figures below (click for larger versions):


All in all, I think that Wordle does a pretty good job at capturing the essence of each paper: the first cloud shows that STITCH is a database of interactions between proteins and chemicals, the second cloud shows that NetworKIN is a database of predictions related to kinases and phosphorylation, and the third cloud shows that Cyclebase.org is a database of experiments on gene expression during the cell cycle. However, a paper describing a database might be easier to summarize than a typical research paper.

As a final test, I therefore submitted the complete text from my paper “Evolution of Cell Cycle Control – Same molecular machines, different regulation”, which describes the somewhat complex concept of just-in-time assembly, to Wordle (click for larger version):

The result is rather less impressive than for the papers from the NAR database issue. Although the word cloud does contain a good selection of words, it fails to convey the main message. I think a large part of the problem is the splitting of multiwords; for example, “cell cycle” becomes two separate terms “cell” and “cycle”. Another problem is that words from different sections of the paper are mixed, which blurs the messages. These two issues could be solved by 1) detecting multiwords and considering them as single tokens, and 2) sorting the terms according to where in the paper they are mainly used.
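The first of these two fixes can be sketched in a few lines of Python. This is a simple frequency-threshold heuristic chosen for illustration, not something Wordle does: adjacent word pairs that recur at least `min_count` times are joined into a single token before counting. The function names and the threshold are assumptions, and a real version would also filter stop words so that pairs like "of the" are not merged.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def merge_multiwords(words, min_count=2):
    """Join adjacent word pairs that recur at least min_count times."""
    pair_counts = Counter(zip(words, words[1:]))
    frequent = {pair for pair, n in pair_counts.items() if n >= min_count}
    merged = []
    i = 0
    while i < len(words):
        # Greedy left-to-right merge: a word is used in at most one pair.
        if i + 1 < len(words) and (words[i], words[i + 1]) in frequent:
            merged.append(words[i] + " " + words[i + 1])
            i += 2
        else:
            merged.append(words[i])
            i += 1
    return merged


text = "The cell cycle is regulated. Cell cycle control differs between species."
counts = Counter(merge_multiwords(tokenize(text)))
```

With this toy input, "cell cycle" is counted twice as one term, and neither "cell" nor "cycle" appears on its own.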


Commentary: Does size matter?

May 6, 2008

I recently took a look at colonization of titles and found that the fraction of papers with colons in their titles is increasing steadily. Intuitively, one would thus expect that the average length of the titles has also increased. The plot below shows that this is indeed the case (note that the y-axis does not begin at zero):

The average title length has increased from 8.5 words in 1950 to 12.5 words in 2008. Strangely, the increase is almost perfectly linear except for a fluctuation in the early 60s – I have no idea why this is the case.

But is the title length of a paper important? I personally expected that papers with short, catchy titles would be cited more than papers with longer, more complex titles. Lacking citation information for individual publications, I thus calculated average title length for publications from each journal and correlated it with the ISI impact factor of the corresponding journal:

No correlation is observed between the impact factor of a journal and the average title length of the papers published therein. So we can conclude that – at least for titles of scientific papers – size does not matter.
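The per-journal comparison described above can be sketched as follows. The journal names, titles, and impact factors are made-up toy values, and the use of the Pearson coefficient is an assumption, since the post does not say which correlation measure was used.

```python
from collections import defaultdict
from math import sqrt


def mean_title_length(records):
    """Average title length in words per journal from (journal, title) pairs."""
    lengths = defaultdict(list)
    for journal, title in records:
        lengths[journal].append(len(title.split()))
    return {journal: sum(v) / len(v) for journal, v in lengths.items()}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up toy data, for illustration only.
records = [
    ("Journal A", "Short title"),
    ("Journal A", "A slightly longer title here"),
    ("Journal B", "One two three four"),
    ("Journal C", "Six words are in this title"),
]
impact_factor = {"Journal A": 3.0, "Journal B": 1.0, "Journal C": 2.0}

avg = mean_title_length(records)
journals = sorted(avg)
r = pearson([avg[j] for j in journals], [impact_factor[j] for j in journals])
```

A value of `r` close to zero across many journals would correspond to the absence of correlation reported here.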


Commentary: Colonization of titles

April 22, 2008

You have probably noticed that a high fraction of scientific papers have colons in their titles. Several people have written humorous commentaries on this. Although these authors clearly see the use of colons as a growing trend, they did not present hard evidence for the increase in the usage of colons in the titles of scientific publications.

Out of curiosity, I thus wrote a small script to count the fraction of papers in Medline that have colons in their titles for each of the past 25 years. The result is shown in the plot below (note that the y-axis does not start at zero):

The conclusion is very clear: the fraction of titles with colons has increased linearly from 15% to 24% over the past 20 years. One could object that this effect may be explained by the increase in apologies (which often have a title “Retraction: …”) or by the NAR special issues on databases and web servers (which contain hundreds of papers with titles such as “YADB: yet another database”). However, these add up to less than 2% of the papers with colonized titles and are thus insufficient to explain the observed increase of 9 percentage points.
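The counting step is simple enough to sketch. The version below is not the original script. It assumes the Medline records have already been reduced to hypothetical `(year, title)` pairs, and the toy records are made up.

```python
from collections import defaultdict


def colon_fraction_by_year(records):
    """Fraction of titles containing a colon, per publication year."""
    totals = defaultdict(int)
    with_colon = defaultdict(int)
    for year, title in records:
        totals[year] += 1
        if ":" in title:
            with_colon[year] += 1
    return {year: with_colon[year] / totals[year] for year in sorted(totals)}


# Made-up toy records, for illustration only.
records = [
    (2007, "Plain title"),
    (2007, "YADB: yet another database"),
    (2008, "Retraction: an example"),
]
fractions = colon_fraction_by_year(records)
```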


Resource: The BuzzCloud visualization of buzzwords

February 29, 2008

“Oh, you work on systems biology? So do I!”

New buzzwords to describe scientific disciplines and technologies seem to pop up every year. For the fun of it, I have developed a small web resource, BuzzClouds, that provides a visual overview of the latest buzzwords in biomedicine.

Without destroying your weekend with mathematical formulas, here is how the BuzzCloud selection and visualization method works:

  • A list of potential buzzwords is constructed by extracting all one- and two-word phrases ending in -ics, -ology, -omy, -phy, -chemistry, -medicine, or -sciences. These endings were selected to get buzzwords that correspond to scientific disciplines and technologies.
  • The potential buzzwords are ranked according to a score that takes into account their frequencies within the past year and within the preceding decade (for details see this review article). To get a high score, a buzzword must be both frequent and new. The top-50 buzzwords are included in the cloud.
  • The size of each buzzword is proportional to the logarithm of its frequency during the past year. Common buzzwords are thus large whereas rare buzzwords are small.
  • The brightness of each buzzword shows the frequency of the buzzword within the past year relative to the preceding decade. New buzzwords are thus bright whereas the older ones are darker.
  • Finally, each buzzword is assigned a tint that goes from yellow via white to cyan based on how often it occurs in scientific journals (yellow) as opposed to medical journals (cyan).
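The first and third steps above can be sketched in Python. This is only an illustration under stated assumptions, not the BuzzCloud implementation: the tokenization is naive (two-word phrases may span sentence boundaries), the `scale` parameter and the `+ 1` inside the logarithm are assumptions added so that a single occurrence still gets a nonzero size, and the ranking score is left out because its formula is given only in the review article.

```python
import math
import re
from collections import Counter

# Word endings listed in the first step.
SUFFIXES = ("ics", "ology", "omy", "phy", "chemistry", "medicine", "sciences")


def candidate_phrases(text):
    """Yield one- and two-word phrases whose last word has a listed ending."""
    words = re.findall(r"[a-z]+", text.lower())
    for i, word in enumerate(words):
        if word.endswith(SUFFIXES):
            yield word
            if i > 0:
                yield words[i - 1] + " " + word


def font_size(count, scale=10.0):
    """Size proportional to the logarithm of the past-year frequency.

    The + 1 offset is an assumption, not part of the described method.
    """
    return scale * math.log(count + 1)


abstracts = "Systems biology is hot. Quantitative proteomics is booming."
frequencies = Counter(candidate_phrases(abstracts))
sizes = {phrase: font_size(n) for phrase, n in frequencies.items()}
```

Because the suffix match is purely lexical, words such as "topics" also pass the filter, which is consistent with the occasional false positives in the cloud.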

When run for the year 2007, the end result looks like this (BuzzClouds for other years are available from the web resource):

50 buzzwords identified based on Medline abstracts from 2007

I think the method does a pretty decent job despite the occasional mistakes such as nice technology and timely topics. In terms of scientific buzzwords, quantitative proteomics is booming, systems biology is still hot although it is getting a bit long in the tooth, and synthetic biology is rapidly gaining popularity. And nanotechnology seems to be popular within the medical domain, giving rise to buzzwords like nanomedicine and nanotherapeutics.

Maybe I should write a buzzword-compliant, interdisciplinary grant application that combines click chemistry and synthetic biology to develop novel nanotherapeutics.


Commentary: We apologize

February 17, 2008

Attila Chordash over at “PIMM – Partial immortalization” discovered that Proteomics has now changed the abstract of the infamous paper by Warda and Han to be an apology to its readership:

Proteomics apologizes

While I am pleased to see this public apology from the publisher, the retraction is still only based on “a substantial overlap of the content of this article with previously published articles in other journals”. That is a euphemism for “the authors copied four entire pages of text from sources that were not cited”. However, I am concerned that this apology – like the press release from Proteomics – ignores the central question: how did the manuscript make it through peer review?

I was a bit surprised to see an apology being published via PubMed, but a quick search revealed that Proteomics is far from the only journal to apologize to their readers in this way. In fact, a systematic count shows that the number of abstracts mentioning the words “apologise(s)” or “apologize(s)” has increased exponentially over the past decade (note the logarithmic scale):

Exponential increase in the number of apologies

The number shown for 2008 is an extrapolation based on the first six weeks; if the apologies keep coming at the current rate, there will be 32 by the end of the year. The line shows an exponential fit of the data points from 1999 to 2007. The doubling time for the number of apologies is just 3 years whereas the number of papers doubles only every 22 years. If these trends continue, there will be more apologies than papers published from the year 2067 and onwards. I apologize for the extrapolation.
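For two exponentially growing series, the crossover point follows directly from the starting counts and the doubling times. The sketch below models each series as `count * 2 ** (t / doubling_time)` and solves for the time at which the two are equal. The function name is an assumption, and because the parameters of the fitted curves are not given in the post, the example uses toy numbers and does not try to reproduce the year 2067.

```python
import math


def crossover_years(count_a, doubling_a, count_b, doubling_b):
    """Years until series a, which doubles faster, catches up with series b.

    Setting count_a * 2**(t / doubling_a) equal to count_b * 2**(t / doubling_b)
    gives t = log2(count_b / count_a) / (1 / doubling_a - 1 / doubling_b).
    """
    if doubling_a >= doubling_b:
        raise ValueError("series a must double faster than series b")
    rate_gap = 1.0 / doubling_a - 1.0 / doubling_b
    return math.log2(count_b / count_a) / rate_gap


# Toy example: 1 item doubling every year catches up with
# 4 items doubling every 2 years after 4 years (16 items each).
years_needed = crossover_years(1, 1, 4, 2)
```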


Analysis: The law of diminishing returns

February 11, 2008

The law of diminishing returns is a well-known concept in economics. Highly simplified, it states that as you invest more, the overall return on investment increases at a declining rate. I wondered if this principle applies to biomedical research.

I thus wrote a small script to parse the Medline database and count for each year 1) the number of new papers published, 2) the number of authors that published at least one paper, and 3) the total number of (co-)authorships. The plot below shows the number of new papers and the number of active authors for each year since 1970:

Exponential growth in the number of papers and authors

Few scientists – if any – will be surprised to see that the rate of publication and the number of active publishing scientists have increased exponentially. However, it is slightly disconcerting that the number of active authors doubles every 17 years whereas the number of papers per year doubles only every 22 years.
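The three yearly counts can be sketched as follows. This is not the original script. It assumes the Medline records have already been parsed into hypothetical `(year, author_names)` pairs, and it treats identical name strings as the same person, which ignores the problem of author name disambiguation.

```python
from collections import defaultdict


def yearly_counts(records):
    """Count papers, distinct active authors, and authorships per year."""
    papers = defaultdict(int)
    authorships = defaultdict(int)
    authors = defaultdict(set)
    for year, names in records:
        papers[year] += 1
        authorships[year] += len(names)
        authors[year].update(names)
    return {
        year: {
            "papers": papers[year],
            "active_authors": len(authors[year]),
            "authorships": authorships[year],
        }
        for year in sorted(papers)
    }


# Made-up toy records, for illustration only.
records = [
    (1970, ["A", "B"]),
    (1970, ["A"]),
    (1971, ["B", "C", "D"]),
]
counts = yearly_counts(records)

# Derived measure: papers per active author in each year.
papers_per_author = {
    year: c["papers"] / c["active_authors"] for year, c in counts.items()
}
```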

To look deeper into this, I plotted as a function of time the average number of coauthors per paper and the average number of papers coauthored by each active author:

Exponential increase in the number of authorships per paper and per author

These two measures also appear to increase exponentially. However, the number of coauthors per paper is increasing considerably faster than the number of papers coauthored by each author per year. The estimated doubling times are 33 years for the number of coauthors per paper and 63 years for the number of papers coauthored. This suggests that the productivity of biomedical scientists, measured in terms of publications, has decreased.
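A doubling time can be estimated with a log-linear fit: regress the base-2 logarithm of the yearly values on the year and take the reciprocal of the slope. The sketch below uses ordinary least squares, which is an assumption, since the post does not say how its fits were made. The data points are toy values.

```python
import math


def doubling_time(years, values):
    """Estimate the doubling time from a least-squares fit of log2(value) on year."""
    logs = [math.log2(v) for v in values]
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(logs) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, logs)) / sum(
        (x - mean_x) ** 2 for x in years
    )
    # One doubling corresponds to an increase of 1 in log2 units.
    return 1.0 / slope


# Toy series that doubles every three years.
estimate = doubling_time([2000, 2003, 2006], [4, 8, 16])
```

A longer doubling time means slower growth, so of the two measures above, the one with the 33-year doubling time is the one growing faster.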

A more direct way to show this is to plot the ratio between the number of papers published each year and the number of authors on them (note that the y-axis does not start at zero):

The productivity in terms of papers is decreasing

The fact is that the number of papers produced per researcher per year has dropped by roughly one third since 1970. However, there could be many reasons for this:

  • Have we simply become lazy?
  • Has the bar been raised for what is considered the Least Publishable Unit?
  • Are large collaborations less efficient than smaller projects?
  • Do we spend more time on bureaucracy and less time on science?
  • Or are we left with the hard questions because the easy ones have all been answered?

My guess is that the last three reasons all play important roles. What do you think?
