Commentary: Summarizing papers as word clouds

For use in presentations on literature mining, I did a back-of-the-envelope calculation of how much time I would be able to spend on each new biomedical paper that is published. Assuming that all papers were indexed in PubMed (which they are not) and that I could read papers 24 hours per day all year round (which I cannot), the result is that I could allocate approximately 50 seconds per paper. This nicely illustrates the point that no one can keep up with the complete biomedical literature.
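The arithmetic is easy to reproduce. The short sketch below only uses the 50-second figure from the text; the number of papers per year it prints is derived from that figure under the same assumptions (reading around the clock, 365 days a year) and is not an official PubMed count. The function names are made up for illustration.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # reading 24 hours per day, all year round


def seconds_per_paper(papers_per_year):
    """Time available per paper if every paper gets an equal share of the year."""
    return SECONDS_PER_YEAR / papers_per_year


def papers_per_year_for(seconds_each):
    """Inverse calculation: how many papers per year a given time budget implies."""
    return SECONDS_PER_YEAR / seconds_each


# Roughly 50 seconds per paper corresponds to about 630,000 new papers per year.
print(round(papers_per_year_for(50)))  # 630720
```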

When I discovered Wordle, which can turn any text into a beautiful word cloud, I thus wondered if this visualization method would be useful for summarizing a complete paper as a single figure. To test this, I extracted the complete text of three papers that I coauthored in the NAR database issue 2008. Submitting these to Wordle resulted in the three figures below:

All in all, I think that Wordle does a pretty good job of capturing the essence of each paper: the first cloud shows that STITCH is a database of interactions between proteins and chemicals, the second cloud shows that NetworKIN is a database of predictions related to kinases and phosphorylation, and the third cloud shows a database of experiments on gene expression during the cell cycle. However, a paper describing a database might be easier to summarize than a typical research paper.

As a final test, I therefore submitted to Wordle the complete text of my paper “Evolution of Cell Cycle Control – Same molecular machines, different regulation”, which describes the somewhat complex concept of just-in-time assembly:

The result is rather less impressive than for the papers from the NAR database issue. Although the word cloud does contain a good selection of words, it fails to convey the main message. I think a large part of the problem is the splitting of multiwords; for example, “cell cycle” becomes the two separate terms “cell” and “cycle”. Another problem is that words from different sections of the paper are mixed, which blurs the message. These two issues could be solved by 1) detecting multiwords and treating them as single tokens, and 2) sorting the terms according to where in the paper they are mainly used.
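To make the two proposed fixes concrete, here is a minimal Python sketch. It is not how Wordle works; it is one simple way the preprocessing could be done before handing terms to any word cloud tool. The assumptions are mine: a multiword is approximated as an adjacent pair of non-stop-words that recurs at least `min_count` times, a term is assigned to the section where it occurs most often, and the stop-word list, function names, and example text are invented for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real one would be much longer.
STOPWORDS = {"the", "of", "a", "an", "and", "in", "is", "to", "that", "are", "by", "for", "on"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def frequent_pairs(token_lists, min_count=2):
    """Find adjacent pairs of non-stop-words that recur across the paper."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(
            (a, b) for a, b in zip(tokens, tokens[1:])
            if a not in STOPWORDS and b not in STOPWORDS
        )
    return {pair for pair, n in counts.items() if n >= min_count}


def merge_pairs(tokens, pairs):
    """Fix 1: replace each detected pair with a single token such as 'cell cycle'."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in pairs:
            merged.append(f"{tokens[i]} {tokens[i + 1]}")
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def terms_by_section(sections, min_count=2):
    """Fix 2: group terms under the section in which they are mainly used.

    `sections` maps a section name to its text. The result maps each section
    name to a Counter of the terms assigned to it, weighted by their total
    count in the whole paper.
    """
    tokenized = {name: tokenize(text) for name, text in sections.items()}
    pairs = frequent_pairs(tokenized.values(), min_count)
    counts = {
        name: Counter(t for t in merge_pairs(tokens, pairs) if t not in STOPWORDS)
        for name, tokens in tokenized.items()
    }
    grouped = {name: Counter() for name in sections}
    for term in set().union(*counts.values()):
        home = max(counts, key=lambda name: counts[name][term])
        grouped[home][term] = sum(c[term] for c in counts.values())
    return grouped


# Made-up two-section example.
sections = {
    "Introduction": "The cell cycle is controlled by kinases. The cell cycle is conserved.",
    "Results": "Regulation of complexes differs. Regulation evolves. The cell cycle genes differ.",
}
for name, terms in terms_by_section(sections).items():
    print(name, terms.most_common(3))
```

Each per-section counter could then be drawn as its own cloud, or as one region or color within a single cloud, so that terms from the introduction and the results no longer blur together.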


10 thoughts on “Commentary: Summarizing papers as word clouds”

  1. spitshine

    You could add stemming too; “proteins” and “protein” should be the same, shouldn’t they? Actually, the third paper comes out quite OK from my point of view.

  2. Lars Juhl Jensen Post author

    Good idea! The word clouds could be simplified by stemming the words and possibly also by merging orthographic variants (e.g. “cell-cycle” and “cell cycle”).

  3. pedrobeltrao

    Did you find a way to retrieve the images automatically? I tried using it to represent authors or labs: take the last few abstracts from a group to visualize recurring themes. It could be nice to put on the lab webpage, but I read somewhere that the code was not available and I did not find a way to get it automatically.

  4. i9606

    I was also impressed by Wordle; it’s very pretty. Here is one of my papers tagified. Our lab built a few, now somewhat aging, tools for generating tag clouds for things like query responses to pubmed and connotea and information about proteins in ihop. Perhaps you will find them entertaining.

  5. dvizard

    I don’t share your enthusiasm. I, for example, not knowing STITCH and NetworKIN, could not infer anything from the 2 wordles other than the general theme they are about. For you, who already knows what the paper is about, it is of course obvious that stitch IS A database OF interactions BETWEEN proteins AND chemicals, but the fact that wordle does not encode semantic relations between the words narrows its scope and usefulness. I don’t see the big difference or advantage of Wordle over simple keywords or tags, where additionally the author can make sure he is not misinterpreted by an algorithmic analysis of his work.

    For example, reading the first tag cloud I thought STITCH could have been a novel protein, whose interactions with different chemicals were examined in the paper.

    Just to put the general euphoria into relation.

  6. Lars Juhl Jensen Post author

    dvizard, I think that you may be reading a bit more enthusiasm into my post than I intended. Me being euphoric is normally not accompanied by phrases such as “does a pretty good job” and “the result is rather less impressive” ;-)

    I fully acknowledge your point, though, that it is obviously easier to interpret the word cloud for a paper when you have already read the paper – not to mention when you have written it.

    We also fully agree that looking at a word cloud is no substitute for reading a paper. However, I think that with improvements in the visualization, it could be an alternative to rapidly looking through a pile of abstracts. My question is rather what can convey most of the content of a paper in 10 seconds: a word cloud or an abstract?

