Tag Archives: visualization

Update: The BuzzCloud for 2009

It is that time of the year again: NCBI has rolled out the new PubMed baseline, and it is my pleasure to present the latest and greatest biomedical buzzwords. Here is the BuzzCloud 2009 (click for a larger interactive version):

In case you have no idea what a BuzzCloud is, it is a visualization of some of the trendiest words in PubMed. To make a long story short, the size of a word represents how many times it was mentioned in the past year, whereas the brightness represents how much it was mentioned in that year compared to the previous ten years. For more details, please refer to the original blog post.

The three largest words on the BuzzCloud 2009 are all reruns from earlier years: metagenomics and synthetic biology were both first seen on the BuzzCloud 2004, and click chemistry appeared in 2006. One can only conclude that these research areas continue to grow.

At the other end of the scale we have the small and bright words. These are the words that are rising most rapidly but have not appeared that many times in PubMed yet. Below are three selected examples that I think may be of particular interest to the readership of this blog.

  • Personal genomics. No surprise here except that I expected this word would have turned up much earlier considering the broad publicity of the 1000 Genomes Project and the Personal Genome Project.
  • Proteogenomics. Why we need a separate word for referring to the combination of proteomics and genomics is beyond me. There is even a paper on comparative proteogenomics published in Genome Research. One can only wonder when someone will compare metabolomics, proteomics, transcriptomics, and genomics data across environmental samples and coin the term comparative metametaboproteotranscriptogenomics.
  • Translational bioinformatics. Where bioinformatics meets clinical medicine (see blog post by Russ Altman). I think that bioinformaticians are indeed increasingly working on medically relevant data, which in my view is a good thing. It just makes me wonder what happened to medical informatics?

On a closing note, I am again pleasantly surprised how well the words picked up by a completely automated procedure fit with the ongoing activities in my lab. It is almost eerie.

Resource: Second Life Interactive Dendrogram Rezzer (SLIDR)

About half a year ago, I began experimenting with Second Life as a tool for virtual conferences (I should add that my experiences have since improved). However, I believe that imitating real life in a virtual world is not necessarily the best way to use the technology – it may be better to use virtual reality for doing the things that are difficult to do in the real world. A good example of this is Hiro’s Molecule Rezzer, which is one of the best known scientific tools in Second Life. It, and its much improved successor Orac, allows people to easily construct molecular models of small molecules in Second Life.

After speaking with several other researchers in Second Life who, like me, are interested in evolution, I set out to build a similar tool for visualization of phylogenetic trees. The result is SLIDR (Second Life Interactive Dendrogram Rezzer), which constructs a dendrogram object from a tree in Newick format. The first version of SLIDR can handle trees both with and without branch lengths; however, I have not yet implemented support for labels on internal nodes or for bootstrap values.
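
To give an idea of what reading a Newick tree involves, here is a minimal Python sketch of a parser for the subset described above: leaf names plus optional branch lengths. It is not the SLIDR code (SLIDR runs inside Second Life), and the function names `parse_newick` and `leaf_names` as well as the dictionary layout are assumptions made purely for illustration. Quoted labels and comments from the full Newick format are not handled.

```python
def parse_newick(text):
    """Parse a simple Newick string into nested dicts.

    Each node is {"name": str, "length": float or None, "children": list}.
    Hypothetical helper for illustration; not taken from SLIDR.
    """
    text = text.strip()
    if not text.endswith(";"):
        raise ValueError("a Newick tree must end with ';'")
    pos = 0

    def parse_node():
        nonlocal pos
        children = []
        if text[pos] == "(":
            pos += 1  # skip "("
            children.append(parse_node())
            while text[pos] == ",":
                pos += 1  # skip ","
                children.append(parse_node())
            if text[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1  # skip ")"
        # Optional node name (empty for unlabeled internal nodes).
        start = pos
        while text[pos] not in ",():;":
            pos += 1
        name = text[start:pos].strip()
        # Optional branch length after ":".
        length = None
        if text[pos] == ":":
            pos += 1
            start = pos
            while text[pos] not in ",():;":
                pos += 1
            length = float(text[start:pos])
        return {"name": name, "length": length, "children": children}

    root = parse_node()
    if text[pos] != ";":
        raise ValueError(f"unexpected character at position {pos}")
    return root


def leaf_names(node):
    """Return the leaf names of a parsed tree, left to right."""
    if not node["children"]:
        return [node["name"]]
    names = []
    for child in node["children"]:
        names.extend(leaf_names(child))
    return names


# Trees with and without branch lengths are both accepted.
with_lengths = parse_newick("((A:0.1,B:0.2):0.3,C:0.4);")
without_lengths = parse_newick("(A,B,(C,D));")
print(leaf_names(with_lengths))
print(leaf_names(without_lengths))
```

A recursive-descent parser like this maps naturally onto a dendrogram: every dictionary becomes a node, and the branch length, when present, sets how long the connecting edge is drawn.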

The picture below shows an example of a dendrogram that was automatically generated by SLIDR based on a Newick tree:

SLIDR closeup

There is a bit more to SLIDR than this, though. After the dendrogram has been built, it can be loaded with a photo and/or a sound for each of the leaf nodes. When you click on a node, the corresponding sound will be played and the photo will be shown on the associated screen (the white box in front of which I stand):

SLIDR posing

I plan to work with collaborators in Second Life to construct dendrograms for evolution of bats (including their echolocation sounds and photos of the animals) and for the fully sequenced Drosophila genomes. Please do not hesitate to contact me if you would like to use SLIDR on another project. I intend to make SLIDR available as open source software once I have implemented support for the full Newick format.


Resource: STRING v8.1

After months of hard work from the entire STRING team – thanks everyone – I am pleased to be able to say that STRING v8.1 has now been put into production. Here is a screen shot of the start page:

STRING 8.1 start page

This is a minor release of STRING, which means that the imported databases of microarray expression data, protein interactions, genetic interactions, and pathways as well as text-mining evidence have all been updated. We have also fixed a bug that affected the minority of bacteria that have multiple chromosomes.

Another notable feature of STRING v8.1 is the new interactive network viewer that is implemented in Adobe Flash:

STRING 8.1 network viewer

For further details please see the post on the official STRING/STITCH blog.


Update: The BuzzCloud for 2008

Yes, it is that time of the year again – we are now almost three weeks into 2009, most papers published in 2008 have hopefully made it into Medline, and it is time to reveal the words of 2008. In other words, I have updated the BuzzCloud resource and here is the result for 2008 (click on the image to go to the web resource):

BuzzCloud 2008

I am thrilled to see the outcome. Without any cheating or tweaking, several buzzwords related to proteomics make it on the list with “phosphoproteomics” and “quantitative phosphoproteomics” being the two most prominent of them. Nice for me to see considering that my new research group at the Novo Nordisk Foundation Center for Protein Research will focus heavily on improving and applying the NetworKIN and NetPhorest resources for analysis of phosphoproteomics data.

Commentary: Summarizing papers as word clouds

For use in presentations on literature mining, I did a back-of-the-envelope calculation of how much time I would be able to spend on each new biomedical paper that is published. Assuming that all papers were indexed in PubMed (which they are not) and that I could read papers 24 hours per day all year round (which I cannot), the result is that I could allocate approximately 50 seconds per paper. This nicely illustrates the point that no one can keep up with the complete biomedical literature.
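
The arithmetic behind this is simple enough to sketch in a few lines of Python. The post does not state the number of papers per year that went into the estimate, so the snippet works backwards from the 50-second figure; the function names are made up for illustration.

```python
# Seconds in a non-leap year of round-the-clock reading.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def seconds_per_paper(papers_per_year):
    """Reading time available per paper, given a yearly publication count."""
    return SECONDS_PER_YEAR / papers_per_year


def implied_papers_per_year(seconds_each):
    """Yearly publication count implied by a given time budget per paper."""
    return SECONDS_PER_YEAR / seconds_each


# The paper count is NOT given in the post; it is implied by the 50 seconds.
print(SECONDS_PER_YEAR)                    # 31536000
print(round(implied_papers_per_year(50)))  # 630720
```

In other words, 50 seconds per paper corresponds to roughly 630,000 new papers per year under the stated assumptions.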

When I discovered Wordle, which can turn any text into a beautiful word cloud, I thus wondered if this visualization method would be useful for summarizing a complete paper as a single figure. To test this, I extracted the complete text of three papers that I coauthored in the NAR database issue 2008. Submitting these to Wordle resulted in the three figures below (click for larger versions):

All in all, I think that Wordle does a pretty good job at capturing the essence of each paper: the first cloud shows that STITCH is a database of interactions between proteins and chemicals, the second cloud shows that NetworKIN is a database of predictions related to kinases and phosphorylation, and the third cloud shows that Cyclebase.org is a database of experiments on gene expression during the cell cycle. However, a paper describing a database might be easier to summarize than a typical research paper.

As a final test, I therefore submitted to Wordle the complete text from my paper “Evolution of Cell Cycle Control – Same molecular machines, different regulation”, which describes the somewhat complex concept of just-in-time assembly (click for larger version):

The result is rather less impressive than for the papers from the NAR database issue. Although the word cloud does contain a good selection of words, it fails to convey the main message. I think a large part of the problem is the splitting of multiwords; for example, “cell cycle” becomes two separate terms “cell” and “cycle”. Another problem is that words from different sections of the paper are mixed, which blurs the messages. These two issues could be solved by 1) detecting multiwords and considering them as single tokens, and 2) sorting the terms according to where in the paper they are mainly used.
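
The first of these two fixes can be sketched in Python as follows. This is only an illustration under a strong assumption: the multiwords are supplied as a hand-made list rather than detected automatically, and the function name `count_terms` is hypothetical. It has nothing to do with how Wordle itself works.

```python
import re
from collections import Counter


def count_terms(text, multiwords):
    """Count words, treating each listed multiword as a single token.

    `multiwords` is a caller-supplied list of phrases (an assumption of
    this sketch); detecting them automatically is a separate problem.
    """
    lowered = text.lower()
    # Handle longer phrases first so they win over phrases they contain.
    for phrase in sorted(multiwords, key=len, reverse=True):
        words = phrase.lower().split()
        pattern = r"\b" + r"\s+".join(re.escape(w) for w in words) + r"\b"
        token = "_".join(words)  # "cell cycle" becomes "cell_cycle"
        lowered = re.sub(pattern, lambda _match: token, lowered)
    # Hyphenated words such as "just-in-time" stay together as one token.
    return Counter(re.findall(r"[a-z0-9_-]+", lowered))


sample = "The cell cycle is regulated. Cell cycle genes; one cell."
counts = count_terms(sample, ["cell cycle"])
print(counts["cell_cycle"], counts["cell"], counts["cycle"])  # 2 1 0
```

The resulting counts could then be passed to a word cloud generator, so that “cell cycle” shows up as one large term instead of two unrelated ones.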


Resource: The BuzzCloud visualization of buzzwords

“Oh, you work on systems biology? So do I!”

New buzzwords to describe scientific disciplines and technologies seem to pop up every year. For the fun of it, I have developed a small web resource, BuzzClouds, that provides a visual overview of the latest buzzwords in biomedicine.

Without destroying your weekend with mathematical formulas, here is how the BuzzCloud selection and visualization method works:

  • A list of potential buzzwords is constructed by extracting all one- and two-word phrases ending in -ics, -ology, -omy, -phy, -chemistry, -medicine, or -sciences. These endings were selected to get buzzwords that correspond to scientific disciplines and technologies.
  • The potential buzzwords are ranked according to a score that takes into account their frequencies within the past year and within the preceding decade (for details see this review article). To get a high score, a buzzword must be both frequent and new. The top-50 buzzwords are included in the cloud.
  • The size of each buzzword is proportional to the logarithm of its frequency during the past year. Common buzzwords are thus large whereas rare buzzwords are small.
  • The brightness of each buzzword shows the frequency of the buzzword within the past year relative to the preceding decade. New buzzwords are thus bright whereas the older ones are darker.
  • Finally, each buzzword is assigned a tint that goes from yellow via white to cyan based on how often it occurs in scientific journals (yellow) as opposed to medical journals (cyan).
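
For the programmers among you, the candidate extraction and the size and brightness steps above can be sketched roughly as follows in Python. This is not the BuzzCloud code. The ranking score is left out because it is defined in the review article, the tint step is omitted, and the brightness formula (the past year's share of all mentions) is just one plausible reading of "relative to the preceding decade". The use of raw counts and all function names are likewise assumptions.

```python
import math
import re

# Endings listed in the first step above.
SUFFIXES = ("ics", "ology", "omy", "phy", "chemistry", "medicine", "sciences")


def candidate_phrases(text):
    """Return one- and two-word phrases whose last word has a listed ending.

    Simplification: two-word phrases may span sentence boundaries.
    """
    words = re.findall(r"[a-z]+", text.lower())
    phrases = []
    for i, word in enumerate(words):
        if word.endswith(SUFFIXES):
            phrases.append(word)
            if i > 0:
                phrases.append(words[i - 1] + " " + word)
    return phrases


def size_and_brightness(recent_count, decade_count):
    """Compute display values for one buzzword (recent_count must be > 0).

    size: logarithm of the count in the past year.
    brightness: ASSUMED formula, the past year's share of all mentions.
    """
    size = math.log(recent_count)
    brightness = recent_count / (recent_count + decade_count)
    return size, brightness


print(candidate_phrases("Synthetic biology and click chemistry."))
print(size_and_brightness(100, 300))
```

With these definitions, a word mentioned 100 times in the past year and 300 times in the preceding decade gets a brightness of 0.25, whereas a word that is entirely new approaches a brightness of 1.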

When run for the year 2007, the end result looks like this (BuzzClouds for other years are available from the web resource):

50 buzzwords identified based on Medline abstracts from 2007

I think the method does a pretty decent job despite the occasional mistakes such as nice technology and timely topics. In terms of scientific buzzwords, quantitative proteomics is booming, systems biology is still hot although it is getting a bit long in the tooth, and synthetic biology is rapidly gaining popularity. Nanotechnology also seems to be popular within the medical domain, giving rise to buzzwords like nanomedicine and nanotherapeutics.

Maybe I should write a buzzword-compliant, interdisciplinary grant application that combines click chemistry and synthetic biology to develop novel nanotherapeutics.
