Monthly Archives: June 2008

Commentary: Summarizing papers as word clouds

For use in presentations on literature mining, I did a back-of-the-envelope calculation of how much time I would be able to spend on each new biomedical paper that is published. Assuming that all papers were indexed in PubMed (which they are not) and that I could read papers 24 hours per day all year round (which I cannot), the result is that I could allocate approximately 50 seconds per paper. This nicely illustrates the point that no one can keep up with the complete biomedical literature.
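The arithmetic behind such an estimate is simple enough to sketch in a few lines of Python. The number of papers per year used below is a hypothetical value, picked only because it is consistent with roughly 50 seconds per paper; it is not a PubMed statistic quoted in this post.

```python
# Seconds available when reading 24 hours per day, 365 days per year.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def seconds_per_paper(papers_per_year: int) -> float:
    """Reading time per paper if every second of the year is spent reading."""
    return SECONDS_PER_YEAR / papers_per_year


# Hypothetical publication rate, for illustration only (an assumption).
ASSUMED_PAPERS_PER_YEAR = 630_000

print(f"{seconds_per_paper(ASSUMED_PAPERS_PER_YEAR):.0f} seconds per paper")
```

Plugging in a different yearly count shows how quickly the time per paper shrinks as the literature grows.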

When I discovered Wordle, which can turn any text into a beautiful word cloud, I thus wondered if this visualization method would be useful for summarizing a complete paper as a single figure. To test this, I extracted the complete text of three papers that I coauthored in the NAR database issue 2008. Submitting these to Wordle resulted in the three figures below:


All in all, I think that Wordle does a pretty good job at capturing the essence of each paper: the first cloud shows that STITCH is a database of interactions between proteins and chemicals, the second cloud shows that NetworKIN is a database of predictions related to kinases and phosphorylation, and the third cloud shows that Cyclebase.org is a database of experiments on gene expression during the cell cycle. However, a paper describing a database might be easier to summarize than a typical research paper.

As a final test, I therefore submitted to Wordle the complete text of my paper “Evolution of Cell Cycle Control – Same molecular machines, different regulation”, which describes the somewhat complex concept of just-in-time assembly:

The result is rather less impressive than for the papers from the NAR database issue. Although the word cloud does contain a good selection of words, it fails to convey the main message. I think a large part of the problem is the splitting of multiwords; for example, “cell cycle” becomes two separate terms “cell” and “cycle”. Another problem is that words from different sections of the paper are mixed, which blurs the messages. These two issues could be solved by 1) detecting multiwords and considering them as single tokens, and 2) sorting the terms according to where in the paper they are mainly used.
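The first of these two fixes can be sketched in Python. This is only a minimal illustration of the idea, not how Wordle works: it treats any adjacent word pair that occurs at least `min_count` times as a multiword, which is a crude assumption, and it does no stop-word filtering. The function names and the example text are made up for the sketch.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into words (hyphenated words stay whole)."""
    return re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())


def merge_multiwords(tokens, min_count=2):
    """Join adjacent word pairs seen at least min_count times into single tokens."""
    pair_counts = Counter(zip(tokens, tokens[1:]))
    frequent = {pair for pair, n in pair_counts.items() if n >= min_count}
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in frequent:
            merged.append(tokens[i] + " " + tokens[i + 1])
            i += 2  # skip the second word of the merged pair
        else:
            merged.append(tokens[i])
            i += 1
    return merged


# Made-up example text.
text = "The cell cycle is regulated. Cell cycle genes peak once per cell cycle."
counts = Counter(merge_multiwords(tokenize(text)))
print(counts["cell cycle"], counts["cell"], counts["cycle"])
```

With the pair merged, "cell cycle" is counted as one term and would be drawn as a single large entry in the cloud instead of two unrelated words.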

Analysis: Degradation signals correlate with protein half-life

Yesterday I blogged about how the protein half-life data from the O’Shea lab fit well with my earlier analyses of transcriptional regulation during the budding yeast cell cycle and with the just-in-time assembly hypothesis. However, I have now realized that the same data set can be used to test the validity of the sequence-based predictions of protein degradation signals that I relied on for the cell-cycle study.

To this end, I divided the budding yeast proteome into six groups: proteins with a D-box, proteins without a D-box, proteins with a KEN-box, proteins without a KEN-box, proteins with a PEST region, and proteins without a PEST region. For each of these six groups of proteins, I simply plotted the distribution of protein half-lives as a histogram:
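For illustration, the grouping step could look something like the Python sketch below. The regular expressions are assumed minimal consensus patterns (RxxL for the D-box, KEN for the KEN-box) and may well differ from the motif definitions used in the actual analysis; PEST regions need a scoring method rather than a simple pattern and are left out. The protein identifiers, sequences and half-lives are toy values.

```python
import re

# Assumed minimal consensus patterns; not necessarily the definitions
# used for the analysis described in this post.
MOTIFS = {
    "D-box": re.compile(r"R..L"),
    "KEN-box": re.compile(r"KEN"),
}


def split_by_motif(sequences, half_lives, pattern):
    """Return half-lives of proteins with and without a match to pattern."""
    with_motif, without_motif = [], []
    for protein, sequence in sequences.items():
        if protein not in half_lives:
            continue  # no half-life measurement for this protein
        group = with_motif if pattern.search(sequence) else without_motif
        group.append(half_lives[protein])
    return with_motif, without_motif


# Toy data: hypothetical identifiers, sequences and half-lives (minutes).
sequences = {"P1": "MKRAALGT", "P2": "MSKENQW", "P3": "MGGSTAA"}
half_lives = {"P1": 12.0, "P2": 30.0, "P3": 95.0}

for name, pattern in MOTIFS.items():
    print(name, split_by_motif(sequences, half_lives, pattern))
```

Each pair of lists can then be plotted as two histograms or passed to a statistical test.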

The figure shows that for all three degradation signals, proteins with the sequence motif tend to have shorter half-lives than proteins without the motif. These differences are all statistically significant according to the Mann-Whitney U test (D-box, P < 10⁻⁶; KEN-box, P < 0.02; PEST region, P < 10⁻¹⁵). It is noteworthy that the KEN-box motif gives a far weaker correlation with protein half-life than the two other degradation signals, as it was also the only degradation signal that did not correlate with transcriptional cell-cycle regulation in budding yeast (see supplementary information of Jensen et al., 2006).
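For readers unfamiliar with the Mann-Whitney U test, here is a minimal pure-Python sketch of how it compares two groups of half-lives. It uses the normal approximation without tie or continuity correction, so it is only meant to show the principle; a statistics package (for example `scipy.stats.mannwhitneyu`) would be the sensible choice for real data. The half-life values are invented.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def mann_whitney_u(x, y):
    """Two-sided test via the normal approximation (no tie/continuity correction)."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2  # U statistic for sample x
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u1 - mean_u) / sd_u
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided P-value
    return u1, p


# Invented half-lives (minutes) for proteins with and without a motif.
with_motif = [5, 8, 12, 15, 20, 22]
without_motif = [25, 40, 43, 60, 90, 110, 150, 300]

u, p = mann_whitney_u(with_motif, without_motif)
print(f"U = {u}, P = {p:.4f}")
```

Because the test works on ranks only, it makes no assumption about the shape of the half-life distributions, which is why it suits skewed data like these.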

In summary, proteins that contain putative degradation signals have significantly shorter half-lives than proteins that do not contain such signals. The only caveat is that long sequences are more likely to match the sequence motifs, and that O’Shea and colleagues found a negative correlation between sequence length and protein half-life. The correlations described here could thus be a secondary effect; however, it is also possible that the presence of degradation signals in long sequences is the missing explanation for their short half-lives.

Analysis: Cell-cycle-regulated genes encode short-lived proteins

For an analysis entirely different from the one I will describe here, I downloaded the protein half-life data for budding yeast that was published in PNAS by the O’Shea lab about two years ago:

Quantification of protein half-lives in the budding yeast proteome

A complete description of protein metabolism requires knowledge of the rates of protein production and destruction within cells. Using an epitope-tagged strain collection, we measured the half-life of >3,750 proteins in the yeast proteome after inhibition of translation. By integrating our data with previous measurements of protein and mRNA abundance and translation rate, we provide evidence that many proteins partition into one of two regimes for protein metabolism: one optimized for efficient production or a second optimized for regulatory efficiency. Incorporation of protein half-life information into a simple quantitative model for protein production improves our ability to predict steady-state protein abundance values. Analysis of a simple dynamic protein production model reveals a remarkable correlation between transcriptional regulation and protein half-life within some groups of coregulated genes, suggesting that cells coordinate these two processes to achieve uniform effects on protein abundances. Our experimental data and theoretical analysis underscore the importance of an integrative approach to the complex interplay between protein degradation, transcriptional regulation, and other determinants of protein metabolism.

The idea that transcriptional regulation goes hand-in-hand with protein degradation is fully consistent with the just-in-time assembly hypothesis. I thus examined the distributions of protein half-lives for dynamic (i.e. periodically expressed) and static (i.e. not periodically expressed) proteins:

The histogram suggests that dynamic proteins are shifted towards shorter half-lives relative to static proteins. The difference is indeed statistically significant according to the Mann-Whitney U test (P < 10⁻⁴). This result supports the sequence-based observation that dynamic proteins contain more D-box, KEN-box, and PEST degradation signals than static proteins.

I next tested if the half-life of the dynamic proteins varies during the cell cycle by making a scatter plot of the protein half-life as a function of the time of peak expression for the corresponding mRNA:

There appears to be no correlation. Together, these analyses indicate that dynamic proteins have shorter half-lives than static proteins, irrespective of when in the cell cycle they are expressed.
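The conclusion above rests on inspecting the scatter plot by eye. One way to put a number on it would be a Spearman rank correlation, sketched below in plain Python. This is an assumption about how one might quantify the result, not something I did for this post. Note also that peak time is cyclic (here assumed to be given as a percentage of the cell cycle), so a rank correlation can only detect a monotonic trend from the start to the end of the cycle. The data points are invented.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(a, b):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def spearman(a, b):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(a), average_ranks(b))


# Invented data: peak time (% of cell cycle) and half-life (minutes).
peak_time = [5, 20, 35, 50, 65, 80, 95]
half_life = [30, 12, 45, 20, 38, 15, 33]

print(f"Spearman rho = {spearman(peak_time, half_life):.2f}")
```

A coefficient close to zero would be in line with what the scatter plot suggests.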
