Monthly Archives: February 2008

Resource: The BuzzCloud visualization of buzzwords

“Oh, you work on systems biology? So do I!”

New buzzwords to describe scientific disciplines and technologies seem to pop up every year. For the fun of it, I have developed a small web resource, BuzzClouds, that provides a visual overview of the latest buzzwords in biomedicine.

Without destroying your weekend with mathematical formulas, here is how the BuzzCloud selection and visualization method works:

  • A list of potential buzzwords is constructed by extracting all one- and two-word phrases ending in -ics, -ology, -omy, -phy, -chemistry, -medicine, or -sciences. These endings were selected to get buzzwords that correspond to scientific disciplines and technologies.
  • The potential buzzwords are ranked according to a score that takes into account their frequencies within the past year and within the preceding decade (for details see this review article). To get a high score, a buzzword must be both frequent and new. The top-50 buzzwords are included in the cloud.
  • The size of each buzzword is proportional to the logarithm of its frequency during the past year. Common buzzwords are thus large whereas rare buzzwords are small.
  • The brightness of each buzzword shows the frequency of the buzzword within the past year relative to the preceding decade. New buzzwords are thus bright whereas the older ones are darker.
  • Finally, each buzzword is assigned a tint that goes from yellow via white to cyan based on how often it occurs in scientific journals (yellow) as opposed to medical journals (cyan).
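The extraction and sizing steps above can be sketched in a few lines of Python. This is only an illustration, not the BuzzCloud code: the function names, the `scale` parameter, and the brightness mapping (share of occurrences that fall in the past year) are my assumptions, and the ranking score is left out because its formula is not given here.

```python
import math
import re

# Word endings listed in the post.
SUFFIXES = ("ics", "ology", "omy", "phy", "chemistry", "medicine", "sciences")


def candidate_phrases(text):
    """Return one- and two-word phrases whose last word has a listed ending."""
    words = re.findall(r"[a-z][a-z-]*", text.lower())
    found = set()
    for i, word in enumerate(words):
        if word.endswith(SUFFIXES):
            found.add(word)
            if i > 0:
                # Two-word phrase: the preceding word plus the matching word.
                found.add(words[i - 1] + " " + word)
    return found


def font_size(recent_count, scale=10.0):
    """Size proportional to the logarithm of last year's frequency."""
    return scale * math.log(recent_count)


def brightness(recent_count, decade_count):
    """Assumed mapping: share of all occurrences that fall in the past year."""
    return recent_count / (recent_count + decade_count)


print(sorted(candidate_phrases("Quantitative proteomics and systems biology are hot.")))
```

A sketch this naive also picks up phrases such as "and biology" and pairs words across sentence boundaries, which may be one source of the odd entries that end up in the cloud.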

When run for the year 2007, the end result looks like this (BuzzClouds for other years are available from the web resource):

50 buzzwords identified based on Medline abstracts from 2007

I think the method does a pretty decent job despite the occasional mistakes such as nice technology and timely topics. In terms of scientific buzzwords, quantitative proteomics is booming, systems biology is still hot although it is getting a bit long in the tooth, and synthetic biology is rapidly gaining popularity. And nanotechnology seems to be popular within the medical domain, giving rise to buzzwords like nanomedicine and nanotherapeutics.

Maybe I should write a buzzword-compliant, interdisciplinary grant application that combines click chemistry and synthetic biology to develop novel nanotherapeutics.


Analysis: Cell-cycle phenotypes and regulation, part 2

I have previously blogged about the relationship between cell-cycle phenotypes and regulation in human as well as budding yeast. I was thus excited to see the new RNAi study on cell-cycle phenotypes by Rines and coworkers that was published in Genome Biology two days ago. The title of their paper is “Whole genome functional analysis identifies novel components required for mitotic spindle integrity in human cells”, and the abstract reads as follows:


The mitotic spindle is a complex mechanical apparatus required for accurate segregation of sister chromosomes during mitosis. We designed a genetic screen using automated microscopy to discover factors essential for mitotic progression. Using a RNAi library of 49,164 double-stranded (ds)RNAs targeting 23,835 human genes, we performed a loss-of-function screen looking for siRNAs that arrest cells in metaphase.


Here we report the identification of genes that when suppressed result in structural defects in the mitotic spindle leading to bent, twisted, monopolar or multipolar spindles and cause a cell cycle arrest. We further described a novel analysis methodology for large-scale RNAi datasets which relies upon supervised clustering of these genes based on gene ontology (GO), protein families, tissue expression and protein-protein interactions.


This approach was utilized to functionally classify the identified genes in discrete mitotic processes. We confirmed the identity for a subset of these genes and examined more closely their mechanical role in spindle architecture.

The screen identified a set of 226 genes that when suppressed lead to spindle-related cell-cycle phenotypes. Using the name-mapping files from STRING, I was able to map 175 of them to the set of genes used in my other cell-cycle analyses. The results presented below are all based on this set of 175 genes.

To my surprise, Rines and coworkers did not compare their results to the earlier phenotypic screen published by Mukherji et al. in PNAS. Since I had already mapped this dataset onto the same gene set, it was easy to make a comparison of the new phenotype data from Rines et al. and the eight phenotypic categories defined by Mukherji et al.:

Category  Description             Overlap  Significance
1         G1 small nuclear area   2/116    n.s.
2         G1                      2/117    n.s.
3         S                       1/61     n.s.
4         S + G2/M                4/59     P < 0.002; FDR < 1%
5         G2/M large nucleus      5/200    P < 0.019; FDR < 5%
6         G2/M                    4/259    n.s.
7         G2/M + endoduplication  1/52     n.s.
8         Cytokinesis             3/36     P < 0.003; FDR < 1%

The statistical significance of the overlap was assessed using Fisher’s exact test and the false discovery rate (FDR) was calculated using the Benjamini-Hochberg method. As can be seen, the agreement between the two studies is very poor. Nonetheless, it is reassuring that the largest overlap (>8%) is observed for category 8, since spindle defects should be expected to result in problems during cytokinesis.
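For readers unfamiliar with the Benjamini-Hochberg method, here is a minimal sketch of how the adjusted p-values are obtained. The input p-values are made up for illustration, since only upper bounds are listed in the table above, and the function name is my own.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values, one per phenotypic category.
example = [0.01, 0.04, 0.03, 0.005]
print(benjamini_hochberg(example))
```

A category is then reported at, say, FDR < 5% if its adjusted p-value is below 0.05.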

I also looked into the transcriptional and post-translational regulation of the 175 genes. The cell-cycle microarray study by Whitfield and coworkers covered 124 of the genes, 15 of which are periodically expressed (P < 0.002; Fisher’s exact test). Plotting the distribution of peak times for these genes confirms the observation by Rines et al. that the genes tend to be expressed around the G2/M transition and during M phase:

Peak time distributions for human genes identified by Rines et al. and Mukherji et al.

As should be expected, the peak-time distribution for the genes identified by Rines et al. is in agreement with the corresponding distributions for categories 4, 5, and 8 from Mukherji and coworkers.

Comparison with a set of 985 phosphoproteins identified in low-throughput studies (obtained from Phospho.ELM) shows that the protein products encoded by the 175 genes are preferentially phosphorylated (P < 0.001; Fisher’s exact test). This result is confirmed by comparisons with large mass-spectrometry studies (P < 0.03; Fisher’s exact test) and CDK substrates predicted by NetPhosK (P < 0.05; Fisher’s exact test).

Finally, I analyzed the protein products encoded by the 175 genes for degradation signals. 22 of them contain a strong D-box motif (P < 0.03; Fisher’s exact test) and 28 contain a KEN-box motif (P < 0.002; Fisher’s exact test). By contrast, the gene products identified by Rines et al. display no overrepresentation of PEST degradation signals. This makes sense since proteins with D-box and/or KEN-box motifs are polyubiquitinated by the anaphase-promoting complex (APC) during late M phase, which targets them for degradation by the proteasome.

In summary, Rines and coworkers have identified a set of genes that show weak but significant overlap with some of the phenotypic categories defined by Mukherji et al., with periodically expressed genes identified based on microarray data from Whitfield et al., with known and predicted phosphoproteins, and with predicted degradation signals. All of the results are consistent with the majority of the 175 genes functioning during G2/M and early M phase.


Analysis: Evolution of transcription-factor binding and cell-cycle-regulated transcription

Together with collaborators in Søren Brunak’s group, I have earlier published a comparative study on eukaryotic cell-cycle regulation. In the supplement and earlier papers, we presented benchmarks that documented the sensitivity with which periodically expressed genes can be identified based on microarray expression data. We thereby showed that the poor evolutionary conservation of transcriptional cell-cycle regulation is not an artifact of individual gene lists being unreliable.

However, there is a more direct test that we did not think of at the time, namely to check if the changes in periodic transcription agree with the binding of cell-cycle transcription factors in each organism. The first step is to select two organisms (organism 1 and organism 2) and extract two sets of genes: 1) cycling genes from organism 1 with non-cycling orthologs in organism 2 and 2) non-cycling genes from organism 1 with cycling orthologs in organism 2. Next, Fisher’s exact test is used to determine if targets of cell-cycle transcription factors are overrepresented in the first set relative to the second. This procedure is equivalent to the test for coevolution between transcriptional and post-translational regulation (see Jensen et al. (2006) for details).
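The procedure can be sketched as follows. This is a simplified illustration rather than the code used for the analysis: it assumes one-to-one orthologs stored in a dictionary, uses a one-sided Fisher's exact test written with the standard library, and all names and gene identifiers are hypothetical.

```python
from math import comb


def split_by_cycling(orthologs, cycling1, cycling2):
    """Return the two gene sets (in organism 1 identifiers) described above."""
    set1 = {g1 for g1, g2 in orthologs.items()
            if g1 in cycling1 and g2 not in cycling2}
    set2 = {g1 for g1, g2 in orthologs.items()
            if g1 not in cycling1 and g2 in cycling2}
    return set1, set2


def fisher_one_sided(a, b, c, d):
    """P(top-left cell >= a) for the 2x2 table [[a, b], [c, d]], margins fixed."""
    row1, col1, total = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    favourable = sum(comb(row1, x) * comb(total - row1, col1 - x)
                     for x in range(a, upper + 1))
    return favourable / comb(total, col1)


def target_enrichment(set1, set2, targets):
    """Test whether transcription-factor targets are enriched in set1 vs. set2."""
    a, b = len(set1 & targets), len(set1 - targets)
    c, d = len(set2 & targets), len(set2 - targets)
    return fisher_one_sided(a, b, c, d)


# Toy data: eight ortholog pairs, four cycling genes in each organism.
orthologs = {f"a{i}": f"b{i}" for i in range(1, 9)}
set1, set2 = split_by_cycling(orthologs,
                              cycling1={"a1", "a2", "a3", "a4"},
                              cycling2={"b5", "b6", "b7", "b8"})
print(target_enrichment(set1, set2, targets={"a1", "a2", "a3", "a5"}))
```

Whether the original tests were one-sided or two-sided is not stated here, so the p-values from this sketch need not match the reported ones.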

I used the procedure to perform all pairwise tests for Homo sapiens, Saccharomyces cerevisiae, Schizosaccharomyces pombe, and Arabidopsis thaliana. For each choice of organism 1, I used the same set of cell-cycle transcription-factor targets that was used for the original benchmarks. The table below summarizes the results of the statistical tests; the rows specify organism 1 and columns specify organism 2:

Organism 1     H. sapiens   S. cerevisiae   S. pombe     A. thaliana
H. sapiens     -            P < 10^-5       P < 10^-9    P < 10^-6
S. cerevisiae  P < 10^-8    -               P < 10^-7    P < 0.01
S. pombe       P < 10^-4    n.s.            -            P < 0.01
A. thaliana    P < 0.09     n.s.            P < 10^-4    -

For most of the pairwise organism comparisons, the expected coevolution of transcription factor binding and cell-cycle-regulated transcription is supported by the statistical test. Benjamini-Hochberg correction for multiple testing was thus not performed as it would change the p-values only marginally (by a factor of 4/3 to be exact). Apart from the S. pombe vs. S. cerevisiae comparison, the weak correlations all involve A. thaliana for which only very limited microarray expression data is available.

This analysis shows that the differences in cell-cycle-regulated transcription (as measured by microarrays) are consistent with the available data on transcription-factor binding. This provides direct evidence that the poor conservation of cell-cycle regulation observed between eukaryotes is due to genuine, biological differences.


Commentary: We apologize

Attila Chordash over at “PIMM – Partial immortalization” discovered that Proteomics has now changed the abstract of the infamous paper by Warda and Han to be an apology to its readership:

Proteomics apologizes

While I am pleased to see this public apology from the publisher, the retraction is still only based on “a substantial overlap of the content of this article with previously published articles in other journals”. That is a euphemism for “the authors copied four entire pages of text from sources that were not cited”. However, I am concerned that this apology – like the press release from Proteomics – ignores the central question: how did the manuscript make it through peer review?

I was a bit surprised to see an apology being published via PubMed, but a quick search revealed that Proteomics is far from the only journal to apologize to its readers in this way. In fact, a systematic count shows that the number of abstracts mentioning the words “apologise(s)” or “apologize(s)” has increased exponentially over the past decade (note the logarithmic scale):

Exponential increase in the number of apologies

The number shown for 2008 is an extrapolation based on the first six weeks; if the apologies keep coming at the current rate, there will be 32 by the end of the year. The line shows an exponential fit of the data points from 1999 to 2007. The doubling time for the number of apologies is just 3 years whereas the number of papers doubles only every 22 years. If these trends continue, there will be more apologies than papers published from the year 2067 and onwards. I apologize for the extrapolation.
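For anyone who wants to repeat this kind of extrapolation, the year in which two exponential curves cross follows directly from their current values and doubling times. The sketch below is an illustration only: the function name is my own, and the number of papers is a hypothetical placeholder rather than a value read off the fitted line, so the printed year should not be expected to match 2067.

```python
import math


def crossing_time(count_a, doubling_a, count_b, doubling_b):
    """Years until series A (the faster-growing one) catches up with series B.

    Solves count_a * 2**(t / doubling_a) == count_b * 2**(t / doubling_b) for t.
    """
    return math.log2(count_b / count_a) / (1 / doubling_a - 1 / doubling_b)


APOLOGIES_2008 = 32    # extrapolated count quoted in the post
PAPERS_2008 = 700_000  # hypothetical placeholder, not taken from the post

years = crossing_time(APOLOGIES_2008, 3, PAPERS_2008, 22)
print(f"The curves cross roughly {years:.0f} years after 2008.")
```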


Analysis: Cell-cycle phenotypes and regulation

In 2006 the Schultz lab at the Scripps Research Institute published a paper in PNAS called “Genome-wide functional analysis of human cell-cycle regulators”. The abstract reads:

Human cells have evolved complex signaling networks to coordinate the cell cycle. A detailed understanding of the global regulation of this fundamental process requires comprehensive identification of the genes and pathways involved in the various stages of cell-cycle progression. To this end, we report a genome-wide analysis of the human cell cycle, cell size, and proliferation by targeting >95% of the protein-coding genes in the human genome using small interfering RNAs (siRNAs). Analysis of >2 million images, acquired by quantitative fluorescence microscopy, showed that depletion of 1,152 genes strongly affected cell-cycle progression. These genes clustered into eight distinct phenotypic categories based on phase of arrest, nuclear area, and nuclear morphology. Phase-specific networks were built by interrogating knowledge-based and physical interaction databases with identified genes. Genome-wide analysis of cell-cycle regulators revealed a number of kinase, phosphatase, and proteolytic proteins and also suggests that processes thought to regulate G1-S phase progression like receptor-mediated signaling, nutrient status, and translation also play important roles in the regulation of G2/M phase transition. Moreover, 15 genes that are integral to TNF/NF-κB signaling were found to regulate G2/M, a previously unanticipated role for this pathway. These analyses provide systems-level insight into both known and novel genes as well as pathways that regulate cell-cycle progression, a number of which may provide new therapeutic approaches for the treatment of cancer.

I recently wrote a commentary about how phenotypes in yeast agree remarkably well with the just-in-time assembly hypothesis for cell-cycle regulation of protein complexes. I thus decided to also compare the dataset on cell-cycle phenotypes for human genes with the cell-cycle microarray expression data published in 2002 by Whitfield and coworkers.

Using the mapping files from the STRING database, I was able to automatically map 741 of the 1152 genes with cell-cycle phenotypes to the set of 12,097 genes for which we have cell-cycle microarray expression data. Of the 741 genes, 55 are among the set of 600 periodically expressed genes identified in a reanalysis of the data from Whitfield and coworkers. This is just shy of 50% more than what would be expected by random chance (P < 0.001; Fisher’s exact test).
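The "just shy of 50%" figure can be checked from the four counts given above. The sketch below computes the expected overlap under random sampling and a one-sided hypergeometric tail probability, which is one common way to run Fisher's exact test on such an overlap. Function and variable names are my own, and because the exact test variant behind the reported p-value is not stated, the p-value printed here may differ from it.

```python
from math import comb


def expected_overlap(background, periodic, phenotype):
    """Overlap expected by chance between two gene sets drawn from one background."""
    return phenotype * periodic / background


def hypergeom_tail(background, periodic, phenotype, overlap):
    """P(X >= overlap) when sampling `phenotype` genes without replacement."""
    upper = min(periodic, phenotype)
    favourable = sum(comb(periodic, x) * comb(background - periodic, phenotype - x)
                     for x in range(overlap, upper + 1))
    return favourable / comb(background, phenotype)


BACKGROUND, PERIODIC, PHENOTYPE, OVERLAP = 12097, 600, 741, 55

expected = expected_overlap(BACKGROUND, PERIODIC, PHENOTYPE)
print(f"expected overlap: {expected:.1f} genes")
print(f"fold enrichment:  {OVERLAP / expected:.2f}")
print(f"one-sided p:      {hypergeom_tail(BACKGROUND, PERIODIC, PHENOTYPE, OVERLAP):.2g}")
```

The expected overlap comes out at just under 37 genes, so the observed 55 correspond to a fold enrichment of just under 1.5.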

The authors divided the cell-cycle mutants into eight classes. Repeating the above analysis for each of these categories separately revealed that genes with phenotypes related to S-phase and cytokinesis were significantly overrepresented among the 600 periodically expressed genes (FDR < 0.05; Fisher’s exact test and Benjamini-Hochberg correction for multiple testing). The other categories did not yield statistically significant results.

To look at the temporal regulation of transcription in more detail, I plotted the distribution of peak times (the point in the cell cycle when a gene is maximally expressed) for the periodically expressed genes from each of the eight phenotypic categories:

Peak time distributions for human genes with cell-cycle-related phenotypes

For the periodically expressed genes that display a cell-cycle phenotype in the screen by Schultz and coworkers, the observed phenotypes agree with the time of peak expression. In particular, the genes with cytokinesis-related phenotypes are all expressed shortly before the time of cell division (cytokinesis). Most of the periodically expressed genes with phenotypes related to S phase are similarly expressed during S phase (roughly 50-70% into the cell cycle), and genes with phenotypes related to the G2/M transition also tend to be expressed during the appropriate phase of the cell cycle.

In summary, these results support the view that cell-cycle-regulated genes are expressed shortly before their time of action, despite the fact that regulation also takes place at the protein level. They also confirm that many genes with cell-cycle function are not subject to transcriptional cell-cycle regulation.


Update: Not treasure but buried

There is good news regarding the Warda and Han scandal. After numerous researchers including myself emailed the Editor in Chief of Proteomics, Michael J. Dunn, the paper is now listed as retracted. I am pleased to see that the editorial team of Proteomics has acted swiftly against plagiarism.

Edit: The last author of the paper, Jin Han, has written a reply to PZ Myers. According to the email, he has himself contacted the editorial office and requested that the paper be retracted. I am still looking forward to hearing an official explanation from Proteomics of how this paper got accepted in the first place.

Edit: Michael J. Dunn has emailed me a copy of the approved press release from Proteomics that announces the retraction of the paper by Warda and Han. The only explanation offered is that the paper made it through peer review due to “human error” – or in other words “someone did something wrong”. I would have been truly worried if a paper like this had been accepted without human error being involved. I hope that Proteomics will provide the scientific community with more details when they have completed the internal investigation of the incident.


Analysis: The law of diminishing returns

The law of diminishing returns is a well-known concept in economics. Highly simplified, it states that as you invest more, the overall return on investment increases at a declining rate. I wondered if this principle applies to biomedical research.

I thus wrote a small script to parse the Medline database and count for each year 1) the number of new papers published, 2) the number of authors that published at least one paper, and 3) the total number of (co-)authorships. The plot below shows the number of new papers and the number of active authors for each year since 1970:

Exponential growth in the number of papers and authors

Few scientists – if any – will be surprised to see that the rate of publication and the number of active publishing scientists have increased exponentially. However, it is slightly disconcerting that the number of active authors doubles every 17 years whereas the number of papers per year doubles only every 22 years.
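The three yearly counts are simple to tally once the records have been parsed. The sketch below is not the original script: it assumes the Medline records have already been reduced to hypothetical (year, author list) pairs, and it treats identical name strings as the same person, which real author names do not guarantee.

```python
from collections import defaultdict


def yearly_counts(records):
    """Return {year: (papers, active_authors, authorships)} from (year, authors) pairs."""
    papers = defaultdict(int)
    authors = defaultdict(set)
    authorships = defaultdict(int)
    for year, names in records:
        papers[year] += 1               # one new paper
        authors[year].update(names)     # distinct authors active that year
        authorships[year] += len(names) # every name on every paper counts
    return {year: (papers[year], len(authors[year]), authorships[year])
            for year in sorted(papers)}


# Toy records standing in for parsed Medline entries.
records = [
    (1970, ["Smith J", "Jones A"]),
    (1970, ["Smith J"]),
    (1971, ["Lee K", "Jones A", "Brown T"]),
]

for year, (n_papers, n_authors, n_authorships) in yearly_counts(records).items():
    print(year, n_papers, n_authors,
          n_authorships / n_papers,   # average coauthors per paper
          n_authorships / n_authors)  # average papers per active author
```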

To look deeper into this, I plotted as a function of time the average number of coauthors per paper and the average number of papers coauthored by each active author:

Exponential increase in the number of authorships per paper and per author

These two measures also appear to increase exponentially. However, the number of coauthors per paper is increasing considerably faster than the number of papers coauthored by each author per year. The estimated doubling times are 33 years for the number of coauthors per paper and 63 years for the number of papers coauthored. This suggests that the productivity of biomedical scientists, measured in terms of publications, has decreased.
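Doubling times like these can be estimated by fitting a straight line to the base-2 logarithm of the yearly values. The sketch below shows one way to do that with an ordinary least-squares fit; it is an illustration with made-up input, not the fitting code behind the plots. The second function combines two doubling times into the implied change of a ratio: with the 22-year figure for papers and the 17-year figure for active authors, it gives a factor of about 0.7 over 38 years (1970 to 2008, a span I chose for the example).

```python
import math


def doubling_time(years, values):
    """Doubling time from a least-squares line through log2(values) vs. years."""
    n = len(years)
    logs = [math.log2(v) for v in values]
    mean_x = sum(years) / n
    mean_y = sum(logs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, logs))
    sxx = sum((x - mean_x) ** 2 for x in years)
    # The slope sxy / sxx is in doublings per year; its inverse is years per doubling.
    return sxx / sxy


def ratio_change(span, doubling_numerator, doubling_denominator):
    """Factor by which numerator/denominator changes over `span` years."""
    return 2 ** (span / doubling_numerator - span / doubling_denominator)


# Made-up series that doubles every 3 years.
print(doubling_time([0, 3, 6], [10, 20, 40]))

# Papers (doubling every 22 years) per active author (doubling every 17 years).
print(ratio_change(38, 22, 17))
```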

A more direct way to show this is to plot the ratio between the number of papers published each year and the number of authors on them (note that the y-axis does not start at zero):

The productivity in terms of papers is decreasing

The fact is that the number of papers produced per researcher per year has dropped by roughly one third since 1970. However, there could be many reasons for this:

  • Have we simply become lazy?
  • Has the bar been raised for what is considered the Least Publishable Unit?
  • Are large collaborations less efficient than smaller projects?
  • Do we spend more time on bureaucracy and less time on science?
  • Or are we left with the hard questions because the easy ones have all been answered?

My guess is that the last three reasons all play important roles. What do you think?
