Monthly Archives: May 2008

Analysis: A democratic approach to identification of cell-cycle-regulated genes

Over the years, several microarray time-course experiments have been performed to identify the genes that are transcriptionally regulated during the mitotic cell cycle, i.e., the periodically expressed genes. Moreover, bioinformaticians have developed many different computational methods for identifying the periodically expressed genes from microarray time-course data.

Below is a list of the experimental and computational analyses of the budding yeast cell cycle that I am aware of (please notify me if you know of other microarray experiments or computational methods):

  1. Cho et al., Mol. Cell, 1998
  2. Spellman et al., Mol. Biol. Cell, 1998
  3. Zhao et al., Proc. Natl. Acad. Sci. USA, 2001
  4. Langmead et al., Proc. IEEE Comput. Soc. Bioinformatics Conf., 2002
  5. Langmead et al., RECOMB, 2002
  6. Langmead et al., J. Comput. Biol., 2003
  7. de Lichtenberg et al., J. Mol. Biol., 2003
  8. Johansson et al., Bioinformatics, 2003
  9. Wichert et al., Bioinformatics, 2004
  10. Lu et al., Nucleic Acids Res., 2004
  11. Luan and Li, Bioinformatics, 2004
  12. de Lichtenberg et al., Bioinformatics, 2005
  13. de Lichtenberg et al., Yeast, 2005
  14. Willbrand et al., Bioinformatics, 2005
  15. Ahdesmäki et al., BMC Bioinformatics, 2005
  16. Chen, BMC Bioinformatics, 2005
  17. Qiu et al., Conf. Proc. IEEE Eng. Med. Biol. Soc., 2005
  18. Qiu et al., Bioinformatics, 2006
  19. Andersson et al., BMC Bioinformatics, 2006
  20. Gan et al., Int. Conf. Pattern Recog., 2006
  21. Glynn et al., Bioinformatics, 2006
  22. Ahnert et al., Bioinformatics, 2006
  23. Lu et al., Bioinformatics, 2006
  24. Xu et al., LSS Comput. Syst. Bioinformatics Conf., 2006
  25. Pramila et al., Genes Dev., 2006
  26. Liew et al., BMC Bioinformatics, 2007
  27. Lu et al., Genome Biol., 2007
  28. Morton et al., Stat. Appl. Genet. Mol. Biol., 2007
  29. Rowicka et al., Proc. Natl. Acad. Sci. USA, 2007
  30. Gauthier et al., Nucleic Acids Res., 2008
  31. Orlando et al., Nature, 2008

These studies have reported a mixture of ranked and unranked lists of periodically expressed genes. By that I mean that some studies provided a list of genes sorted according to how periodic the expression profiles appear, whereas others simply provide a list of the genes deemed periodic. For the ranked lists, I first checked the publications to see if the authors suggested a cutoff for the number of periodically expressed genes, in which case I followed their recommendations. If the authors suggested multiple lists of varying confidence, I used the highest-confidence list. If no cutoff was proposed, I selected the top-300 genes if the list was based on a single time course and the top-500 genes if the list was based on three or more time courses. It should be noted that both of these cutoffs are on the conservative side since most studies propose 800 or more periodically expressed genes when combining multiple expression time courses.
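The cutoff rule described above can be written down as a small function. This is only a minimal sketch: the function name `truncate_ranked_list` and its parameters are hypothetical, and since the text does not say what was done for lists based on exactly two time courses, the sketch refuses to guess in that case.

```python
def truncate_ranked_list(ranked_genes, n_time_courses, author_cutoff=None):
    """Keep the top of a ranked gene list, following the rule in the text.

    ranked_genes   -- genes sorted from most to least periodic
    n_time_courses -- number of time courses the ranking was based on
    author_cutoff  -- cutoff recommended in the publication, if any
    """
    if author_cutoff is not None:
        # The authors' own recommendation takes precedence.
        cutoff = author_cutoff
    elif n_time_courses == 1:
        cutoff = 300
    elif n_time_courses >= 3:
        cutoff = 500
    else:
        # Assumption: the text does not cover this case, so fail loudly.
        raise ValueError("no default cutoff defined for this number of time courses")
    return ranked_genes[:cutoff]
```

Unranked lists need no such treatment and would simply be used as published.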

This meta-analysis resulted in a list of more than 4200 budding yeast genes that are periodically expressed according to at least one of the methods listed above; that is more than two-thirds of all genes encoded by the budding yeast genome!

To investigate further how such a large number of genes can have been proposed to be periodically expressed, I plotted how many of these genes are on how many of the lists of periodically expressed genes:

The histogram reveals that the majority of the over 4200 genes have been proposed by only one or two analyses. It seems reasonable to assume that the genes that have been proposed as periodically expressed by only one or a few methods are less likely to be correct than the ones that many methods agree on. Also, one could expect that taking the consensus of many methods would yield a more reliable answer than using just a single method.
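The counting behind such a histogram, and the vote-based ranking it suggests, can be sketched as follows. The function names and the toy gene names are placeholders for illustration, not the actual lists used in the analysis, and breaking ties alphabetically is an assumption made only to keep the output deterministic.

```python
from collections import Counter


def count_votes(gene_lists):
    """Count on how many lists each gene appears (one vote per list)."""
    votes = Counter()
    for genes in gene_lists:
        # A set guards against a gene being counted twice within one list.
        votes.update(set(genes))
    return votes


def vote_histogram(votes):
    """Map a number of votes to the number of genes receiving that many."""
    return Counter(votes.values())


def rank_by_votes(votes):
    """Sort genes by decreasing vote count; ties are broken by name."""
    return sorted(votes, key=lambda gene: (-votes[gene], gene))


# Toy input with placeholder gene names.
toy_lists = [
    ["geneA", "geneB", "geneC"],
    ["geneA", "geneB"],
    ["geneA", "geneD"],
]
toy_votes = count_votes(toy_lists)
print(vote_histogram(toy_votes))
print(rank_by_votes(toy_votes))
```

In the toy input, one gene is on all three lists and two genes are on only one, which is the same kind of skew as described above, just on a much smaller scale.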

To test these two hypotheses, I compared two different ways of identifying the periodically expressed genes:

  1. Ranking the genes based on a single scoring scheme that combines all the available experimental data (Gauthier et al., Nucleic Acids Res., 2008)
  2. Ranking the genes based on a vote among 30 different methods (not 31; the analysis by Orlando and coworkers was left out of the voting as this dataset is not included in

To benchmark the two methods, I compared the ranked lists to a set of target genes for cell-cycle transcription factors identified in genome-wide ChIP-on-chip experiments and plotted the fraction of these that were identified as a function of the number of genes proposed to be periodically expressed:

The plot confirms that genes proposed to be periodically expressed by multiple methods are more likely to be targets of cell-cycle transcription factors, and are hence more likely to truly be subject to transcriptional cell-cycle regulation. However, it also shows that the list obtained by voting among 30 methods is a bit worse than what is obtained by using the single best method.
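A benchmark curve of this kind can be computed along the following lines. This is a sketch under assumptions: `recovery_curve` is a hypothetical name, the ranked list is assumed to contain each gene only once, and the gene names in the example are placeholders rather than real ChIP-on-chip targets.

```python
def recovery_curve(ranked_genes, target_genes):
    """Fraction of target genes recovered within the top-k genes, for each k.

    ranked_genes -- genes sorted from most to least confident (no duplicates)
    target_genes -- benchmark set, e.g. transcription-factor target genes
    """
    targets = set(target_genes)
    if not targets:
        raise ValueError("the benchmark set must not be empty")
    found = 0
    curve = []
    for gene in ranked_genes:
        if gene in targets:
            found += 1
        # Entry k-1 holds the fraction recovered among the top k genes.
        curve.append(found / len(targets))
    return curve


# Toy example with placeholder names.
print(recovery_curve(["a", "b", "c", "d"], {"a", "c"}))  # → [0.5, 0.5, 1.0, 1.0]
```

Computing one such curve per ranked list and plotting them on the same axes gives a comparison in which the higher curve at a given number of genes is the better method.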

This result may come as a surprise to many since meta-servers that combine multiple prediction methods have in the past proven very successful for many other bioinformatics tasks. I suspect that the approach fails in this case for two reasons: first, many of the analyses included perform considerably worse than the best one, and second, most of the methods make use of only half of the available experimental data. It may thus be possible to obtain better results by selecting only a subset of the methods and rerunning each of them on all the available data. So far, however, dictatorship seems to work better than democracy for identification of periodically expressed genes.


Commentary: Does size matter?

I recently took a look at colonization of titles and found that the fraction of papers with colons in their titles is increasing steadily. Intuitively, one would thus expect that the average length of the titles has also increased. The plot below shows that this is indeed the case (note that the y-axis does not begin at zero):

The average title length has increased from 8.5 words in 1950 to 12.5 words in 2008. Strangely, the increase is almost perfectly linear except for a fluctuation in the early 60s – I have no idea why this is the case.

But is the title length of a paper important? I personally expected that papers with short, catchy titles would be cited more than papers with longer, more complex titles. Lacking citation information for individual publications, I thus calculated average title length for publications from each journal and correlated it with the ISI impact factor of the corresponding journal:

No correlation is observed between the impact factor of a journal and the average title length of the papers published therein. So we can conclude that – at least for titles of scientific papers – size does not matter.
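For readers who want to try this on their own data, the calculation can be sketched as below. The text does not say which correlation measure was used, so the Pearson coefficient here is an assumption, as are the function names and the `(journal, title)` input format; title length is counted in whitespace-separated words.

```python
import math
from collections import defaultdict


def mean_title_length_by_journal(papers):
    """Average title length in words for each journal.

    papers -- iterable of (journal, title) pairs
    """
    totals = defaultdict(lambda: [0, 0])  # journal -> [word count, paper count]
    for journal, title in papers:
        totals[journal][0] += len(title.split())
        totals[journal][1] += 1
    return {journal: words / count for journal, (words, count) in totals.items()}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (spread_x * spread_y)
```

Given a dictionary of impact factors keyed by journal, one would pair up the two values for every journal present in both and pass the resulting sequences to `pearson`; a coefficient close to zero corresponds to the absence of correlation reported above.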
