Commentary: Does size matter?

May 6, 2008

I recently took a look at colonization of titles and found that the fraction of papers with colons in their titles is increasing steadily. Intuitively, one would thus expect that the average length of the titles has also increased. The plot below shows that this is indeed the case (note that the y-axis does not begin at zero):

The average title length has increased from 8.5 words in 1950 to 12.5 words in 2008. Strangely, the increase is almost perfectly linear except for a fluctuation in the early 60s - I have no idea why this is the case.

But is the title length of a paper important? I personally expected that papers with short, catchy titles would be cited more than papers with longer, more complex titles. Lacking citation information for individual publications, I thus calculated average title length for publications from each journal and correlated it with the ISI impact factor of the corresponding journal:

No correlation is observed between the impact factor of a journal and the average title length of the papers published therein. So we can conclude that - at least for titles of scientific papers - size does not matter.
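The per-journal comparison described above can be sketched in a few lines of Python. This is only an illustration: the input format (pairs of journal name and title, plus a dictionary of impact factors), the function names, and the choice of the Pearson coefficient are all assumptions, since the post does not say which correlation measure was used.

```python
from collections import defaultdict
from math import sqrt


def mean_title_length(records):
    """Average number of words per title for each journal.

    `records` is an iterable of (journal, title) pairs; this input
    format is an assumption made for the sketch.
    """
    totals = defaultdict(lambda: [0, 0])  # journal -> [words, papers]
    for journal, title in records:
        totals[journal][0] += len(title.split())
        totals[journal][1] += 1
    return {j: words / papers for j, (words, papers) in totals.items()}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def title_length_vs_impact(records, impact_factors):
    """Correlate mean title length with impact factor across journals.

    Only journals present in both inputs are used.
    """
    lengths = mean_title_length(records)
    journals = sorted(j for j in lengths if j in impact_factors)
    return pearson([lengths[j] for j in journals],
                   [impact_factors[j] for j in journals])
```

A coefficient close to zero would correspond to the "no correlation" conclusion; a rank-based measure such as Spearman's would be a reasonable alternative given how skewed impact factors are.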


Commentary: Colonization of titles

April 22, 2008

You have probably noticed that a high fraction of scientific papers have colons in their titles. Several people have written humorous commentaries on this. Although these authors clearly see the use of colons as a growing trend, they did not present hard evidence for the increase in the usage of colons in the titles of scientific publications.

Out of curiosity, I thus wrote a small script to count the fraction of papers in Medline that have colons in their titles for each of the past 25 years. The result is shown in the plot below (note that the y-axis does not start at zero):

The conclusion is very clear: the fraction of titles with colons has increased linearly from 15% to 24% over the past 20 years. One could object that this effect may be explained by the increase in apologies (which often have a title “Retraction: …”) or by the NAR special issues on databases and web servers (which contain hundreds of papers with titles such as “YADB: yet another database”). However, these add up to less than 2% of the papers with colonized titles and are thus insufficient to explain the observed increase of 9 percentage points.
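A counting script of the kind mentioned above could look roughly like the following Python sketch. The input format (pairs of publication year and title) and the function name are assumptions for illustration; parsing the actual Medline records is left out.

```python
from collections import Counter


def colon_fraction_by_year(records):
    """Fraction of titles containing a colon, per publication year.

    `records` is an iterable of (year, title) pairs; how these are
    extracted from Medline is outside the scope of this sketch.
    """
    total = Counter()
    with_colon = Counter()
    for year, title in records:
        total[year] += 1
        if ":" in title:
            with_colon[year] += 1
    return {year: with_colon[year] / total[year] for year in sorted(total)}
```

The same loop could be extended with a check such as `title.startswith("Retraction:")` to measure how much special cases like retraction notices contribute.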


Analysis: Cell-cycle expression of cancer genes

April 15, 2008

I have long used a data integration approach to obtain a global picture of eukaryotic cell-cycle regulation. The cell cycle is a popular research topic in part because of its importance for cancer research. I thus recently compared microarray expression data on the human cell cycle to genes with mutations that have been causally implicated in various forms of cancer.

From the Cancer Genome Project website, I downloaded a list of 353 human genes that are implicated in cancer. Using the identifier mapping file from STRING, I was able to automatically map 338 of these genes to the set of human genes from Ensembl that I used in earlier cell-cycle studies. 295 of the 338 genes were present on the microarrays used in the cell-cycle expression study by Whitfield et al. (2002). However, only 23 of these are among the 600 periodically expressed genes identified in the reanalysis by Jensen et al. (2006). The many numbers are illustrated in the diagram below:

By random chance, 295*600/12097 ≈ 15 of the 295 genes would be expected to be periodically expressed, and the enrichment is thus only a bit over 1.5-fold. Although this enrichment is statistically significant (P < 3%, Fisher’s exact test), the correlation is clearly not strong enough to allow prediction of novel cancer genes.
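The arithmetic above can be reproduced with the Python standard library. The sketch below computes the expected overlap, the fold enrichment, and a one-sided P-value as the upper tail of the hypergeometric distribution, which is what a one-sided Fisher's exact test evaluates. Whether the original test was one- or two-sided is not stated, so the one-sided choice is an assumption, as are the function names.

```python
from math import comb


def expected_overlap(n_drawn, n_marked, n_total):
    """Expected number of marked items among n_drawn random draws."""
    return n_drawn * n_marked / n_total


def hypergeom_upper_tail(k, n_drawn, n_marked, n_total):
    """P(X >= k) for a hypergeometric X, i.e. the one-sided
    Fisher's exact test P-value for enrichment."""
    upper = min(n_drawn, n_marked)
    numerator = sum(
        comb(n_marked, i) * comb(n_total - n_marked, n_drawn - i)
        for i in range(k, upper + 1)
    )
    return numerator / comb(n_total, n_drawn)


# Numbers taken from the text: 295 cancer genes on the arrays,
# 600 periodic genes, 12097 genes in total, 23 observed in both sets.
expected = expected_overlap(295, 600, 12097)
fold = 23 / expected
p_value = hypergeom_upper_tail(23, 295, 600, 12097)
```

Exact integer arithmetic via `math.comb` avoids floating-point underflow; a statistics library would do the same job if one is available.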

My next step was to look at the evolutionary conservation of the 23 periodically expressed cancer genes. Only 12 of them belong to an orthologous group. About half of them thus do not appear to have orthologs in budding yeast, fission yeast, or Arabidopsis thaliana. Only three periodically expressed cancer genes have orthologs in all of these organisms. One of these genes is periodically expressed only in human, one in human and fission yeast, and one in all four organisms (a histone subunit).

In summary, it seems that one cannot say much about cancer based on cell-cycle mRNA expression data. This is perhaps not surprising considering that the transcriptional regulation does not seem to vary much between cancer cells and normal cells.


Analysis: Cancer or not, cell-cycle expression stays the same

April 13, 2008

The groups of Ziv Bar-Joseph and Itamar Simon recently published a paper in PNAS on a new microarray study of the cell cycle of primary human fibroblasts:

Genome-wide transcriptional analysis of the human cell cycle identifies genes differentially regulated in normal and cancer cells

Characterization of the transcriptional regulatory network of the normal cell cycle is essential for understanding the perturbations that lead to cancer. However, the complete set of cycling genes in primary cells has not yet been identified. Here, we report the results of genome-wide expression profiling experiments on synchronized primary human foreskin fibroblasts across the cell cycle. Using a combined experimental and computational approach to deconvolve measured expression values into ‘‘single-cell’’ expression profiles, we were able to overcome the limitations inherent in synchronizing nontransformed mammalian cells. This allowed us to identify 480 periodically expressed genes in primary human foreskin fibroblasts. Analysis of the reconstructed primary cell profiles and comparison with published expression datasets from synchronized transformed cells reveals a large number of genes that cycle exclusively in primary cells. This conclusion was supported by both bioinformatic analysis and experiments performed on other cell types. We suggest that this approach will help pinpoint genetic elements contributing to normal cell growth and cellular transformation.

In contrast to the earlier study by Whitfield et al. (2002), which was performed on HeLa cells, Ziv Bar-Joseph et al. worked on non-transformed fibroblasts. The dataset thus offers a first global view of the differences between the cell cycle of normal human cells and that of cancer cells.

To compare their list of cell-cycle-regulated human genes to the one that I have used so far, I mapped their 480 genes to Ensembl using the mapping file from the STRING database. This resulted in a list of 410 genes; that is, 70 genes could not be mapped by the automatic procedure. Although this is far from a perfect mapping, it is sufficient to judge the quality of the list.

The plots below show the fraction of a benchmark set that is identified as a function of the number of genes that are proposed to be periodically expressed during the cell cycle. In each plot, I compare the results for the list of 410 genes obtained from the new study by Bar-Joseph et al., the analysis by Whitfield et al., and the reanalysis of the latter dataset by Jensen et al. (2006) (available from Cyclebase.org). To make the comparison as fair as possible, I only considered the subset of genes that were present in both microarray designs. The first plot uses as benchmark a set of 63 genes that have been identified as periodically expressed in targeted small-scale studies:

Three sets of cell-cycle-regulated human genes compared to benchmark set B1

I also benchmarked the three gene lists against a second benchmark set, which consists of predicted target genes of E2F cell-cycle transcription factors:

Three sets of cell-cycle-regulated human genes compared to benchmark set B2

Both benchmarks suggest that the three lists are of very comparable quality, but that the list by Whitfield and coworkers is much more inclusive than the one from Bar-Joseph and coworkers. In other words, the former list has better sensitivity whereas the latter has better specificity. This is consistent with the results presented by Bar-Joseph et al., who conclude that their list is more reliable than the previously published list. However, this is probably not due to better quality of the raw expression data, since reanalysis of the data by Whitfield et al. yielded a list with almost identical sensitivity and specificity (that is, the red curve is very close to the blue cross in both plots).
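The benchmark plots described above can be reproduced in outline with the following Python sketch. A ranked gene list gives a curve, whereas a fixed list gives a single point. The function names and the representation of gene lists as sequences of identifiers are assumptions for illustration, and the restriction to genes present in both microarray designs is assumed to have been applied beforehand.

```python
def benchmark_curve(ranked_genes, benchmark):
    """Fraction of the benchmark set recovered after each rank.

    Returns a list of (number of genes proposed, fraction recovered)
    pairs, one for every prefix of the ranked list.
    """
    benchmark = set(benchmark)
    found = 0
    curve = []
    for n, gene in enumerate(ranked_genes, start=1):
        if gene in benchmark:
            found += 1
        curve.append((n, found / len(benchmark)))
    return curve


def benchmark_point(gene_list, benchmark):
    """Single (list size, fraction recovered) point for an unranked list."""
    genes, benchmark = set(gene_list), set(benchmark)
    return len(genes), len(genes & benchmark) / len(benchmark)
```

Plotting the curve for a ranked reanalysis together with the points for the published fixed lists shows whether a list lies above or below the curve at the same list size.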

Although the two lists of periodically expressed genes are of comparable quality, they may still contain very different sets of genes. I therefore decided to compare the list of genes that are periodically expressed in the time course on primary fibroblasts and in each of the four time courses on HeLa cells. To make this comparison as easy as possible, I selected the top-364 cycling genes from each of the four HeLa time courses based on the reanalysis by Jensen et al. (2006). The ten Venn diagrams below show all pairwise comparisons of the five lists of 364 genes each:

The average overlap between the list by Bar-Joseph et al. and an experiment from Whitfield et al. is 114 genes. By comparison, the average overlap between the top-364 lists from two individual experiments from Whitfield et al. is 123 genes. Although the overlap may seem low, I thus believe that it is due to the poor reproducibility between microarray time courses rather than due to genuine differences between primary fibroblasts and HeLa cells as suggested by Bar-Joseph and colleagues.
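The pairwise comparison behind these numbers can be sketched as follows in Python. Five lists give ten pairs, matching the ten Venn diagrams. The dictionary layout, the list names used in the usage note, and the function names are assumptions for illustration.

```python
from itertools import combinations


def pairwise_overlaps(gene_lists):
    """Overlap size for every pair of gene lists.

    `gene_lists` maps a list name to an iterable of gene identifiers.
    Returns a dict mapping (name_a, name_b) to the number of shared genes.
    """
    sets = {name: set(genes) for name, genes in gene_lists.items()}
    return {
        (a, b): len(sets[a] & sets[b])
        for a, b in combinations(sorted(sets), 2)
    }


def mean_overlap(overlaps, keep):
    """Average overlap over the pairs for which keep(pair) is true."""
    sizes = [size for pair, size in overlaps.items() if keep(pair)]
    return sum(sizes) / len(sizes)
```

With hypothetical names such as "fibroblast" and "hela1" to "hela4", `mean_overlap(overlaps, lambda pair: "fibroblast" in pair)` gives the average overlap between the fibroblast list and the HeLa experiments, and negating the condition gives the average among the HeLa experiments themselves.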

Although cancer cells have to circumvent the regulatory mechanisms that would normally prevent cell proliferation, the cell cycle itself appears to function the same way as in normal cells. In other words, the difference does not lie in the “engine” but in the “brakes”, which have been sabotaged in cancer cells.


Commentary: Viewing the cell cycle in a new light

April 6, 2008

Atsushi Miyawaki’s lab from RIKEN has recently published a Cell paper that describes a novel approach for monitoring cell-cycle progression of individual cells:

Visualizing spatiotemporal dynamics of multicellular cell-cycle progression

The cell-cycle transition from G1 to S phase has been difficult to visualize. We have harnessed antiphase oscillating proteins that mark cell-cycle transitions in order to develop genetically encoded fluorescent probes for this purpose. These probes effectively label individual G1 phase nuclei red and those in S/G2/M phases green. We were able to generate cultured cells and transgenic mice constitutively expressing the cell-cycle probes, in which every cell nucleus exhibits either red or green fluorescence. We performed time-lapse imaging to explore the spatiotemporal patterns of cell-cycle dynamics during the epithelial-mesenchymal transition of cultured cells, the migration and differentiation of neural progenitors in brain slices, and the development of tumors across blood vessels in live mice. These mice and cell lines will serve as model systems permitting unprecedented spatial and temporal resolution to help us better understand how the cell cycle is coordinated with various biological events.

The clever idea was to fuse a red- and a green-emitting fluorescent protein to Cdt1 and Geminin, respectively. Cdt1 is ubiquitinated by SCFSkp2 at the onset of S phase, which causes it to be rapidly degraded by the proteasome, whereas Geminin is targeted for proteolytic degradation by APCCdh1 in late M phase. By fluorescent labeling of two proteins, Miyawaki and colleagues managed to make mouse cells that become increasingly red during G1 phase, yellow around the G1/S transition, and increasingly green through S, G2, and M phase. It is thus possible to monitor the cell-cycle states of individual cells with a microscope.

The movie below follows a few HeLa cells for 3-4 cell cycles:

The authors also show how their construct can be used for imaging the cell-cycle state of the cells in a slice of a mouse brain or a mouse embryo. I expect that this will become an indispensable tool for unraveling the links between cell-cycle control and developmental processes.

For more details, I strongly recommend that you read Jake Young’s post at Pure Pedantry.


Commentary: Much ado about alignments

March 30, 2008

There seems to be a new trend in computational biology: worrying about sequence alignments. Over the past couple of months, two high-profile papers have appeared that describe flaws related to sequence alignment methods.

The first paper appeared in Science Magazine in January this year. Wong and coworkers describe how uncertainties in multiple alignments can lead to errors in different phylogenetic trees:

Alignment Uncertainty and Genomic Analysis

The statistical methods applied to the analysis of genomic data do not account for uncertainty in the sequence alignment. Indeed, the alignment is treated as an observation, and all of the subsequent inferences depend on the alignment being correct. This may not have been too problematic for many phylogenetic studies, in which the gene is carefully chosen for, among other things, ease of alignment. However, in a comparative genomics study, the same statistical methods are applied repeatedly on thousands of genes, many of which will be difficult to align. Using genomic data from seven yeast species, we show that uncertainty in the alignment can lead to several problems, including different alignment methods resulting in different conclusions.

The second paper appeared in Nature Biotechnology. Styczynski and coworkers discovered that the most commonly used substitution matrix, BLOSUM62, was calculated wrongly:

BLOSUM62 miscalculations improve search performance

The BLOSUM family of substitution matrices, and particularly BLOSUM62, is the de facto standard in protein database searches and sequence alignments. In the course of analyzing the evolution of the Blocks database, we noticed errors in the software source code used to create the initial BLOSUM family of matrices (available online). The result of these errors is that the BLOSUM matrices — BLOSUM62, BLOSUM50, etc. — are quite different from the matrices that should have been calculated using the algorithm described by Henikoff and Henikoff. Obviously, minor errors in research, and particularly in software source code, are quite common. This case is noteworthy for three reasons: first, the BLOSUM matrices are ubiquitous in computational biology; second, these errors have gone unnoticed for 15 years; and third, the ‘incorrect’ matrices perform better than the ‘intended’ matrices.

Upon casual reading of these publications, one could get the idea that over a decade of work based on alignments, sequence similarity searches, and molecular evolution is wrong. Fortunately, this does not appear to be the case.

Starting with the second paper, I applaud the authors for discovering a mistake in such an established method, and I agree with them that it is remarkable that it has not been noticed before. However, I do not think that it is surprising that the ‘incorrect’ matrices work very well. Although they were not calculated as intended, the BLOSUM matrices have become the de facto standard precisely because they work as well as they do.

Regarding the first paper, I think it is fair to say that anyone working on multiple alignments and phylogeny is well aware that uncertain alignments can lead to wrong phylogenetic trees. This is why almost everyone uses programs like Gblocks to remove the ambiguous parts of their alignments before moving on to constructing phylogenetic trees. Unfortunately, Wong et al. instead constructed two sets of trees for each of the six multiple alignment methods: one based on the complete alignments, and one in which they excluded all gapped sites from the phylogenetic analysis. The latter is not equivalent to using a blocked alignment, since not all ambiguously aligned sites contain gaps, and since not all sites with gaps are ambiguously aligned.

Wong and coworkers subsequently compared the trees that they obtained using the six different alignment programs and found disagreements for almost half of all yeast proteins. This number may sound shockingly high, but I find it to be misleading in several ways. First, “disagreement” was defined as at least one of the six trees disagreeing with the others – much of the disagreement could thus be due to a single poorly performing alignment program. This definition also implies that the results can only get worse by adding more alignment methods to the comparison. Second, the comparison was not limited to the trees that are supported by bootstrap analysis – much of the disagreement is thus due to trees that we already know should not be trusted.

In my view, it would be fairer to make the comparison along the following lines:

  • Align the sequences as done by Wong et al.
  • Remove ambiguously aligned sites with Gblocks
  • Construct phylogenetic trees based on the blocked alignments
  • Calculate the bootstrap support for each tree
  • Discard trees with poor bootstrap support
  • Calculate the agreement on tree topology for each pair of alignment methods

This procedure will ensure that trees are not distorted by the unreliable parts of the alignments, that comparisons are not based on trees we know are unreliable, that the results are not skewed by a single poorly performing alignment method, and that the numbers remain comparable if more alignment methods are added. I have already downloaded all the alignments and run them through Gblocks; please let me know if you would like to continue the analysis from that step, and I will arrange a way to transfer the files.
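The last two steps of the proposed procedure, discarding poorly supported trees and measuring agreement per pair of alignment methods, can be sketched as follows in Python. Everything here is an assumption made for illustration: each tree is represented by a canonical topology string plus a single bootstrap support value, the cutoff of 70 is arbitrary, and the alignment, Gblocks, and tree-building steps are taken as already done.

```python
from itertools import combinations


def pairwise_agreement(trees, min_support=70.0):
    """Fraction of genes on which each pair of alignment methods agrees.

    `trees` maps gene -> {method: (topology, bootstrap_support)}.
    The topology is assumed to be a canonical string, so that equal
    topologies compare equal. Trees with support below `min_support`
    (an arbitrary, hypothetical cutoff) are discarded first.
    """
    compared = {}
    agreed = {}
    for per_method in trees.values():
        trusted = {
            method: topology
            for method, (topology, support) in per_method.items()
            if support >= min_support
        }
        for a, b in combinations(sorted(trusted), 2):
            compared[(a, b)] = compared.get((a, b), 0) + 1
            if trusted[a] == trusted[b]:
                agreed[(a, b)] = agreed.get((a, b), 0) + 1
    return {pair: agreed.get(pair, 0) / n for pair, n in compared.items()}
```

Because agreement is reported per pair of methods, adding a seventh alignment program only adds new pairs and leaves the existing numbers unchanged, unlike a definition based on at least one tree disagreeing.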

Time might prove me wrong, but I expect that such an analysis will show that alignment uncertainty is not a major factor that needs to be taken into account when constructing phylogenetic trees.


Editorial: Live blogging - not so easy

March 29, 2008

I am now back from two weeks in Italy where I experimented with live blogging. You have probably noticed that some presentations from the meeting at CoSBi in Trento were covered on Buried Treasure within a matter of minutes of them ending. Also, quite a number of pictures were posted in the associated Picasa web album while the presentations were still ongoing. Here is a brief explanation of how I planned to pull this off and how it worked in practice.

My original plan was to use WordPress through the web browser on my smartphone together with a foldable bluetooth keyboard. This was how I first imagined that my live-blogging platform would look:

Live blogging - WordPress, HTC S710, and bluetooth keyboard

It seemed a good idea at the time, but there were a couple of “minor” problems:

I thus started looking around for alternative clients for WordPress and eventually found ShoZu, which allows you to upload pictures from your phone to a variety of services including WordPress blogs and Picasa web albums. However, it is not a true blogging tool and only enables you to write a short description for each picture. I thus accepted that the real blog posts would be written on my laptop, whereas the following platform would be used for live blogging in the form of images with short descriptions:

Live blogging - ShoZu and HTC S710

This seemed like an even better idea at the time, but again I ran into a few technical problems:

  • Due to strange combinations of firewalls, HTTP proxies, and complex login web pages, I never managed to get my smartphone reliably connected to the internet.
  • The camera in the smartphone was unable to take even a half-decent picture under the poor lighting conditions.

In reality, I thus ended up using my old Apple PowerBook G4 and my Lumix TZ3 camera. They got the job done in terms of covering the presentations, but live blogging from poster sessions was practically impossible.

I have now put on my thinking cap to come up with a live-blogging platform that would work for poster sessions. You generally have too little time for too many posters, so it has to be very fast to snap a photo and post it. The light is often poor and people tend to use too small fonts on their posters, so you need a good camera to get a readable result. Finally, the lack of space and tables prevents you from using a laptop. The Eye-Fi Card might be a solution as it would enable me to upload images directly from my camera to, for example, a Picasa web album or Flickr. Please let me know if you have any ideas, experiences, or thoughts on this.


Analysis: The budding yeast phosphoproteome

March 23, 2008

The group of Donald F. Hunt at University of Virginia has recently published a paper in PNAS that describes a new phosphoproteomics study of budding yeast:

Analysis of phosphorylation sites on proteins from Saccharomyces cerevisiae by electron transfer dissociation (ETD) mass spectrometry

We present a strategy for the analysis of the yeast phosphoproteome that uses endo-Lys C as the proteolytic enzyme, immobilized metal affinity chromatography for phosphopeptide enrichment, a 90-min nanoflow-HPLC/electrospray-ionization MS/MS experiment for phosphopeptide fractionation and detection, gas phase ion/ion chemistry, electron transfer dissociation for peptide fragmentation, and the Open Mass Spectrometry Search Algorithm for phosphoprotein identification and assignment of phosphorylation sites. From a 30-microg (approximately 600 pmol) sample of total yeast protein, we identify 1,252 phosphorylation sites on 629 proteins. Identified phosphoproteins have expression levels that range from <50 to 1,200,000 copies per cell and are encoded by genes involved in a wide variety of cellular processes. We identify a consensus site that likely represents a motif for one or more uncharacterized kinases and show that yeast kinases, themselves, contain a disproportionately large number of phosphorylation sites. Detection of a pHis containing peptide from the yeast protein, Cdc10, suggests an unexpected role for histidine phosphorylation in septin biology. From diverse functional genomics data, we show that phosphoproteins have a higher number of interactions than an average protein and interact with each other more than with a random protein. They are also likely to be conserved across large evolutionary distances.

As is so often the case with experimental papers, no comparison is provided to earlier studies. I thus decided to compare the set of phosphoproteins identified by Hunt and coworkers to the set of Cdc28p substrates identified in two studies by the Morgan lab as well as to the proteome-wide, sequence-based predictions made by NetPhosK:

Venn diagram comparing three sets of phosphoproteins from budding yeast

The Venn diagram obviously shows that each of the three sets contains a considerable number of phosphoproteins that are not present in any of the other sets. This was to be expected since the three methods are fundamentally very different. The dataset from the Hunt lab includes proteins that are phosphorylated by kinases other than Cdc28p; however, it is limited in the sense that low-abundance phosphopeptides are typically missed by MS studies. Conversely, the set from the Morgan lab consists only of Cdc28p substrates, but is likely to have much better coverage of low-abundance phosphoproteins. Finally, the set of Cdc28p substrates from NetPhosK is likely to contain a considerable number of false positives as they are predicted from the protein sequence alone.

As a matter of fact, I find the overlap between the three sets to be surprisingly good. Even if we assume that the dataset from the Morgan lab contains no false positives, the overlap suggests that the new dataset from Hunt and coworkers captures one third of all phosphoproteins in budding yeast; assuming errors in both datasets increases this estimate. It is also noteworthy that NetPhosK misses only 22% of the Cdc28p substrates that were identified by the Morgan lab and supported by the new data from the Hunt lab, although this high coverage is probably obtained at the price of many false positive predictions.
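The two estimates above are simple set calculations, sketched below in Python. The function names are hypothetical. Reading the coverage as the fraction of a trusted reference set that also appears in the new dataset is one interpretation of the argument, and it assumes that the reference set is a reasonably representative sample of all phosphoproteins.

```python
def estimated_coverage(new_set, reference_set):
    """Fraction of a trusted reference set recovered by a new dataset.

    If the reference set has no false positives and is representative,
    this fraction estimates how much of the full phosphoproteome the
    new dataset captures.
    """
    new_set, reference_set = set(new_set), set(reference_set)
    return len(new_set & reference_set) / len(reference_set)


def miss_rate(predicted, confirmed):
    """Fraction of confirmed proteins that a predictor fails to report."""
    predicted, confirmed = set(predicted), set(confirmed)
    return 1 - len(predicted & confirmed) / len(confirmed)
```

For the NetPhosK figure, `confirmed` would be the intersection of the Morgan and Hunt sets and `predicted` the NetPhosK predictions.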


Editorial: No intelligence involved

March 22, 2008

You may have heard about the controversial movie “Expelled: No Intelligence Allowed” by Ben Stein in which people behind the intelligent design movement whine about being suppressed by the scientific community. The truth is obviously that intelligent design is not a falsifiable theory and hence simply does not qualify as science.

However, the movie is also controversial in other respects. To start with, the producers conned both Richard Dawkins and fellow blogger PZ Myers into participating in the movie by interviewing them under false pretenses.

Richard Dawkins and PZ Myers thus both registered for participating in a public screening of the movie. But while queuing up for the movie, PZ Myers was identified by security officers and told to leave the premises - immediately! Oh the irony, oh the double standard. They make a movie about suppression of views, they call it “Expelled”, and then they expel a person whom they conned into participating in the movie because they disagree with his views.

But it gets even better. The very same security officers were apparently oblivious to the fact that Richard Dawkins was standing right next to PZ Myers and thus let him enter to watch the movie. PZ Myers immediately wrote a blog post about it, while the movie was still being shown to the audience - including Richard Dawkins.

After the movie, Richard Dawkins (of course) stood up and asked why PZ Myers was not allowed to see the movie. The answer? Because he did not have a ticket and was thus a gate crasher! Very interesting explanation since it was not a ticket event - you simply had to register a seat, which PZ Myers had done.

The two gentlemen have now posted an interesting little discussion on YouTube in which they humorously describe the incident as well as just how bad the movie really is:

Richard Dawkins also reveals that Expelled includes one of the beautiful movies produced by the multimedia team at Harvard. You really have to wonder if they actually got permission for that, if they conned the people at Harvard as well, or if they just resorted to plain old plagiarism. In any case, this has to be one of the biggest PR disasters ever made by the intelligent design movement.

Expelled from Expelled: no intelligence involved.


Live: Bioinformatics for Molecular Biologists

March 16, 2008

I have now arrived in Bertinoro where I will be lecturing on the 8th Course in Bioinformatics for Molecular Biologists. And after a fight with network configuration and power outages, I also eventually managed to get online.

All the speakers are housed at the castle, which has a fantastic view over the surrounding area - also by night:

Night view from the castle

The scientific part of the meeting was kicked off by H. Werner Mewes:

Opening lecture by H. Werner Mewes

I am sure there will be many interesting lectures to follow - and I hope that the audience will think that mine is one of them.