Monthly Archives: March 2008

Commentary: Much ado about alignments

There seems to be a new trend in computational biology: worrying about sequence alignments. Over the past couple of months, two high-profile papers have appeared that report flaws related to sequence alignment methods.

The first paper appeared in Science Magazine in January this year. Wong and coworkers describe how uncertainties in multiple alignments can lead to conflicting phylogenetic trees:

Alignment Uncertainty and Genomic Analysis

The statistical methods applied to the analysis of genomic data do not account for uncertainty in the sequence alignment. Indeed, the alignment is treated as an observation, and all of the subsequent inferences depend on the alignment being correct. This may not have been too problematic for many phylogenetic studies, in which the gene is carefully chosen for, among other things, ease of alignment. However, in a comparative genomics study, the same statistical methods are applied repeatedly on thousands of genes, many of which will be difficult to align. Using genomic data from seven yeast species, we show that uncertainty in the alignment can lead to several problems, including different alignment methods resulting in different conclusions.

The second paper appeared in Nature Biotechnology. Styczynski and coworkers discovered that the most commonly used substitution matrix, BLOSUM62, was calculated wrongly:

BLOSUM62 miscalculations improve search performance

The BLOSUM family of substitution matrices, and particularly BLOSUM62, is the de facto standard in protein database searches and sequence alignments. In the course of analyzing the evolution of the Blocks database, we noticed errors in the software source code used to create the initial BLOSUM family of matrices (available online). The result of these errors is that the BLOSUM matrices — BLOSUM62, BLOSUM50, etc. — are quite different from the matrices that should have been calculated using the algorithm described by Henikoff and Henikoff. Obviously, minor errors in research, and particularly in software source code, are quite common. This case is noteworthy for three reasons: first, the BLOSUM matrices are ubiquitous in computational biology; second, these errors have gone unnoticed for 15 years; and third, the ‘incorrect’ matrices perform better than the ‘intended’ matrices.

Upon casual reading of these publications, one could get the idea that over a decade of work based on alignments, sequence similarity searches, and molecular evolution is wrong. Fortunately, this does not appear to be the case.

Starting with the second paper, I applaud the authors for discovering a mistake in such an established method, and I agree with them that it is remarkable that it has not been noticed before. However, I do not think that it is surprising that the ‘incorrect’ matrices work very well. Although they were not calculated as intended, the BLOSUM matrices have become the de facto standard precisely because they work as well as they do.

Regarding the first paper, I think it is fair to say that anyone working on multiple alignments and phylogeny is well aware that uncertain alignments can lead to wrong phylogenetic trees. This is why almost everyone uses programs like Gblocks to remove the ambiguous parts of their alignments before moving on to constructing phylogenetic trees. Unfortunately, Wong et al. instead constructed two sets of trees for each of the six multiple alignment methods: one based on the complete alignments, and one in which they excluded all gapped sites from the phylogenetic analysis. The latter is not equivalent to using a blocked alignment, since not all ambiguously aligned sites contain gaps, and since not all sites with gaps are ambiguously aligned.
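
To make the distinction concrete, here is a minimal sketch of what excluding all gapped sites amounts to, assuming the alignment is held as a list of equal-length strings. The function name and the toy sequences are illustrative assumptions, not the code used by Wong et al., and the sketch deliberately does not attempt to reproduce what Gblocks does.

```python
def drop_gapped_columns(alignment, gap="-"):
    """Return the alignment with every column that contains a gap removed."""
    if len({len(row) for row in alignment}) > 1:
        raise ValueError("all aligned sequences must have the same length")
    # zip(*alignment) walks over the alignment column by column.
    kept = [col for col in zip(*alignment) if gap not in col]
    # Transpose the kept columns back into one string per sequence.
    return ["".join(col[i] for col in kept) for i in range(len(alignment))]


# Hypothetical toy alignment of three sequences.
toy_alignment = ["AC-GT", "A-CGT", "ACCGT"]
print(drop_gapped_columns(toy_alignment))
```

A filter like this looks only at gap characters. An ungapped column that is ambiguously aligned is kept, and a gapped column that is perfectly well aligned is thrown away, which is exactly why it is not a substitute for a blocked alignment.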

Wong and coworkers subsequently compared the trees that they obtained using the six different alignment programs and found disagreements for almost half of all yeast proteins. This number may sound shockingly high, but I find it to be misleading in several ways. First, “disagreement” was defined as at least one of the six trees disagreeing with the others – much of the disagreement could thus be due to a single poorly performing alignment program. This definition also implies that the results can only get worse by adding more alignment methods to the comparison. Second, the comparison was not limited to the trees that are supported by bootstrap analysis – much of the disagreement is thus due to trees that we already know should not be trusted.
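
The effect of the "at least one tree disagrees" definition can be sketched in a few lines, assuming each tree topology is reduced to a simple label. The gene names, method names, and topologies below are hypothetical toy data, not results from Wong et al.

```python
def fraction_with_disagreement(topologies_by_gene, methods):
    """Fraction of genes for which the chosen methods do not all agree."""
    disagreeing = sum(
        1
        for trees in topologies_by_gene.values()
        if len({trees[m] for m in methods}) > 1
    )
    return disagreeing / len(topologies_by_gene)


# Hypothetical toy data: gene -> alignment method -> topology label.
toy_topologies = {
    "geneA": {"m1": "T1", "m2": "T1", "m3": "T1"},
    "geneB": {"m1": "T1", "m2": "T1", "m3": "T2"},
    "geneC": {"m1": "T1", "m2": "T2", "m3": "T2"},
    "geneD": {"m1": "T3", "m2": "T3", "m3": "T3"},
}

all_methods = ["m1", "m2", "m3"]
for k in range(2, len(all_methods) + 1):
    print(k, fraction_with_disagreement(toy_topologies, all_methods[:k]))
```

A gene that already shows disagreement among the first k methods still does so after adding another method, so under this definition the fraction can stay the same or grow, but never shrink.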

In my view, it would be more fair to make the comparison along the following lines:

  • Align the sequences as done by Wong et al.
  • Remove ambiguously aligned sites with Gblocks
  • Construct phylogenetic trees based on the blocked alignments
  • Calculate the bootstrap support for each tree
  • Discard trees with poor bootstrap support
  • Calculate the agreement on tree topology for each pair of alignment methods

This procedure will ensure that trees are not distorted by the unreliable parts of the alignments, that comparisons are not based on trees we know are unreliable, that the results are not skewed by a single poorly performing alignment method, and that the numbers remain comparable if more alignment methods are added. I have already downloaded all the alignments and run them through Gblocks; please let me know if you would like to continue the analysis from that step, and I will arrange a way to transfer the files.
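
The last two steps of the procedure could be sketched as follows. This is only an illustration under stated assumptions: each tree is reduced to a topology label plus a single bootstrap support value, the cutoff of 70 is an arbitrary placeholder, and all gene and method names are hypothetical.

```python
from itertools import combinations


def pairwise_agreement(trees_by_gene, min_support=70.0):
    """Agreement on topology for each pair of alignment methods.

    trees_by_gene maps gene -> method -> (topology label, bootstrap support).
    A gene is only compared for a pair of methods if both trees reach
    min_support. Returns (method_a, method_b) -> fraction of compared genes
    with identical topology, or None if no gene could be compared.
    """
    methods = sorted({m for trees in trees_by_gene.values() for m in trees})
    result = {}
    for a, b in combinations(methods, 2):
        compared = agreed = 0
        for trees in trees_by_gene.values():
            if a not in trees or b not in trees:
                continue
            (topo_a, support_a), (topo_b, support_b) = trees[a], trees[b]
            # Discard trees with poor bootstrap support before comparing.
            if support_a < min_support or support_b < min_support:
                continue
            compared += 1
            agreed += topo_a == topo_b
        result[(a, b)] = agreed / compared if compared else None
    return result


# Hypothetical toy data for three genes and three alignment methods.
toy_trees = {
    "geneA": {"m1": ("T1", 95), "m2": ("T1", 90), "m3": ("T1", 88)},
    "geneB": {"m1": ("T1", 92), "m2": ("T1", 85), "m3": ("T2", 40)},
    "geneC": {"m1": ("T1", 80), "m2": ("T2", 75), "m3": ("T2", 90)},
}
for pair, agreement in pairwise_agreement(toy_trees).items():
    print(pair, agreement)
```

Because every number refers to one pair of methods, adding a seventh alignment program would add new pairs without changing the values already computed for the existing ones.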

Time might prove me wrong, but I expect that such an analysis will show that alignment uncertainty is not a major factor that needs to be taken into account when constructing phylogenetic trees.

Editorial: Live blogging – not so easy

I am now back from two weeks in Italy where I experimented with live blogging. You have probably noticed that some presentations from the meeting at CoSBi in Trento were covered on Buried Treasure within a matter of minutes of them ending. Also, quite a number of pictures were posted in the associated Picasa web album while the presentations were still ongoing. Here is a brief explanation of how I planned to pull this off and how it worked in practice.

My original plan was to use WordPress through the web browser on my smartphone together with a foldable bluetooth keyboard. This was how I first imagined that my live-blogging platform would look:

Live blogging - WordPress, HTC S710, and bluetooth keyboard

It seemed a good idea at the time, but there were a couple of “minor” problems:

I thus started looking around for alternative clients for WordPress and eventually found ShoZu, which allows you to upload pictures from your phone to a variety of services including WordPress blogs and Picasa web albums. However, it is not a true blogging tool and only enables you to write a short description for each picture. I thus accepted that the real blog posts would be written on my laptop, whereas the following platform would be used for live blogging in the form of images with short descriptions:

Live blogging - ShoZu and HTC S710

This seemed like an even better idea at the time, but again I ran into a few technical problems:

  • Due to strange combinations of firewalls, HTTP proxies, and complex login web pages, I never managed to get my smartphone reliably connected to the internet.
  • The camera in the smartphone was unable to take even a half-decent picture under the poor light conditions.

In reality, I thus ended up using my old Apple PowerBook G4 and my Lumix TZ3 camera. They got the job done in terms of covering the presentations, but live blogging from poster sessions was practically impossible.

I have now put on my thinking cap to come up with a live-blogging platform that would work for poster sessions. You generally have too little time for too many posters, so it has to be very fast to snap a photo and post it. The light is often poor and people tend to use fonts that are too small on their posters, so you need a good camera to get a readable result. Finally, the lack of space and tables prevents you from using a laptop. The Eye-Fi Card might be a solution as it would enable me to upload images directly from my camera to, for example, a Picasa web album or Flickr. Please let me know if you have any ideas, experiences, or thoughts on this.

Analysis: The budding yeast phosphoproteome

The group of Donald F. Hunt at University of Virginia has recently published a paper in PNAS that describes a new phosphoproteomics study of budding yeast:

Analysis of phosphorylation sites on proteins from Saccharomyces cerevisiae by electron transfer dissociation (ETD) mass spectrometry

We present a strategy for the analysis of the yeast phosphoproteome that uses endo-Lys C as the proteolytic enzyme, immobilized metal affinity chromatography for phosphopeptide enrichment, a 90-min nanoflow-HPLC/electrospray-ionization MS/MS experiment for phosphopeptide fractionation and detection, gas phase ion/ion chemistry, electron transfer dissociation for peptide fragmentation, and the Open Mass Spectrometry Search Algorithm for phosphoprotein identification and assignment of phosphorylation sites. From a 30-microg (approximately 600 pmol) sample of total yeast protein, we identify 1,252 phosphorylation sites on 629 proteins. Identified phosphoproteins have expression levels that range from <50 to 1,200,000 copies per cell and are encoded by genes involved in a wide variety of cellular processes. We identify a consensus site that likely represents a motif for one or more uncharacterized kinases and show that yeast kinases, themselves, contain a disproportionately large number of phosphorylation sites. Detection of a pHis containing peptide from the yeast protein, Cdc10, suggests an unexpected role for histidine phosphorylation in septin biology. From diverse functional genomics data, we show that phosphoproteins have a higher number of interactions than an average protein and interact with each other more than with a random protein. They are also likely to be conserved across large evolutionary distances.

As is so often the case with experimental papers, no comparison is provided to earlier studies. I thus decided to compare the set of phosphoproteins identified by Hunt and coworkers to the set of Cdc28p substrates identified in two studies by the Morgan lab as well as to the proteome-wide, sequence-based predictions made by NetPhosK:

Venn diagram comparing three sets of phosphoproteins from budding yeast

The Venn diagram obviously shows that each of the three sets contains a considerable number of phosphoproteins that are not present in any of the other sets. This was to be expected since the three methods are fundamentally very different. The dataset from the Hunt lab includes proteins that are phosphorylated by kinases other than Cdc28p; however, it is limited in the sense that low-abundance phosphopeptides are typically missed by MS studies. Conversely, the set from the Morgan lab consists only of Cdc28p substrates, but is likely to have much better coverage of low-abundance phosphoproteins. Finally, the set of Cdc28p substrates from NetPhosK is likely to contain a considerable number of false positives as they are predicted from the protein sequence alone.

As a matter of fact, I find the overlap between the three sets to be surprisingly good. Even if we assume that the dataset from the Morgan lab contains no false positives, the overlap suggests that the new dataset from Hunt and coworkers captures one third of all phosphoproteins in budding yeast; assuming errors in both datasets increases this estimate. It is also noteworthy that NetPhosK misses only 22% of the Cdc28p substrates that were identified by the Morgan lab and supported by the new data from the Hunt lab, although this high coverage is probably obtained at the price of many false positive predictions.
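
The reasoning behind the coverage estimate can be written down as a small sketch. It assumes that one dataset is treated as a reference sample of true phosphoproteins and that any false positives in that reference do not show up in the overlap. The function names and protein identifiers are hypothetical, and the toy sets are not the real datasets.

```python
def venn_regions(a, b, c):
    """Split three sets into the seven regions of a Venn diagram."""
    return {
        "only_a": a - b - c,
        "only_b": b - a - c,
        "only_c": c - a - b,
        "a_and_b_only": (a & b) - c,
        "a_and_c_only": (a & c) - b,
        "b_and_c_only": (b & c) - a,
        "all_three": a & b & c,
    }


def coverage_estimate(new_set, reference_set, reference_fp_rate=0.0):
    """Estimate which fraction of true positives new_set captures.

    The reference set is used as a sample of true positives. If a fraction
    reference_fp_rate of it is assumed to be wrong, fewer reference entries
    count as true, so the same overlap implies a higher coverage.
    """
    overlap = len(new_set & reference_set)
    true_in_reference = len(reference_set) * (1.0 - reference_fp_rate)
    return overlap / true_in_reference


# Hypothetical toy sets standing in for the MS-based, substrate-based,
# and predicted phosphoprotein sets.
ms_set = {"P1", "P2", "P3", "P4"}
substrate_set = {"P2", "P3", "P5", "P6", "P7", "P8"}
predicted_set = {"P2", "P5", "P6", "P9"}

for region, members in venn_regions(ms_set, substrate_set, predicted_set).items():
    print(region, sorted(members))
print(coverage_estimate(ms_set, substrate_set))
print(coverage_estimate(ms_set, substrate_set, reference_fp_rate=0.25))
```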

Editorial: No intelligence involved

You may have heard about the controversial movie “Expelled: No Intelligence Allowed” by Ben Stein in which people behind the intelligent design movement whine about being suppressed by the scientific community. The truth is obviously that intelligent design is not a falsifiable theory and hence simply does not qualify as science.

However, the movie is also controversial in other respects. To start with, the producers conned both Richard Dawkins and fellow blogger PZ Myers into participating in the movie by interviewing them under false pretenses.

Richard Dawkins and PZ Myers thus both registered to attend a public screening of the movie. But while queuing up for the movie, PZ Myers was identified by security officers and told to leave the premises – immediately! Oh the irony, oh the double standard. They make a movie about suppression of views, they call it “Expelled”, and then they expel a person whom they conned into participating in the movie because they disagree with his views.

But it gets even better. The very same security officers were apparently oblivious to the fact that Richard Dawkins was standing right next to PZ Myers and thus let him enter to watch the movie. PZ Myers immediately wrote a blog post about it, while the movie was still being shown to the audience – including Richard Dawkins.

After the movie, Richard Dawkins (of course) stood up and asked why PZ Myers was not allowed to see the movie. The answer? Because he did not have a ticket and was thus a gate crasher! Very interesting explanation since it was not a ticket event – you simply had to register a seat, which PZ Myers had done.

The two gentlemen have now posted an interesting little discussion on YouTube in which they humorously describe the incident as well as just how bad the movie really is:

Richard Dawkins also reveals that Expelled includes one of the beautiful movies produced by the multimedia team at Harvard. You really have to wonder if they actually got permission for that, if they conned the people at Harvard as well, or if they just resorted to plain old plagiarism. In any case, this has to be one of the biggest PR disasters ever made by the intelligent design movement.

Expelled from Expelled: no intelligence involved.

Live: Bioinformatics for Molecular Biologists

I have now arrived in Bertinoro where I will be lecturing at the 8th Course in Bioinformatics for Molecular Biologists. And after a fight with network configuration and power outages, I also eventually managed to get online.

All the speakers are housed at the castle, which has a fantastic view over the surrounding area – also by night:

Night view from the castle

The scientific part of the meeting was kicked off by H. Werner Mewes:

Opening lecture by H. Werner Mewes

I am sure there will be many interesting lectures to follow – and I hope that the audience will think that mine is one of them.

Live: Evolution of biological pathways

Orkun Soyer has just finished his excellent presentation at CoSBi on the use of toy models for understanding the principles that govern biological pathways, in particular signaling pathways. One can obviously imagine several scenarios for how pathways came about:

Evolution vs. intelligent design

The key point, however, is that we might be able to understand something about pathways through computational studies of simple toy models. The toy model discussed throughout the talk was bacterial chemotaxis:

Evolving “chemotaxis” in a computer

The idea is that evolution can to some extent be approximated as an optimization process, in which the objective function corresponds to fitness. In the case of the “tumble or swim” problem, computational simulations allowed simple regulatory networks to evolve that mimic the food-finding behavior of bacteria.
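
The general idea of treating evolution as optimization can be illustrated with a very small mutate-and-select loop. This is a generic sketch under my own assumptions, not the model presented in the talk: the "genome" is just a list of numbers, the fitness function is an arbitrary toy, and all names and parameter values are hypothetical.

```python
import random


def evolve(fitness, genome, generations=1000, step=0.1, seed=0):
    """Mutate one parameter at a time and keep mutants that are not less fit."""
    rng = random.Random(seed)
    best, best_fitness = list(genome), fitness(genome)
    for _ in range(generations):
        mutant = list(best)
        position = rng.randrange(len(mutant))
        mutant[position] += rng.gauss(0.0, step)  # small random mutation
        mutant_fitness = fitness(mutant)
        # Selection: the mutant replaces the parent only if it is at least as fit.
        if mutant_fitness >= best_fitness:
            best, best_fitness = mutant, mutant_fitness
    return best, best_fitness


def toy_fitness(genome):
    """Toy objective: fitness is highest when the genome matches a target."""
    target = [1.0, -2.0, 0.5]
    return -sum((g - t) ** 2 for g, t in zip(genome, target))


start = [0.0, 0.0, 0.0]
evolved, evolved_fitness = evolve(toy_fitness, start)
print(toy_fitness(start), evolved_fitness)
```

Replacing the toy objective with a measure of how well a simulated network finds food is, in spirit, what turns such a loop into an evolutionary simulation of a regulatory network.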

He also presented an interesting view on how biological complexity has evolved. The idea is to show how complex systems can evolve even if assuming a (weak) selection against complexity:

Modeling the evolution of complexity

I think that his results provide a lot of insight into how real signaling may have evolved, although all the simulations are based on simplistic toy models. I recommend that you download Orkun Soyer’s slides if you want to know more.

This talk ends the Computational and Systems Biology course at CoSBi.

Update: Warda and Han, one month after the storm

As most readers of this blog are probably aware, Mohammad Warda and Jin Han published a paper in Proteomics that contained several pages of text copied from unreferenced sources. Exactly one month ago it was thus retracted “due to a substantial overlap of the content of this article with previously published articles in other journals”.

Plagiarism is, however, not the main issue. The paper by Warda and Han also claimed to disprove the endosymbiotic origin of mitochondria, mentioned fingerprints of a mighty creator, and proposed mitochondria to be the missing link between the body and the preserved wisdom of the soul!

It remains a mystery how a manuscript with such unsubstantiated claims was accepted for publication in a respectable, peer-reviewed journal. The retraction notice by Proteomics made no attempt to explain this, and their approved draft press release merely states that it was due to “human error”. I would have been really worried if it could happen without human error being involved. Although this draft was approved a month ago, the final version is nowhere to be found on the internet, not even on the Proteomics website. I thus wonder if an official press release was ever published.

Attila Csordas, PZ Myers, Steven Salzberg, and I have decided to mark the one month anniversary of the retraction by pointing out the important questions that still remain to be answered by the Editor in Chief of Proteomics, Prof. Michael J. Dunn.

The manuscript contains four parts with unsupported claims that should have been caught by any peer reviewer or editor:

  1. Title – “Mitochondria, the missing link between body and soul”.
  2. Abstract – “These data are presented with novel proteomics evidence to disprove the endosymbiotic hypothesis of mitochondrial evolution that is replaced in this work by a more realistic alternative”.
  3. Section 3.4 – “More logically, the points that show proteomics overlapping between different forms of life are more likely to be interpreted as a reflection of a single common fingerprint initiated by a mighty creator than relying on a single cell that is, in a doubtful way, surprisingly originating all other kinds of life”.
  4. Conclusions – “We realize so far that the mitochondria could be the link between the body and this preserved wisdom of the soul devoted to guaranteeing life”.

My questions to Michael J. Dunn are when in the publication process these parts first appeared:

  1. Were they already in the initial version that was submitted to Proteomics and sent out for peer review?
  2. Did they appear in a revised version that was sent to the peer reviewers?
  3. Were they introduced in a revised version that was accepted without sending it to the reviewers?
  4. Or were they added at the copy editing stage, that is after the manuscript had formally been accepted?

I want to make explicit that the aim of these questions is not to place the blame but to elucidate what went wrong in the publication process. To prevent similar incidents in the future, it is important to know whether the editor and the peer reviewers overlooked glaring flaws of the manuscript or if the flawed parts were introduced after peer review. It is not important who the editor and the peer reviewers are. I sincerely hope that Prof. Dunn will help improve the procedures for peer reviewed publication by answering the questions in this post and in the related posts on PIMM, Pharyngula, and Genomics, Evolution, and Pseudoscience.

Live: Networks, noise and survival in stress

Gabor Balazsi has just finished a very interesting presentation on the interplay between molecular networks, gene expression noise, and evolutionary selection – here is the opening slide:

Gabor Balazsi’s opening slide

In the first part of his talk he gave a nice introduction to global network topology and network motifs – this should be nothing new to people familiar with the work of the Barabasi and Alon labs. He also explained the “Commander, Intermediate, Executor” model for hierarchical regulatory networks, which I had personally not heard about before, and the concept of “origons”, which seems quite useful for understanding the response of large signaling networks to environmental cues.

The second part of his talk was about stochastic noise in gene expression. Genetically identical cells in a culture may express the same protein at different levels; this is a result of random noise influencing transcription, mRNA degradation, translation, and protein degradation. This is simply a consequence of low copy numbers giving rise to stochastic, as opposed to deterministic, behavior.
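
Why low copy numbers give noisy behavior can be illustrated with a minimal sketch, assuming the simplest possible model: molecules are produced at a constant rate and each molecule decays independently. This is a textbook birth-death process simulated with the Gillespie algorithm, with made-up rate constants and hypothetical function names; it is not taken from the talk.

```python
import random


def simulate_birth_death(k_make, k_decay, t_end, seed=0):
    """Gillespie simulation of constant production and first-order decay.

    k_make must be positive. Returns the event times and the copy number
    after each event, starting from zero molecules at time zero.
    """
    rng = random.Random(seed)
    t, n = 0.0, 0
    times, counts = [t], [n]
    while True:
        birth_rate = k_make
        death_rate = k_decay * n
        total_rate = birth_rate + death_rate
        t += rng.expovariate(total_rate)  # waiting time to the next event
        if t >= t_end:
            break
        # Pick which event happened, in proportion to its rate.
        if rng.random() * total_rate < birth_rate:
            n += 1
        else:
            n -= 1
        times.append(t)
        counts.append(n)
    return times, counts


def relative_noise(mean_copies):
    """Coefficient of variation of a Poisson distribution with this mean."""
    return mean_copies ** -0.5


# Hypothetical rates: the steady-state mean is k_make / k_decay = 10 copies.
times, counts = simulate_birth_death(k_make=1.0, k_decay=0.1, t_end=200.0)
print(len(counts), min(counts), max(counts))
for mean in (10, 1000):
    print(mean, relative_noise(mean))
```

For this model the steady-state copy number follows a Poisson distribution, so the relative fluctuations shrink as one over the square root of the mean: a protein present in ten copies is far noisier, relatively speaking, than one present in a thousand.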

Finally, he talked about how noise at the level of gene expression can influence the survival of species in a changing environment. This part of his talk was kicked off with the funniest slide of his presentation:

Gabor Balazsi’s funniest slide

I guess it should be seen as a lesson on how not to do it. He made some very good points about how noise plays hardly any role in multicellular organisms that reproduce sexually. By contrast, stochastic variation within clonal bacterial cultures provides a much higher chance of survival when faced with sudden stress such as treatment with anti-bacterial drugs. I would have liked to hear more about this, but unfortunately there was not much time left for this part of the presentation due to technical problems with the projectors. It looks like Guy Shinar picked the safe strategy for his presentation.

All in all, I found it to be a really inspiring talk. I have uploaded his slides in case you want to take a look at them.

Analysis: The transcriptional response to growth rate is unrelated to cell-cycle regulation

David Botstein’s group at Princeton recently published a paper in Molecular Biology of the Cell with the title “Coordination of Growth Rate, Cell Cycle, Stress Response, and Metabolic Activity in Yeast”. As described in their abstract, they found several interesting correlations between the transcriptional responses to changes in growth rate and the regulation in response to stress and during the metabolic cycle:

We studied the relationship between growth rate and genome-wide gene expression, cell cycle progression, and glucose metabolism in 36 steady-state continuous cultures limited by one of six different nutrients (glucose, ammonium, sulfate, phosphate, uracil, or leucine). The expression of more than one quarter of all yeast genes is linearly correlated with growth rate, independent of the limiting nutrient. The subset of negatively growth-correlated genes is most enriched for peroxisomal functions, whereas positively correlated genes mainly encode ribosomal functions. Many (not all) genes associated with stress response are strongly correlated with growth rate, as are genes that are periodically expressed under conditions of metabolic cycling. We confirmed a linear relationship between growth rate and the fraction of the cell population in the G0/G1 cell cycle phase, independent of limiting nutrient. Cultures limited by auxotrophic requirements wasted excess glucose, whereas those limited on phosphate, sulfate, or ammonia did not; this phenomenon (reminiscent of the “Warburg effect” in cancer cells) was confirmed in batch cultures. Using an aggregate of gene expression values, we predict (in both continuous and batch cultures) an “instantaneous growth rate”. This concept is useful in interpreting the system-level connections among growth rate, metabolism, stress, and the cell cycle.

Because of my interest in the cell cycle, their results regarding growth rate and cell-cycle regulation caught my attention. In Figure 6 of their paper, Brauer et al. show the slope distribution for the genes belonging to each of the phase-specific clusters defined by Spellman et al. (1998). The only trend they observe is that genes expressed at the M/G1 transition tend to be down-regulated at high growth rates.
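
For readers unfamiliar with the term, the slope of a gene can be thought of as the slope of a straight line fitted to its expression as a function of growth rate. The following is a minimal least-squares sketch under that assumption; the function name and all numbers are hypothetical, and the exact fitting procedure used by Brauer et al. may differ.

```python
def growth_rate_slope(growth_rates, expression):
    """Least-squares slope of expression as a function of growth rate."""
    n = len(growth_rates)
    if n != len(expression) or n < 2:
        raise ValueError("need two equally long series with at least two points")
    mean_x = sum(growth_rates) / n
    mean_y = sum(expression) / n
    sxx = sum((x - mean_x) ** 2 for x in growth_rates)
    if sxx == 0:
        raise ValueError("growth rates must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(growth_rates, expression))
    return sxy / sxx


# Hypothetical toy data: one gene that goes up and one that goes down
# with increasing growth rate (expression as log ratios).
rates = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30]
up_gene = [-1.0, -0.6, -0.2, 0.2, 0.6, 1.0]
down_gene = [1.0, 0.6, 0.2, -0.2, -0.6, -1.0]
print(growth_rate_slope(rates, up_gene), growth_rate_slope(rates, down_gene))
```

A positive slope marks a gene as up-regulated with increasing growth rate and a negative slope as down-regulated, which is the distinction used in the rest of this analysis.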

I decided to redo the cell-cycle part of their analysis in a slightly different manner, hoping that I would be able to get a stronger signal than they did. Rather than using the 800 periodically expressed genes proposed by Spellman et al. (1998), I thus made use of the list of 600 periodically expressed genes from de Lichtenberg et al. (2005). Like Brauer et al., I found no difference in growth-rate response between cell-cycle-regulated genes and other genes. To analyze the phase-specific expression, I chose to plot the peak time distributions for genes that are up- and down-regulated in response to increasing growth rate:

Peak-time distribution for genes that are up- or down-regulated in response to increasing growth rate

In agreement with Brauer et al., genes that are down-regulated at high growth rates appear to have a striking preference for being expressed at the M/G1 transition. However, manual inspection of these genes revealed that more than half of them belong to the Y’ family of DNA helicases, which are encoded by the sub-telomeric regions (striped blue bars). The trend observed by Brauer et al. is thus presumably not due to slower growing cells spending more time in M/G1 phase, as suggested by the authors. Instead, it is likely an artifact of the many Y’ helicase genes found in the sub-telomeric regions of budding yeast, which are so highly homologous that they can cross-hybridize on microarrays and hence all appear to be periodically expressed with identical peak times.

After correcting for this, the down-regulated genes show a weak preference for being expressed during M phase whereas the up-regulated genes tend to be expressed in late G1 and S phase. However, the peak-time distributions of up- and down-regulated genes do not differ significantly from that of all cell-cycle-regulated genes (Kolmogorov-Smirnov test).
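
The statistic behind such a comparison is simple enough to sketch. The two-sample Kolmogorov-Smirnov statistic is the largest vertical distance between the two empirical cumulative distributions. The code below computes only that distance on hypothetical peak times, expressed as percent of the cell cycle; it does not compute a p-value (for that, a library routine such as `scipy.stats.ks_2samp` would be the usual choice), and it treats peak time as a linear rather than a cyclic variable.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest distance between the empirical CDFs of two samples."""
    if not sample_a or not sample_b:
        raise ValueError("both samples must be non-empty")
    a, b = sorted(sample_a), sorted(sample_b)
    largest = 0.0
    # The empirical CDFs only change at observed values, so checking
    # those values is sufficient.
    for value in sorted(set(a) | set(b)):
        cdf_a = bisect_right(a, value) / len(a)
        cdf_b = bisect_right(b, value) / len(b)
        largest = max(largest, abs(cdf_a - cdf_b))
    return largest


# Hypothetical peak times (percent of the cell cycle) for illustration only.
down_regulated_peaks = [5, 12, 80, 85, 90, 95]
all_periodic_peaks = [3, 10, 20, 28, 35, 42, 50, 61, 70, 78, 88, 97]
print(ks_statistic(down_regulated_peaks, all_periodic_peaks))
```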

In summary, my reanalysis suggests that there is no correlation between the transcriptional response to changes in growth rate and transcriptional cell-cycle regulation. It also reiterates the importance of manually inspecting the results from statistical analyses – they may be highly significant for all the wrong reasons.
