Archive for the 'Commentary' Category

Commentary: Intel’s take on GPU computing

February 7, 2011

A week or two ago, I published a post in which I argued that most papers, which report order of magnitude speedup of a bioinformatics algorithm by using graphics processors (GPUs), did so based straw man comparisons:

  • Massively parallel GPU implementations were compared to CPU implementations that did not make full use of the multi-core and SIMD (Single Instruction, Multiple Data) features.
  • The performance comparisons were done using very expensive GPU processing cards that cost as much, if not more, than the host computers.

It turns out that Lee and coworkers from Intel Corporation have performed a comparison that addresses both of these issues (thanks to Casey Bergman for making me aware of this). It appeared in 2010 in the proceedings of the 37th International Symposium on Computer Architecture:

Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU and GPU

Recent advances in computing have led to an explosion in the amount of data being generated. Processing the ever-growing data in a timely manner has made throughput computing an important aspect for emerging applications. Our analysis of a set of important throughput computing kernels shows that there is an ample amount of parallelism in these kernels which makes them suitable for today’s multi-core CPUs and GPUs. In the past few years there have been many studies claiming GPUs deliver substantial speedups (between 10X and 1000X) over multi-core CPUs on these kernels. To understand where such large performance difference comes from, we perform a rigorous performance analysis and find that after applying optimizations appropriate for both CPUs and GPUs the performance gap between an Nvidia GTX280 processor and the Intel Core i7 960 processor narrows to only 2.5x on average. In this paper, we discuss optimization techniques for both CPU and GPU, analyze what architecture features contributed to performance differences between the two architectures, and recommend a set of architectural features which provide significant improvement in architectural efficiency for throughput kernels.

Without wanting to question the integrity of the researchers, I read the paper with a critical mind for an obvious reason: they work for Intel who is a major player on the CPU market, but not on the GPU market. One always needs to have a critical mind when economic interests are involved. However, it is my clear impression that the researchers did their very best to make each and every algorithm run as fast as possible on the GPU as well as on the CPU.

The only factor I found, which may have tweaked the balance a bit in the favor of the CPU, was the choice of CPU and GPU. Whereas the Intel Core i7 960 was launched in October 2009, the Nvidia GTX 280 was launched in June 2008. That is a difference of 16 months, which by application of Moore’s law skews the results by almost 2x in favor of the CPU. The average speed-up provided by a high-end gaming GPU over a high-end CPU on this selection of algorithms is thus likely to be 4-5x. However, this advantage drops to about 3-4x if one corrects for the additional cost of the GPU, and to around 2x if one corrects for energy efficiency rather than for initial investment.

The findings of Lee and coworkers are consistent with my own conclusions, which were based on comparing two GPU-accelerated implementations of BLAST. In case of BLAST, the price-performance of GPU implementations ended up worse than that of the CPU implementation. Lee and coworkers found that for a wide variety of highly data-parallel algorithms (none of which are directly related to bioinformatics), only a modest speedup was attained. Not even a single algorithm got anywhere close to the promises of 100x or 1000x speedup, and a couple of algorithms ended up being slower on the GPU than on the CPU. This confirms my view that GPUs are presently not an attractive alternative to CPUs for most scientific computing needs.

Commentary: The GPU computing fallacy

January 28, 2011

Modern graphics processors (GPUs) deliver considerably more brute force computational power than traditional processors (CPUs). With NVIDIA’s launch of CUDA, general purpose GPU computing has become greatly simplified, and many research groups around the world have consequently experimented with how one can harvest the power of GPUs to speed up scientific computing.

This is also the case for bioinformatics algorithms. NVIDIA advertises a number of applications that have been adapted to make use of GPUs, including several applications for bioinformatics and life sciences, which supposedly speed up bioinformatics algorithms by an order of magnitude or more.

In this commentary I will focus primarily on two GPU-accelerated versions of NCBI-BLAST, namely CUDA-BLAST and GPU-BLAST. I do so not to specifically criticize these two programs, but because BLAST is the single most commonly used bioinformatics tool and thus a prime example for illustrating whether GPU acceleration of bioinformatics algorithms pays off.

Whereas CUDA-BLAST to the best of my knowledge has not been published in a peer-reviewed journal, GPU-BLAST is described in the following Bioinformatics paper by Vouzis and Sahinidis:

GPU-BLAST: using graphics processors to accelerate protein sequence alignment

Motivation: The Basic Local Alignment Search Tool (BLAST) is one of the most widely used bioinformatics tools. The widespread impact of BLAST is reflected in over 53 000 citations that this software has received in the past two decades, and the use of the word ‘blast’ as a verb referring to biological sequence comparison. Any improvement in the execution speed of BLAST would be of great importance in the practice of bioinformatics, and facilitate coping with ever increasing sizes of biomolecular databases.

Results: Using a general-purpose graphics processing unit (GPU), we have developed GPU-BLAST, an accelerated version of the popular NCBI-BLAST. The implementation is based on the source code of NCBI-BLAST, thus maintaining the same input and output interface while producing identical results. In comparison to the sequential NCBI-BLAST, the speedups achieved by GPU-BLAST range mostly between 3 and 4.

It took me a while to figure out from where the 3-4x speedup came. I eventually found it in Figure 4B of the paper. GPU-BLAST achieves an approximately 3.3x speedup over NCBI-BLAST in only one situation, namely if it is used to perform ungapped sequence similarity searches and only one of six CPU cores is used:

Speedup of GPU-BLAST over NCBI-BLAST as function of number of CPU threads used. Figure by Vouzis and Sahinidis.

The vast majority of use cases for BLAST require gapped alignments, however, in which case GPU-BLAST never achieves even a 3x speedup on the hardware used by the authors. Moreover, nobody concerned about the speed of BLAST would buy a multi-core server and leave all but one core idle. The most relevant speedup is thus the speedup achieved by using all CPU cores and the GPU vs. only the CPU cores, in which case GPU-BLAST achieves only a 1.5x speedup over NCBI-BLAST.

The benchmark by NVIDIA does not fair much better. Their 10x speedup comes from comparing CUDA-BLAST to NCBI-BLAST using only a single CPU core. The moment one compares to NCBI-BLAST running with 4 threads on their quad-core Intel i7 CPU, the speedup drops to 3x. However, the CPU supports hyperthreading. To get the full performance out of it, one should thus presumably run NCBI-BLAST with 8 threads, which I estimate will reduce the speedup of CUDA-BLAST vs. NCBI-BLAST to 2.5x at best.

Even these numbers are not entirely fair. They are based on the 3.5x or 4x speedup that one gets by running a single instance of BLAST with 4 or 6 threads, respectively. The typical situation when the speed of BLAST becomes relevant, however, is when you have a large number of sequences that need to be searched against a database. This is an embarrassingly parallel problem; by partitioning the query sequences and running multiple single-threaded instances of BLAST, you can get a 6x speedup on either platform (personal experience shows that running 8 simultaneous BLAST searches on a quad-core CPU with hyperthreading gives approximately 6x speedup).

It is not just BLAST

Optimists could argue that perhaps BLAST is just one of few bioinformatics problems that do not benefit from GPU computing. However, reading the recent literature, I think that GPU-BLAST is a representative example. Most publications about GPU acceleration of algorithms relevant to bioinformatics report speedups of at most 10x. Typically, this performance number represents the speedup that can be attained relative to a single-threaded version of the program running on the CPU, hence leaving most of the CPU cores standing idle. Not exactly a fair comparison.

Davis et al. recently published a sobering paper in Bioinformatics in which they made exactly that point:

Real-world comparison of CPU and GPU implementations of SNPrank: a network analysis tool for GWAS

Motivation: Bioinformatics researchers have a variety of programming languages and architectures at their disposal, and recent advances in graphics processing unit (GPU) computing have added a promising new option. However, many performance comparisons inflate the actual advantages of GPU technology. In this study, we carry out a realistic performance evaluation of SNPrank, a network centrality algorithm that ranks single nucleotide polymorhisms (SNPs) based on their importance in the context of a phenotype-specific interaction network. Our goal is to identify the best computational engine for the SNPrank web application and to provide a variety of well-tested implementations of SNPrank for Bioinformaticists to integrate into their research.

Results: Using SNP data from the Wellcome Trust Case Control Consortium genome-wide association study of Bipolar Disorder, we compare multiple SNPrank implementations, including Python, Matlab and Java as well as CPU versus GPU implementations. When compared with naïve, single-threaded CPU implementations, the GPU yields a large improvement in the execution time. However, with comparable effort, multi-threaded CPU implementations negate the apparent advantage of GPU implementations.

Kudos for that. They could have published yet another paper with the title “N-fold speedup of algorithm X by GPU computing”. Instead they honestly reported that if one puts the same effort into parallelizing the CPU implementation as it takes to write a massively parallel GPU implementation, one gets about the same speedup.

GPUs cost money

It gets worse. Almost all papers on GPU computing ignore the detail that powerful GPU cards are expensive. It is not surprising that you can make an algorithm run faster by buying a piece of hardware that costs as much if not more than the computer itself. You could have spent that money buying a second computer instead. What matters is not the performance but the price/performance ratio. You do not see anyone publishing papers with titles like “N-fold speed up of algoritm X by using N computers”.

Let us have a quick look at the hardware platforms used for benchmarking the two GPU-accelerated implementations of BLAST. Vouzis and Sahinidis used a server with an Intel Xeon X5650 CPU, which I was able to find for under $3000. For acceleration they used a Tesla C2050 GPU card, which costs another $2500. The hardware necessary to make BLAST ~1.5x faster made the computer ~1.8x more expensive. NVIDIA used a different setup consisting of a server equipped with an Intel i7-920, which I could find for $1500, and two Tesla C1060 GPU cards costing $1300 each. In other words, they used a 2.7x more expensive computer to make BLAST 2.5x faster at best. The bottom line is that the increase in hardware costs outstripped the speed increase in both cases.

But what about the energy savings?

… I hear the die-hard GPU-computing enthusiasts cry. One of the selling arguments for GPU computing is that GPUs are much more energy efficient than CPUs. I will not question the fact that the peak Gflops delivered by a GPU exceeds that of CPUs using the same amount of energy. But does this theoretical number translate into higher energy efficiency when applied to a real-world problem such as BLAST?

The big fan on an NVIDIA Tesla GPU card is not there for show. (Picture from NVIDIA’s website.)

As anyone who has build a gaming computer in recent years can testify, modern day GPUs use as much electrical power as a CPU if not more. NVIDIA Tesla Computing Processors are no exception. The two Tesla C1060 cards in the machine used by NVIDIA to benchmark CUDA-BLAST use 187.7 Watts each, or 375.6 Watts in total. By comparison a basic Intel i7 system like the one used by NVIDIA uses less than 200 Watts. The two Tesla C1060 cards thus triple the power consumption while delivering at most 2.5 times the speed. Similarly, the single Tesla C2050 card used by Vouzis and Sahinidis uses 238 Watts, which is around the same as the power requirement of their base hexa-core Intel Xeon system, thereby doubling the power consumption for less than a 1.5-fold speedup. In other words, using either of the two GPU-accelerated versions of BLAST appears to be less energy efficient than using NCBI-BLAST.

Conclusions

Many of the claims regarding speedup of various bioinformatics algorithms using GPU computing are based on faulty comparisons. Typically, the massively parallel GPU implementation of an algorithm is compared to a serial version that makes use of only a fraction of the CPU’s compute power. Also, the considerable costs associated with GPU computing processors, both in terms of initial investment and power consumption, are usually ignored. Once all of this has been corrected for, GPU computing presently looks like a very bad deal.

There is a silver lining, though. First, everyone uses very expensive Tesla boards in order to achieve the highest possible speedup over the CPU implementations, whereas high-end gaming graphics cards might provide better value for money. However, the evidence for this remains to be seen. Second, certain specific problems such as molecular dynamics probably benefit more from GPU acceleration than BLAST does. In that case, you should be aware that you are buying hardware to speed up one specific type of analysis rather than bioinformatics analyses in general. Third, it is difficult to make predictions – especially about the future. It is possible that future generations of GPUs will change the picture, but that is no reason for buying expensive GPU accelerators today.

The message then is clear. If you are a bioinformatician who likes to live on the bleeding edge while wasting money and electricity, get a GPU compute server. If on the other hand you want something generally useful and well tested and quite a lot faster than a GPU compute server … get yourself some computers.

Commentary: When Open Access isn’t

June 27, 2010

This week, PLoS ONE published an interesting paper by Bo-Christer Björk and coworkers on the free global availability of articles from scientific journals. One of the principal findings in this study is that 20.4% of articles published in 2008 are now available as Open Access (OA):

Open Access to the Scientific Journal Literature: Situation 2009

Background: The Internet has recently made possible the free global availability of scientific journal articles. Open Access (OA) can occur either via OA scientific journals, or via authors posting manuscripts of articles published in subscription journals in open web repositories. So far there have been few systematic studies showing how big the extent of OA is, in particular studies covering all fields of science.

Methodology/Principal Findings: The proportion of peer reviewed scholarly journal articles, which are available openly in full text on the web, was studied using a random sample of 1837 titles and a web search engine. Of articles published in 2008, 8,5% were freely available at the publishers’ sites. For an additional 11,9% free manuscript versions could be found using search engines, making the overall OA percentage 20,4%. Chemistry (13%) had the lowest overall share of OA, Earth Sciences (33%) the highest. In medicine, biochemistry and chemistry publishing in OA journals was more common. In all other fields author-posted manuscript copies dominated the picture.

Conclusions/Significance: The results show that OA already has a significant positive impact on the availability of the scientific journal literature and that there are big differences between scientific disciplines in the uptake. Due to the lack of awareness of OA-publishing among scientists in most fields outside physics, the results should be of general interest to all scholars. The results should also interest academic publishers, who need to take into account OA in their business strategies and copyright policies, as well as research funders, who like the NIH are starting to require OA availability of results from research projects they fund. The method and search tools developed also offer a good basis for more in-depth studies as well as longitudinal studies.

Having just set up a mirror of the OA subset of PubMed Central, I know that it contains only ~10% of the articles deposited in PubMed Central and only ~1% of the articles indexed by PubMed. It was thus with equal doses of joy and scepticism that I read numbers reported by Bo-Christer Björk and coworkers.

It soon became clear to me that the study did not adhere to the OA definition by the Budapest Open Access Initiative, which is as follows:

By ‘open access’ to this literature, we mean its free availability on the public internet, permitting any users to read, download, copy, distribute, print, search, or link to the full texts of these articles, crawl them for indexing, pass them as data to software, or use them for any other lawful purpose, without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself. The only constraint on reproduction and distribution, and the only role for copyright in this domain, should be to give authors control over the integrity of their work and the right to be properly acknowledged and cited.

The Bo-Christer Björk et al. do not define what exactly they mean by OA. However, from reading their paper is is pretty clear that any article for which they can get hold of free full text is counted as OA. The license under which the copy is distributed does not to matter, and they thus count the 90% of articles in PubMed Central that are published under non-OA licenses as OA. It does not even seem to matter if the free full text is legal or not, implying that any article of which an illegal copy can be found somewhere on the web is counted as OA.

I have heard of Gold OA and Green OA. It is tempting to call this Black OA. But I won’t. Because it just isn’t OA.

Commentary: On large protein complexes and the essentiality of hubs

August 2, 2008

In 2001, Jeong and coworkers published a paper in Nature in which they showed that the central proteins in interaction networks, that is the proteins with the highest connectivity, are enriched for essential proteins. This publication has been highly influential as evidenced by the numerous subsequent publications on the importance of “hub” proteins. Several hypothesis have been published that try to explain why hubs are essential, for example that certain protein interactions are essential and that a protein with many interactions is thus more likely to be involved in at least one essential interaction (He and Zhang, 2006).

Yesterday, Zotenko and coworkers published a paper in PLoS Computational Biology in which they take a closer look at the cause of this phenomenon:

Why Do Hubs in the Yeast Protein Interaction Network Tend To Be Essential: Reexamining the Connection between the Network Topology and Essentiality.

The centrality-lethality rule, which notes that high-degree nodes in a protein interaction network tend to correspond to proteins that are essential, suggests that the topological prominence of a protein in a protein interaction network may be a good predictor of its biological importance. Even though the correlation between degree and essentiality was confirmed by many independent studies, the reason for this correlation remains illusive. Several hypotheses about putative connections between essentiality of hubs and the topology of protein-protein interaction networks have been proposed, but as we demonstrate, these explanations are not supported by the properties of protein interaction networks. To identify the main topological determinant of essentiality and to provide a biological explanation for the connection between the network topology and essentiality, we performed a rigorous analysis of six variants of the genomewide protein interaction network for Saccharomyces cerevisiae obtained using different techniques. We demonstrated that the majority of hubs are essential due to their involvement in Essential Complex Biological Modules, a group of densely connected proteins with shared biological function that are enriched in essential proteins. Moreover, we rejected two previously proposed explanations for the centrality-lethality rule, one relating the essentiality of hubs to their role in the overall network connectivity and another relying on the recently published essential protein interactions model.

What Zotenko et al. show is, in other words, that essential hubs tend to be highly connected with each other and hence form large “Essential Complex Biological Modules”. Table 7 in their paper lists the Gene Ontology terms associated with these modules; among the recurring themes are “rRNA metabolic process”, “mRNA metabolic process”, “RNA splicing”, “ribosome biogenesis and assembly”, and “proteolysis”. These Gene Ontology terms obviously correspond to well known protein complexes, namely the RNA polymerases, the spliceosome, the ribosome, and the proteoasome. The analysis of Zotenko et al. thus suggests that the much debated correlation between centrality and essentiality is simply a consequence of the fact that many of the large protein complexes in a eukaryotic cell are essential, which is hardly surprising considering that they have been conserved through more than two billion years of evolution (Brocks et al., 1999).

Edit: For more views on the results of Zotenko et al. see the discussion on FriendFeed.

WebCiteCite this post

Commentary: Open access equals bulk publishing?

July 5, 2008

This week Nature published a News piece by Declan Butler with the rather provocative title “PLoS stays afloat with bulk publishing”. Unsurprisingly, this caused a backlash from open-access advocates in general and science bloggers in particular. Jonathan Eisen posted the ironic response “Only Nature could turn the success of PLoS One into a model of failure”. For an overview of the many other responses from the blogosphere see the summary by Coturnix and the long debate on FriendFeed.

The core of the criticism by Declan Butler was directed against the business model of the Public Library of Science (PLoS), in particular that a large part of their total income is produced by “bulk publishing” in the “database” PLoS ONE with only “light” peer review. There is no point in denying that PLoS ONE is a major source of income for PLoS, that it publishes many papers, and that it is not a top-tier journal. Still, it is in my view an unnecessary provocation to refer to a journal from a competitor as a “database” and between the lines suggest that they do not perform proper peer review.

I have nothing against Nature Publishing Group (NPG) – they are in my view one of the more progressive publishers with initiative such as Connotea and Nature Network. However, I find the criticism by Declan Butler somewhat unfair, especially considering that NPG also has a considerable number of lower impact journals in their portfolio in addition to their lineup of Nature journals. To illustrate this point, I looked up the impact factors for all the PLoS and NPG journals that I could find (6 and 68, respectively) and plotted the distributions:

The average impact factors of the two publishers are remarkably similar 9.19 for PLoS and 9.39 for NPG, but the underlying distributions are very different. Notably, the high average impact factor of NPG’s journals is due to a fairly small number of journals with impact factors over 20, which are sufficient to offset the large number of journals with impact factors below 5. Consequently, the median impact factors are 9.03 for PLoS and only 4.88 for NPG.

I want to be the first to point out the caveats of this analysis. First, the analysis above did not take into account that each journal does not publish the same number of papers. However, weighting the journals by number of papers when calculating average impact factors shifts the balance in favor of PLoS (9.79 for PLoS vs. 9.46 for NPG). Second, the journal PLoS ONE does not have an impact factor yet and was thus not included in my analysis. Third, the criticism by Declan Butler was mainly targeting the fact that much of PLoS’ revenue is due to PLoS ONE. However, until NPG chooses to make available detailed financial reports like PLoS does, it is impossible to tell how much of their revenue comes from lower-impact journals.

That being said, the business models of PLoS and NPG do not look all that different based on bibliographic metrics alone.

Full disclosure: I am an associate editor of PLoS Computational Biology.

WebCiteCite this post

Commentary: Summarizing papers as word clouds

June 27, 2008

For use in presentations on literature mining, I did a back-of-the-envelope calculation of how much time I would be able to spend on each new biomedical paper that is published. Assuming that all papers were indexed in PubMed (which they are not) and that I could read papers 24 hours per day all year around (which I cannot), the result is that I could allocate approximately 50 seconds per paper. This nicely illustrates the point that no one can keep up with the complete biomedical literature.

When I discovered Wordle, which can turn any text into a beautiful word cloud, I thus wondered if this visualization method would be useful for summarizing a complete paper as a single figure. To test this, I extracted the complete text of three papers that I coauthored in the NAR database issue 2008. Submitting these to Wordle resulted in the three figures below (click for larger versions):


All in all, I think that Wordle does a pretty good job at capturing the essence of each paper: the first cloud shows that STITCH is a database of interactions between proteins and chemicals, the second cloud shows that NetworKIN is a database predictions related to the kinases and phosphorylation, and the third cloud shows that Cyclebase.org is a database of experiments on gene expression during the cell cycle. However, a paper describing a database might be easier to summarize that a typical research paper.

As a final test, I therefore submitted the complete text from my paper “Evolution of Cell Cycle Control – Same molecular machines, different regulation”, which describes the somewhat complex concept of just-in-time assembly to Wordle (click for larger version):

The result is rather less impressive than for the papers from the NAR database issue. Although the word cloud does contain a good selection of words, it fails to convey the main message. I think a large part of the problem is the splitting of multiwords; for example, “cell cycle” becomes two separate terms “cell” and “cycle”. Another problem is that words from different sections of the paper are mixed, which blurs the messages. These two issues could be solved by 1) detecting multiwords and considering them as single tokens, and 2) sorting the terms according to where in the paper they are mainly used.

WebCiteCite this post

Commentary: Does size matter?

May 6, 2008

I recently took a look at colonization of titles and found that the fraction of papers with colons in their titles is increasing steadily. Intuitively, one would thus expect that the average length of the titles has also increased. The plot below shows that this is indeed the case (not that the y-axis does not begin at zero):

The average title length has increased from 8.5 words in 1950 to 12.5 words in 2008. Strangely, the increase is almost perfectly linear except for a fluctuation in the early 60s – I have no idea why this is the case.

But is the title length of a paper important? I personally expected that papers with short, catchy titles would be cited more than papers with longer, more complex titles. Lacking citation information for individual publications, I thus calculated average title length for publications from each journal and correlated it with the ISI impact factor of the corresponding journal:

No correlation is observed between the impact factor of a journal and the average title length of the papers published therein. So we can conclude that – at least for titles of scientific papers – size does not matter.

WebCiteCite this post

Commentary: Colonization of titles

April 22, 2008

You have probably noticed that a high fraction of scientific papers have colons in their titles. Several people have written humorous commentaries on this. Although these authors clearly see the use of colons as a growing trend, they did not present hard evidence for the increase in the usage of colons in the titles of scientific publications.

Out of curiosity, I thus wrote a small script to count the fraction of papers in Medline that have colons in their titles for each of the past 25 years. The result is shown in the plot below (note that the y-axis does not start at zero):

The conclusion is very clear: the fraction of titles with colons has increased linearly from 15% to 24% over the past 20 years. One could object that this effect may be explained by the increase in apologies (which often have a title “Retraction: …”) or by the NAR special issues on databases and web servers (which contain hundreds papers with titles such as “YADB: yet another database”). However, these add up to less than 2% of the papers with colonized titles and are thus insufficient to explain the observed 9% increase.

WebCiteCite this post

Commentary: Viewing the cell cycle in a new light

April 6, 2008

Atsushi Miyawaki’s lab from RIKEN has recently published a Cell paper that describes a novel approach for how to monitor cell-cycle progression of individual cells:

Visualizing spatiotemporal dynamics of multicellular cell-cycle progression

The cell-cycle transition from G1 to S phase has been difficult to visualize. We have harnessed antiphase oscillating proteins that mark cell-cycle transitions in order to develop genetically encoded fluorescent probes for this purpose. These probes effectively label individual G1 phase nuclei red and those in S/G2/M phases green. We were able to generate cultured cells and transgenic mice constitutively expressing the cell-cycle probes, in which every cell nucleus exhibits either red or green fluorescence. We performed time-lapse imaging to explore the spatiotemporal patterns of cell-cycle dynamics during the epithelial-mesenchymal transition of cultured cells, the migration and differentiation of neural progenitors in brain slices, and the development of tumors across blood vessels in live mice. These mice and cell lines will serve as model systems permitting unprecedented spatial and temporal resolution to help us better understand how the cell cycle is coordinated with various biological events.

The clever idea was to fuse a red- and a green-emitting fluorescent protein to Cdt1 and Geminin, respectively. Cdt1 is ubiquitinated by SCFSkp2 at the onset of S phase, which causes it to be rapidly degraded by the proteasome, whereas Geminin is targeted for proteolytic degradation by APCCdh1 in late M phase. By fluorescent labeling of two proteins, Miyawaki and colleagues managed to make mouse cells that become increasingly red during G1 phase, yellow around the G1/S transition, and increasingly green through S, G2, and M phase. It is thus possible to monitor the cell-cycle states of individual cells with a microscope.

The movie below follows a few HeLa cells for 3-4 cell cycles:

The authors also show how their construct can be used for imaging the cell-cycle state of the cells in a slice of a mouse brain or a mouse embryo. I expect that this will become an indispensable tool for unraveling the links between cell-cycle control and developmental processes.

For more details, I strongly recommend that you read Jake Young’s post at Pure Pedantry.

WebCiteCite this post

Commentary: Much ado about alignments

March 30, 2008

There seems to be a new trend in computational biology: worrying about sequence alignments. Over the past couple of months, two high-profile papers have appeared that flaws related to sequence alignment methods.

The first paper appeared in Science Magazine in January this year. Wong and coworkers describe how uncertainties in multiple alignments can lead to errors in different phylogenetic trees:

Alignment Uncertainty and Genomic Analysis

The statistical methods applied to the analysis of genomic data do not account for uncertainty in the sequence alignment. Indeed, the alignment is treated as an observation, and all of the subsequent inferences depend on the alignment being correct. This may not have been too problematic for many phylogenetic studies, in which the gene is carefully chosen for, among other things, ease of alignment. However, in a comparative genomics study, the same statistical methods are applied repeatedly on thousands of genes, many of which will be difficult to align. Using genomic data from seven yeast species, we show that uncertainty in the alignment can lead to several problems, including different alignment methods resulting in different conclusions.

The second paper appeared in Nature Biotechnology. Styczynski and coworkers discovered that the most commonly used substitution matrix, BLOSUM62, was calculated wrongly:

BLOSUM62 miscalculations improve search performance

The BLOSUM family of substitution matrices, and particularly BLOSUM62, is the de facto standard in protein database searches and sequence alignments. In the course of analyzing the evolution of the Blocks database, we noticed errors in the software source code used to create the initial BLOSUM family of matrices (available online). The result of these errors is that the BLOSUM matrices — BLOSUM62, BLOSUM50, etc. — are quite different from the matrices that should have been calculated using the algorithm described by Henikoff and Henikoff. Obviously, minor errors in research, and particularly in software source code, are quite common. This case is noteworthy for three reasons: first, the BLOSUM matrices are ubiquitous in computational biology; second, these errors have gone unnoticed for 15 years; and third, the ‘incorrect’ matrices perform better than the ‘intended’ matrices.

Upon casual reading of these publications, one could get the idea that over a decade of work based on alignments, sequence similarity searches, and molecular evolution is wrong. Fortunately, this does not appear to be the case.

Starting with the second paper, I applaud the authors for discovering a mistake in such an established method, and I agree with them that it is remarkable that it has not been noticed before. However, I do not think that it is surprising that the ‘incorrect’ matrices work very well. Although they were not calculated as intended, the BLOSUM matrices have become the de facto standard precisely because they work as well as they do.

Regarding the first paper, I think it is fair to say that anyone working on multiple alignments and phylogeny are well aware that uncertain alignments can lead to wrong phylogenetic trees. This is why almost everyone uses programs like Gblocks to remove the ambiguous parts of their alignments before moving on to constructing phylogenetic trees. Unfortunately, Wong et al. instead constructed two sets of trees for each of the six multiple alignment methods: one based on the complete alignments, and one in which they excluded all gapped sites from the phylogenetic analysis. The latter is not equivalent to using a blocked alignment, since not all ambiguously aligned sites contain gaps, and since not all sites with gaps are ambiguously aligned.

Wong and coworkers subsequently compared the trees that they obtained using the six different alignment programs and found disagreements for almost half of all yeast proteins. This number may sound shockingly high, but I find it to be misleading in several ways. First, “disagreement” was defined as at least one of the six trees disagreeing with the others – much of the disagreement could thus be due to a single poorly performing alignment program. This definition also implies that the results can only get worse by adding more alignment methods to the comparison. Second, the comparison was not limited to the trees that are supported by bootstrap analysis – much of the disagreement is thus due to trees that we already know should not be trusted.

In my view, it would be more fair to make the comparison along the following lines:

  • Align the sequences as done by Wong et al.
  • Remove ambiguously aligned sites with Gblocks
  • Construct phylogenetic trees based on the blocked alignments
  • Calculate the bootstrap support for each tree
  • Discard trees with poor bootstrap support
  • Calculate the agreement on tree topology for each pair of alignment methods

This procedure will ensure that trees are not distorted by the unreliable parts of the alignments, that comparisons are not based on trees we know are unreliable, that the results are not skewed by a single poorly performing alignment method, and that the numbers remain comparable if more alignment methods are added. I have already downloaded all the alignments and run then through Gblocks; please let me know if you would like to continue the analysis from that step, and I will arrange a way to transfer the files.

Time might prove me wrong, but I expect that such an analysis will show that alignment uncertainty is not a major factor that needs to be taken into account when constructing phylogenetic trees.

WebCiteCite this post

Follow

Get every new post delivered to your Inbox.

Join 445 other followers