
Announcement: Computational analysis of protein-protein interactions for bench biologists

Once again I will be one of the teachers on an EMBO Practical Course. This time we will be teaching wet-lab biologists about how to do computational analysis of protein-protein interactions. The course will take place September 2-8 at the Max Delbrück Center for Molecular Medicine in Berlin, Germany.

The course aims to help bench scientists become more effective at exploiting the wide range of commonly-used databases and bioinformatics tools that can be used to identify, understand, and predict protein interactions by analyzing their structure, sequences, and other features.

The target group for the course is experimental scientists who need to analyse interaction data in their work and who have limited experience using bioinformatics tools and resources. The course covers analyses and tools that are applied after potential interactions have been identified. It does not cover analysis of the raw data from, for example, mass spectrometry.

To apply for the course, fill in the online application form. The registration deadline is Friday June 15th 2012. The course fee is 200 euros for academics and 1000 euros for scientists from industry.


Commentary: The GPU computing fallacy

Modern graphics processors (GPUs) deliver considerably more brute-force computational power than traditional processors (CPUs). With NVIDIA’s launch of CUDA, general-purpose GPU computing has become greatly simplified, and many research groups around the world have consequently experimented with how to harness the power of GPUs to speed up scientific computing.

This is also the case for bioinformatics algorithms. NVIDIA advertises a number of applications that have been adapted to make use of GPUs, including several applications for bioinformatics and life sciences, which supposedly speed up bioinformatics algorithms by an order of magnitude or more.

In this commentary I will focus primarily on two GPU-accelerated versions of NCBI-BLAST, namely CUDA-BLAST and GPU-BLAST. I do so not to specifically criticize these two programs, but because BLAST is the single most commonly used bioinformatics tool and thus a prime example for illustrating whether GPU acceleration of bioinformatics algorithms pays off.

Whereas CUDA-BLAST to the best of my knowledge has not been published in a peer-reviewed journal, GPU-BLAST is described in the following Bioinformatics paper by Vouzis and Sahinidis:

GPU-BLAST: using graphics processors to accelerate protein sequence alignment

Motivation: The Basic Local Alignment Search Tool (BLAST) is one of the most widely used bioinformatics tools. The widespread impact of BLAST is reflected in over 53 000 citations that this software has received in the past two decades, and the use of the word ‘blast’ as a verb referring to biological sequence comparison. Any improvement in the execution speed of BLAST would be of great importance in the practice of bioinformatics, and facilitate coping with ever increasing sizes of biomolecular databases.

Results: Using a general-purpose graphics processing unit (GPU), we have developed GPU-BLAST, an accelerated version of the popular NCBI-BLAST. The implementation is based on the source code of NCBI-BLAST, thus maintaining the same input and output interface while producing identical results. In comparison to the sequential NCBI-BLAST, the speedups achieved by GPU-BLAST range mostly between 3 and 4.

It took me a while to figure out where the 3-4x speedup came from. I eventually found it in Figure 4B of the paper. GPU-BLAST achieves an approximately 3.3x speedup over NCBI-BLAST in only one situation, namely if it is used to perform ungapped sequence similarity searches and only one of six CPU cores is used:

Speedup of GPU-BLAST over NCBI-BLAST as a function of the number of CPU threads used. Figure by Vouzis and Sahinidis.

The vast majority of use cases for BLAST require gapped alignments, however, in which case GPU-BLAST never achieves even a 3x speedup on the hardware used by the authors. Moreover, nobody concerned about the speed of BLAST would buy a multi-core server and leave all but one core idle. The most relevant speedup is thus the speedup achieved by using all CPU cores and the GPU vs. only the CPU cores, in which case GPU-BLAST achieves only a 1.5x speedup over NCBI-BLAST.

The benchmark by NVIDIA does not fare much better. Their 10x speedup comes from comparing CUDA-BLAST to NCBI-BLAST using only a single CPU core. The moment one compares to NCBI-BLAST running with 4 threads on their quad-core Intel i7 CPU, the speedup drops to 3x. However, the CPU supports hyperthreading. To get the full performance out of it, one should thus presumably run NCBI-BLAST with 8 threads, which I estimate will reduce the speedup of CUDA-BLAST vs. NCBI-BLAST to 2.5x at best.

Even these numbers are not entirely fair. They are based on the 3.5x or 4x speedup that one gets by running a single instance of BLAST with 4 or 6 threads, respectively. The typical situation when the speed of BLAST becomes relevant, however, is when you have a large number of sequences that need to be searched against a database. This is an embarrassingly parallel problem; by partitioning the query sequences and running multiple single-threaded instances of BLAST, you can get a 6x speedup on either platform (personal experience shows that running 8 simultaneous BLAST searches on a quad-core CPU with hyperthreading gives approximately 6x speedup).
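For those who want to try this at home, here is a minimal sketch of the partitioning approach in Python. The query file name, the database name, and the number of jobs are placeholders to adapt to your own setup; the blastp options used (-query, -db, -out, -num_threads) are standard NCBI BLAST+ options.

```python
# Minimal sketch: split a FASTA file of query sequences into chunks and run
# one single-threaded BLAST process per chunk. File and database names are
# placeholders; adjust the number of jobs to your CPU.
import subprocess
from multiprocessing import Pool

def run_blast(chunk_file):
    # One single-threaded blastp process per chunk (NCBI BLAST+ options).
    subprocess.run(["blastp", "-query", chunk_file, "-db", "nr",
                    "-out", chunk_file + ".out", "-num_threads", "1"],
                   check=True)

def split_fasta(path, n_chunks):
    # Distribute the FASTA records round-robin over n_chunks files.
    records = open(path).read().split("\n>")
    records = [r if r.startswith(">") else ">" + r for r in records if r.strip()]
    chunk_files = []
    for i in range(n_chunks):
        chunk_file = "%s.chunk%d" % (path, i)
        with open(chunk_file, "w") as out:
            out.write("\n".join(records[i::n_chunks]) + "\n")
        chunk_files.append(chunk_file)
    return chunk_files

if __name__ == "__main__":
    # 8 simultaneous searches worked well on a quad-core CPU with hyperthreading.
    chunks = split_fasta("queries.fasta", 8)
    with Pool(len(chunks)) as pool:
        pool.map(run_blast, chunks)
```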

It is not just BLAST

Optimists could argue that perhaps BLAST is just one of a few bioinformatics problems that do not benefit from GPU computing. However, reading the recent literature, I think that GPU-BLAST is a representative example. Most publications about GPU acceleration of algorithms relevant to bioinformatics report speedups of at most 10x. Typically, this performance number represents the speedup that can be attained relative to a single-threaded version of the program running on the CPU, hence leaving most of the CPU cores standing idle. Not exactly a fair comparison.

Davis et al. recently published a sobering paper in Bioinformatics in which they made exactly that point:

Real-world comparison of CPU and GPU implementations of SNPrank: a network analysis tool for GWAS

Motivation: Bioinformatics researchers have a variety of programming languages and architectures at their disposal, and recent advances in graphics processing unit (GPU) computing have added a promising new option. However, many performance comparisons inflate the actual advantages of GPU technology. In this study, we carry out a realistic performance evaluation of SNPrank, a network centrality algorithm that ranks single nucleotide polymorphisms (SNPs) based on their importance in the context of a phenotype-specific interaction network. Our goal is to identify the best computational engine for the SNPrank web application and to provide a variety of well-tested implementations of SNPrank for Bioinformaticists to integrate into their research.

Results: Using SNP data from the Wellcome Trust Case Control Consortium genome-wide association study of Bipolar Disorder, we compare multiple SNPrank implementations, including Python, Matlab and Java as well as CPU versus GPU implementations. When compared with naïve, single-threaded CPU implementations, the GPU yields a large improvement in the execution time. However, with comparable effort, multi-threaded CPU implementations negate the apparent advantage of GPU implementations.

Kudos for that. They could have published yet another paper with the title “N-fold speedup of algorithm X by GPU computing”. Instead they honestly reported that if one puts the same effort into parallelizing the CPU implementation as it takes to write a massively parallel GPU implementation, one gets about the same speedup.

GPUs cost money

It gets worse. Almost all papers on GPU computing ignore the detail that powerful GPU cards are expensive. It is not surprising that you can make an algorithm run faster by buying a piece of hardware that costs as much as, if not more than, the computer itself. You could have spent that money buying a second computer instead. What matters is not the performance but the price/performance ratio. You do not see anyone publishing papers with titles like “N-fold speedup of algorithm X by using N computers”.

Let us have a quick look at the hardware platforms used for benchmarking the two GPU-accelerated implementations of BLAST. Vouzis and Sahinidis used a server with an Intel Xeon X5650 CPU, which I was able to find for under $3000. For acceleration they used a Tesla C2050 GPU card, which costs another $2500. The hardware necessary to make BLAST ~1.5x faster made the computer ~1.8x more expensive. NVIDIA used a different setup consisting of a server equipped with an Intel i7-920, which I could find for $1500, and two Tesla C1060 GPU cards costing $1300 each. In other words, they used a 2.7x more expensive computer to make BLAST 2.5x faster at best. The bottom line is that the increase in hardware costs outstripped the speed increase in both cases.
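To make the arithmetic explicit, here is the price/performance calculation in a few lines of Python, using only the approximate prices and speedups quoted above:

```python
# Back-of-the-envelope price/performance check. All prices and speedups are
# the approximate figures quoted in the text, not a rigorous market survey.
def speedup_per_cost(base_price, gpu_price, speedup):
    cost_ratio = (base_price + gpu_price) / base_price
    return speedup / cost_ratio  # values below 1 mean the GPU does not pay off

# Vouzis and Sahinidis: ~$3000 Xeon server plus a $2500 Tesla C2050, ~1.5x speedup
print(speedup_per_cost(3000, 2500, 1.5))      # ~0.82
# NVIDIA: ~$1500 i7 system plus two $1300 Tesla C1060 cards, at best ~2.5x speedup
print(speedup_per_cost(1500, 2 * 1300, 2.5))  # ~0.92
```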

But what about the energy savings?

… I hear the die-hard GPU-computing enthusiasts cry. One of the selling points of GPU computing is that GPUs are much more energy efficient than CPUs. I will not question that the peak Gflops delivered by a GPU exceed those of a CPU using the same amount of energy. But does this theoretical number translate into higher energy efficiency when applied to a real-world problem such as BLAST?

The big fan on an NVIDIA Tesla GPU card is not there for show. (Picture from NVIDIA’s website.)

As anyone who has built a gaming computer in recent years can testify, modern-day GPUs use as much electrical power as a CPU, if not more. NVIDIA Tesla Computing Processors are no exception. The two Tesla C1060 cards in the machine used by NVIDIA to benchmark CUDA-BLAST use 187.8 Watts each, or 375.6 Watts in total. By comparison, a basic Intel i7 system like the one used by NVIDIA uses less than 200 Watts. The two Tesla C1060 cards thus triple the power consumption while delivering at most 2.5 times the speed. Similarly, the single Tesla C2050 card used by Vouzis and Sahinidis uses 238 Watts, which is around the same as the power requirement of their base hexa-core Intel Xeon system, thereby doubling the power consumption for less than a 1.5-fold speedup. In other words, using either of the two GPU-accelerated versions of BLAST appears to be less energy efficient than using NCBI-BLAST.
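The same back-of-the-envelope exercise for power consumption, again plugging in only the wattages quoted above:

```python
# Speedup divided by the increase in total power draw; values below 1 mean
# worse performance per Watt than the CPU-only system. Wattages as quoted above.
def speedup_per_watt(base_watts, gpu_watts, speedup):
    power_ratio = (base_watts + gpu_watts) / base_watts
    return speedup / power_ratio

print(speedup_per_watt(200, 375.6, 2.5))  # NVIDIA setup: ~0.87
print(speedup_per_watt(238, 238.0, 1.5))  # Vouzis and Sahinidis: ~0.75
```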

Conclusions

Many of the claims regarding speedup of various bioinformatics algorithms using GPU computing are based on faulty comparisons. Typically, the massively parallel GPU implementation of an algorithm is compared to a serial version that makes use of only a fraction of the CPU’s compute power. Also, the considerable costs associated with GPU computing processors, both in terms of initial investment and power consumption, are usually ignored. Once all of this has been corrected for, GPU computing presently looks like a very bad deal.

There is a silver lining, though. First, everyone uses very expensive Tesla boards in order to achieve the highest possible speedup over the CPU implementations, whereas high-end gaming graphics cards might provide better value for money. However, whether they actually do remains to be seen. Second, certain specific problems such as molecular dynamics probably benefit more from GPU acceleration than BLAST does. In that case, you should be aware that you are buying hardware to speed up one specific type of analysis rather than bioinformatics analyses in general. Third, it is difficult to make predictions – especially about the future. It is possible that future generations of GPUs will change the picture, but that is no reason to buy expensive GPU accelerators today.

The message then is clear. If you are a bioinformatician who likes to live on the bleeding edge while wasting money and electricity, get a GPU compute server. If on the other hand you want something generally useful and well tested and quite a lot faster than a GPU compute server … get yourself some computers.

Analysis: Markov clustering and the case of the nonhomologous orthologs

In the previous blog post I described how the MCL algorithm can sometimes produce unnatural clusters with disconnected parts. The C implementation of MCL has an option to suppress this behavior (--force-connected=y), but I suspect that it is rarely used. I have thus taken a closer look at some notable applications of MCL in bioinformatics to see if unnatural clusters arise in real data sets.
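For readers who missed that post, the core of the algorithm is easy to sketch: a random walk on the network is alternately expanded (matrix multiplication) and inflated (element-wise powers) until it converges to a clustering. Below is a minimal numpy version; it is illustrative only and no substitute for the optimized C implementation, which works on sparse matrices.

```python
# Minimal sketch of Markov clustering (MCL) with numpy. Illustrative only.
import numpy as np

def mcl(adjacency, expansion=2, inflation=2.0, max_iterations=100, tol=1e-6):
    # Add self-loops and column-normalize to obtain a stochastic matrix.
    M = adjacency.astype(float) + np.eye(len(adjacency))
    M /= M.sum(axis=0)
    for _ in range(max_iterations):
        previous = M.copy()
        M = np.linalg.matrix_power(M, expansion)  # expansion: spread the walk
        M = M ** inflation                        # inflation: favor strong edges
        M /= M.sum(axis=0)                        # renormalize columns
        if np.allclose(M, previous, atol=tol):
            break
    # Nodes sharing a nonzero row of the converged matrix form a cluster.
    clusters = []
    for row in M:
        members = frozenset(np.flatnonzero(row > tol))
        if members and members not in clusters:
            clusters.append(members)
    return clusters

# Two triangles joined by a single edge typically separate into two clusters.
A = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]])
print(mcl(A))  # expected: [frozenset({0, 1, 2}), frozenset({3, 4, 5})]
```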

Here I will focus on OrthoMCL-DB, which is a database of orthologous groups of protein sequences. These were constructed by applying the MCL algorithm to the normalized results of an all-against-all BLAST search of the protein sequences.

To check the connectivity of the resulting orthologous groups, I downloaded OrthoMCL version 4, including the 13+ GB of gzipped BLAST results that formed the basis for the MCL clustering. I wish to thank the OrthoMCL-DB team for being very helpful and making this large data set available to me.

A few Perl scripts and CPU hours later, Albert Palleja and I had extracted the BLAST network for each of the 116,536 orthologous groups and performed single-linkage clustering to check if any of them contained disconnected parts. We found that this was the case for the following 28 orthologous groups:

Orthologous group   Protein(s)
OG4_10123           tcru|Tc00.1047053448329.10
OG4_10133           cmer|CMS291C
OG4_11608           bmor|BGIBMGA011561
OG4_13082           lbic|eu2.Lbscf0004g03370
OG4_17434           cint|ENSCINP00000028818, nvec|e_gw.40.282.1
OG4_20715           mbre|fgenesh2_pg.scaffold_4000474
OG4_20953           tpal|NP_218832
OG4_21182           tvag|TVAG_333570
OG4_24433           tmar|NP_229533
OG4_29163           tcru|Tc00.1047053508221.76
OG4_32884           gzea|FGST_11535
OG4_36484           cbri|WBGene00088730, cjej|YP_002344482
OG4_39391           ddis|DDB_G0279421
OG4_43780           cpar|cgd3_1080
OG4_44179           atha|NP_177880
OG4_44684           bmal|YP_104794, rbal|NP_868387
OG4_45409           rcom|29647.m002000, rcom|29848.m004679
OG4_50671           pram|C_scaffold_62000023
OG4_50712           bpse|YP_331887.1
OG4_52326           bmaa|14961.m05365
OG4_52455           bmal|YP_338428
OG4_55725           apis|XP_001952076
OG4_57272           bbov|XP_001610684.1
OG4_58797           hwal|YP_659316
OG4_61264           crei|122343
OG4_68577           bmor|BGIBMGA000864
OG4_71107           cbur|NP_819756
OG4_84041           tcru|Tc00.1047053479883.10

For convenience, the orthologous groups are linked to the corresponding web pages in OrthoMCL-DB, which enable viewing of Pfam domain architectures and multiple sequence alignments. Cursory inspection suggests that the majority of the sequences listed in the table do not belong to the orthologous groups in question.

Of the 28 orthologous groups, 24 groups contain a single protein with no BLAST hits to other group members, 2 groups each contain 2 such singletons, and the remaining 2 groups each contain 2 proteins that show weak similarity to each other but not to any other group members. The latter proteins are highlighted in red.

In summary, this analysis shows that the unnatural clustering by MCL reported for a toy example in the previous post also affects the results of real-world bioinformatics applications of the algorithm.
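For completeness, the connectivity check itself is straightforward. Our actual analysis used Perl scripts on the raw BLAST output, but here is a minimal Python sketch of the idea, assuming the hits within a group have already been reduced to a list of protein pairs:

```python
# Minimal sketch: a group is connected if its BLAST-hit graph has exactly one
# connected component (single-linkage clustering amounts to finding these).
from collections import defaultdict

def connected_components(members, hit_pairs):
    graph = defaultdict(set)
    for a, b in hit_pairs:
        graph[a].add(b)
        graph[b].add(a)
    unseen = set(members)
    components = []
    while unseen:
        stack = [unseen.pop()]  # start a new component from any remaining protein
        component = set(stack)
        while stack:            # depth-first search within the component
            for neighbor in graph[stack.pop()]:
                if neighbor in unseen:
                    unseen.discard(neighbor)
                    component.add(neighbor)
                    stack.append(neighbor)
        components.append(component)
    return components

# A toy group with a singleton: protein C has no hits to A or B.
print(connected_components(["A", "B", "C"], [("A", "B")]))
# two components, e.g. [{'A', 'B'}, {'C'}]
```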

Commentary: Much ado about alignments

There seems to be a new trend in computational biology: worrying about sequence alignments. Over the past couple of months, two high-profile papers have appeared that report flaws related to sequence alignment methods.

The first paper appeared in Science Magazine in January this year. Wong and coworkers describe how uncertainty in multiple alignments can lead to errors in the resulting phylogenetic trees:

Alignment Uncertainty and Genomic Analysis

The statistical methods applied to the analysis of genomic data do not account for uncertainty in the sequence alignment. Indeed, the alignment is treated as an observation, and all of the subsequent inferences depend on the alignment being correct. This may not have been too problematic for many phylogenetic studies, in which the gene is carefully chosen for, among other things, ease of alignment. However, in a comparative genomics study, the same statistical methods are applied repeatedly on thousands of genes, many of which will be difficult to align. Using genomic data from seven yeast species, we show that uncertainty in the alignment can lead to several problems, including different alignment methods resulting in different conclusions.

The second paper appeared in Nature Biotechnology. Styczynski and coworkers discovered that the most commonly used substitution matrix, BLOSUM62, was calculated wrongly:

BLOSUM62 miscalculations improve search performance

The BLOSUM family of substitution matrices, and particularly BLOSUM62, is the de facto standard in protein database searches and sequence alignments. In the course of analyzing the evolution of the Blocks database, we noticed errors in the software source code used to create the initial BLOSUM family of matrices (available online). The result of these errors is that the BLOSUM matrices — BLOSUM62, BLOSUM50, etc. — are quite different from the matrices that should have been calculated using the algorithm described by Henikoff and Henikoff. Obviously, minor errors in research, and particularly in software source code, are quite common. This case is noteworthy for three reasons: first, the BLOSUM matrices are ubiquitous in computational biology; second, these errors have gone unnoticed for 15 years; and third, the ‘incorrect’ matrices perform better than the ‘intended’ matrices.

Upon casual reading of these publications, one could get the idea that over a decade of work based on alignments, sequence similarity searches, and molecular evolution is wrong. Fortunately, this does not appear to be the case.

Starting with the second paper, I applaud the authors for discovering a mistake in such an established method, and I agree with them that it is remarkable that it has not been noticed before. However, I do not think that it is surprising that the ‘incorrect’ matrices work very well. Although they were not calculated as intended, the BLOSUM matrices have become the de facto standard precisely because they work as well as they do.

Regarding the first paper, I think it is fair to say that anyone working on multiple alignments and phylogeny is well aware that uncertain alignments can lead to wrong phylogenetic trees. This is why almost everyone uses programs like Gblocks to remove the ambiguous parts of their alignments before moving on to constructing phylogenetic trees. Unfortunately, Wong et al. instead constructed two sets of trees for each of the six multiple alignment methods: one based on the complete alignments, and one in which they excluded all gapped sites from the phylogenetic analysis. The latter is not equivalent to using a blocked alignment, since not all ambiguously aligned sites contain gaps, and since not all sites with gaps are ambiguously aligned.
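To make the distinction concrete, excluding all gapped sites amounts to the following minimal Python sketch; note that this is not what Gblocks does, since Gblocks also removes ambiguously aligned regions that happen to be gap-free.

```python
# Minimal sketch of the naive "exclude all gapped sites" filtering used by
# Wong et al.; Gblocks additionally removes ambiguous gap-free regions.
def remove_gapped_columns(alignment, gap_chars="-."):
    # alignment: list of equal-length aligned sequences
    columns = zip(*alignment)
    kept = [col for col in columns if not any(c in gap_chars for c in col)]
    return ["".join(chars) for chars in zip(*kept)]

aln = ["MK-LV",
       "MKALV",
       "MK-LI"]
print(remove_gapped_columns(aln))  # ['MKLV', 'MKLV', 'MKLI']
```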

Wong and coworkers subsequently compared the trees that they obtained using the six different alignment programs and found disagreements for almost half of all yeast proteins. This number may sound shockingly high, but I find it to be misleading in several ways. First, “disagreement” was defined as at least one of the six trees disagreeing with the others – much of the disagreement could thus be due to a single poorly performing alignment program. This definition also implies that the results can only get worse by adding more alignment methods to the comparison. Second, the comparison was not limited to the trees that are supported by bootstrap analysis – much of the disagreement is thus due to trees that we already know should not be trusted.

In my view, it would be more fair to make the comparison along the following lines:

  • Align the sequences as done by Wong et al.
  • Remove ambiguously aligned sites with Gblocks
  • Construct phylogenetic trees based on the blocked alignments
  • Calculate the bootstrap support for each tree
  • Discard trees with poor bootstrap support
  • Calculate the agreement on tree topology for each pair of alignment methods

This procedure will ensure that trees are not distorted by the unreliable parts of the alignments, that comparisons are not based on trees we know are unreliable, that the results are not skewed by a single poorly performing alignment method, and that the numbers remain comparable if more alignment methods are added. I have already downloaded all the alignments and run them through Gblocks; please let me know if you would like to continue the analysis from that step, and I will arrange a way to transfer the files.
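The pairwise comparison in the last step could, for example, be done along these lines. This assumes the dendropy library (any toolkit that computes Robinson-Foulds distances would do equally well), and the file names are placeholders; a distance of zero means the two trees agree on topology.

```python
# Sketch of the final step: pairwise topology agreement between trees built
# from different alignment methods. Requires dendropy (pip install dendropy);
# the Newick file names are placeholders.
import dendropy
from dendropy.calculate import treecompare

# Trees must share one taxon namespace to be comparable.
tns = dendropy.TaxonNamespace()
tree_a = dendropy.Tree.get(path="muscle.nwk", schema="newick", taxon_namespace=tns)
tree_b = dendropy.Tree.get(path="clustalw.nwk", schema="newick", taxon_namespace=tns)

# Robinson-Foulds (symmetric difference) distance; 0 means identical topology.
print(treecompare.symmetric_difference(tree_a, tree_b) == 0)
```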

Time might prove me wrong, but I expect that such an analysis will show that alignment uncertainty is not a major factor that needs to be taken into account when constructing phylogenetic trees.
