Commentary: The GPU computing fallacy

Modern graphics processors (GPUs) deliver considerably more brute-force computational power than traditional processors (CPUs). With NVIDIA’s launch of CUDA, general-purpose GPU computing has become greatly simplified, and many research groups around the world have consequently experimented with how to harness the power of GPUs to speed up scientific computing.

This is also the case for bioinformatics algorithms. NVIDIA advertises a number of applications that have been adapted to make use of GPUs, including several applications for bioinformatics and life sciences, which supposedly speed up bioinformatics algorithms by an order of magnitude or more.

In this commentary I will focus primarily on two GPU-accelerated versions of NCBI-BLAST, namely CUDA-BLAST and GPU-BLAST. I do so not to specifically criticize these two programs, but because BLAST is the single most commonly used bioinformatics tool and thus a prime example for illustrating whether GPU acceleration of bioinformatics algorithms pays off.

Whereas CUDA-BLAST to the best of my knowledge has not been published in a peer-reviewed journal, GPU-BLAST is described in the following Bioinformatics paper by Vouzis and Sahinidis:

GPU-BLAST: using graphics processors to accelerate protein sequence alignment

Motivation: The Basic Local Alignment Search Tool (BLAST) is one of the most widely used bioinformatics tools. The widespread impact of BLAST is reflected in over 53 000 citations that this software has received in the past two decades, and the use of the word ‘blast’ as a verb referring to biological sequence comparison. Any improvement in the execution speed of BLAST would be of great importance in the practice of bioinformatics, and facilitate coping with ever increasing sizes of biomolecular databases.

Results: Using a general-purpose graphics processing unit (GPU), we have developed GPU-BLAST, an accelerated version of the popular NCBI-BLAST. The implementation is based on the source code of NCBI-BLAST, thus maintaining the same input and output interface while producing identical results. In comparison to the sequential NCBI-BLAST, the speedups achieved by GPU-BLAST range mostly between 3 and 4.

It took me a while to figure out where the 3-4x speedup came from. I eventually found it in Figure 4B of the paper. GPU-BLAST achieves an approximately 3.3x speedup over NCBI-BLAST in only one situation, namely when it is used to perform ungapped sequence similarity searches and only one of six CPU cores is used:

Speedup of GPU-BLAST over NCBI-BLAST as a function of the number of CPU threads used. Figure by Vouzis and Sahinidis.

The vast majority of use cases for BLAST require gapped alignments, however, in which case GPU-BLAST never achieves even a 3x speedup on the hardware used by the authors. Moreover, nobody concerned about the speed of BLAST would buy a multi-core server and leave all but one core idle. The most relevant speedup is thus the speedup achieved by using all CPU cores and the GPU vs. only the CPU cores, in which case GPU-BLAST achieves only a 1.5x speedup over NCBI-BLAST.

The benchmark by NVIDIA does not fare much better. Their 10x speedup comes from comparing CUDA-BLAST to NCBI-BLAST using only a single CPU core. The moment one compares to NCBI-BLAST running with 4 threads on their quad-core Intel i7 CPU, the speedup drops to 3x. However, the CPU supports hyperthreading. To get the full performance out of it, one should thus presumably run NCBI-BLAST with 8 threads, which I estimate would reduce the speedup of CUDA-BLAST over NCBI-BLAST to at best 2.5x.

Even these numbers are not entirely fair. They are based on the 3.5x or 4x speedup that one gets by running a single instance of BLAST with 4 or 6 threads, respectively. The typical situation when the speed of BLAST becomes relevant, however, is when you have a large number of sequences that need to be searched against a database. This is an embarrassingly parallel problem; by partitioning the query sequences and running multiple single-threaded instances of BLAST, you can get a 6x speedup on either platform (personal experience shows that running 8 simultaneous BLAST searches on a quad-core CPU with hyperthreading gives approximately 6x speedup).
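
To make this concrete, below is a minimal sketch of the partition-and-run approach, assuming the NCBI BLAST+ blastp binary and a preformatted protein database are available; the chunk file names, chunk count and database name are placeholders, not part of any of the benchmarks discussed here.

```python
import subprocess
from pathlib import Path

# Placeholder pre-split query files: queries.part0.fasta, queries.part1.fasta, ...
chunks = sorted(Path(".").glob("queries.part*.fasta"))

# Launch one single-threaded BLAST instance per chunk and let the operating
# system schedule them across all (hyperthreaded) cores.
procs = [
    subprocess.Popen([
        "blastp",
        "-query", str(chunk),
        "-db", "nr",                # placeholder: any preformatted protein database
        "-num_threads", "1",        # one thread per instance
        "-out", f"{chunk.stem}.blast.out",
    ])
    for chunk in chunks
]

for p in procs:
    p.wait()
```

On a quad-core CPU with hyperthreading, eight such chunks is a sensible starting point.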

It is not just BLAST

Optimists could argue that perhaps BLAST is just one of few bioinformatics problems that do not benefit from GPU computing. However, reading the recent literature, I think that GPU-BLAST is a representative example. Most publications about GPU acceleration of algorithms relevant to bioinformatics report speedups of at most 10x. Typically, this performance number represents the speedup that can be attained relative to a single-threaded version of the program running on the CPU, hence leaving most of the CPU cores standing idle. Not exactly a fair comparison.

Davis et al. recently published a sobering paper in Bioinformatics in which they made exactly that point:

Real-world comparison of CPU and GPU implementations of SNPrank: a network analysis tool for GWAS

Motivation: Bioinformatics researchers have a variety of programming languages and architectures at their disposal, and recent advances in graphics processing unit (GPU) computing have added a promising new option. However, many performance comparisons inflate the actual advantages of GPU technology. In this study, we carry out a realistic performance evaluation of SNPrank, a network centrality algorithm that ranks single nucleotide polymorphisms (SNPs) based on their importance in the context of a phenotype-specific interaction network. Our goal is to identify the best computational engine for the SNPrank web application and to provide a variety of well-tested implementations of SNPrank for Bioinformaticists to integrate into their research.

Results: Using SNP data from the Wellcome Trust Case Control Consortium genome-wide association study of Bipolar Disorder, we compare multiple SNPrank implementations, including Python, Matlab and Java as well as CPU versus GPU implementations. When compared with naïve, single-threaded CPU implementations, the GPU yields a large improvement in the execution time. However, with comparable effort, multi-threaded CPU implementations negate the apparent advantage of GPU implementations.

Kudos for that. They could have published yet another paper with the title “N-fold speedup of algorithm X by GPU computing”. Instead they honestly reported that if one puts the same effort into parallelizing the CPU implementation as it takes to write a massively parallel GPU implementation, one gets about the same speedup.

GPUs cost money

It gets worse. Almost all papers on GPU computing ignore the detail that powerful GPU cards are expensive. It is not surprising that you can make an algorithm run faster by buying a piece of hardware that costs as much as, if not more than, the computer itself. You could have spent that money on a second computer instead. What matters is not the performance but the price/performance ratio. You do not see anyone publishing papers with titles like “N-fold speedup of algorithm X by using N computers”.

Let us have a quick look at the hardware platforms used for benchmarking the two GPU-accelerated implementations of BLAST. Vouzis and Sahinidis used a server with an Intel Xeon X5650 CPU, which I was able to find for under $3000. For acceleration they used a Tesla C2050 GPU card, which costs another $2500. The hardware necessary to make BLAST ~1.5x faster made the computer ~1.8x more expensive. NVIDIA used a different setup consisting of a server equipped with an Intel i7-920, which I could find for $1500, and two Tesla C1060 GPU cards costing $1300 each. In other words, they used a 2.7x more expensive computer to make BLAST 2.5x faster at best. The bottom line is that the increase in hardware costs outstripped the speed increase in both cases.
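
The arithmetic behind these ratios is trivial, but for completeness, here is a quick sketch of how the numbers quoted above combine; the prices are the ones I found, and the speedups are the best-case figures discussed earlier.

```python
# Price/performance check using the figures quoted above:
# (base system cost in $, GPU cost in $, best-case speedup with the GPU)
setups = {
    "GPU-BLAST (Xeon X5650 server + Tesla C2050)": (3000, 2500, 1.5),
    "CUDA-BLAST (Core i7-920 server + 2x Tesla C1060)": (1500, 2 * 1300, 2.5),
}

for name, (base, gpu, speedup) in setups.items():
    cost_factor = (base + gpu) / base
    verdict = "worse" if cost_factor > speedup else "better"
    print(f"{name}: {cost_factor:.1f}x the cost for {speedup:.1f}x the speed "
          f"-> {verdict} price/performance than the CPU-only server")
```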

But what about the energy savings?

… I hear the die-hard GPU-computing enthusiasts cry. One of the selling points of GPU computing is that GPUs are much more energy efficient than CPUs. I will not question that the peak Gflops delivered by a GPU exceed those of a CPU using the same amount of energy. But does this theoretical number translate into higher energy efficiency when applied to a real-world problem such as BLAST?

The big fan on an NVIDIA Tesla GPU card is not there for show. (Picture from NVIDIA’s website.)

As anyone who has built a gaming computer in recent years can testify, modern-day GPUs use as much electrical power as a CPU, if not more. NVIDIA Tesla Computing Processors are no exception. The two Tesla C1060 cards in the machine used by NVIDIA to benchmark CUDA-BLAST use 187.8 Watts each, or 375.6 Watts in total. By comparison, a basic Intel i7 system like the one used by NVIDIA uses less than 200 Watts. The two Tesla C1060 cards thus nearly triple the power consumption while delivering at most 2.5 times the speed. Similarly, the single Tesla C2050 card used by Vouzis and Sahinidis uses 238 Watts, which is around the same as the power requirement of their base hexa-core Intel Xeon system, thereby doubling the power consumption for less than a 1.5-fold speedup. In other words, using either of the two GPU-accelerated versions of BLAST appears to be less energy efficient than using NCBI-BLAST.
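
The same back-of-the-envelope calculation can be made for power; in the sketch below, the base-system wattages are the rough estimates given above, so the exact numbers should be taken with a grain of salt.

```python
# Energy-efficiency check using the power figures quoted above:
# (rough base system Watts, added GPU Watts, best-case speedup with the GPU)
setups = {
    "CUDA-BLAST (i7 system + 2x Tesla C1060)": (200, 2 * 187.8, 2.5),
    "GPU-BLAST (Xeon system + Tesla C2050)": (238, 238, 1.5),
}

for name, (base_w, gpu_w, speedup) in setups.items():
    power_factor = (base_w + gpu_w) / base_w
    # Energy per search scales with power draw divided by speedup.
    energy_ratio = power_factor / speedup
    print(f"{name}: {power_factor:.1f}x the power for {speedup:.1f}x the speed "
          f"-> {energy_ratio:.1f}x the energy per search")
```

Both ratios come out above 1, i.e. more energy per search rather than less.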

Conclusions

Many of the claims regarding speedup of various bioinformatics algorithms using GPU computing are based on faulty comparisons. Typically, the massively parallel GPU implementation of an algorithm is compared to a serial version that makes use of only a fraction of the CPU’s compute power. Also, the considerable costs associated with GPU computing processors, both in terms of initial investment and power consumption, are usually ignored. Once all of this has been corrected for, GPU computing presently looks like a very bad deal.

There is a silver lining, though. First, everyone uses very expensive Tesla boards in order to achieve the highest possible speedup over the CPU implementations, whereas high-end gaming graphics cards might provide better value for money; whether they actually do remains to be seen. Second, certain specific problems, such as molecular dynamics, probably benefit more from GPU acceleration than BLAST does. In that case, you should be aware that you are buying hardware to speed up one specific type of analysis rather than bioinformatics analyses in general. Third, it is difficult to make predictions – especially about the future. It is possible that future generations of GPUs will change the picture, but that is no reason for buying expensive GPU accelerators today.

The message then is clear. If you are a bioinformatician who likes to live on the bleeding edge while wasting money and electricity, get a GPU compute server. If on the other hand you want something generally useful and well tested and quite a lot faster than a GPU compute server … get yourself some computers.

16 thoughts on “Commentary: The GPU computing fallacy”

  1. Pingback: Tweets that mention Commentary: The GPU computing fallacy « Buried Treasure -- Topsy.com

  2. Pingback: The GPU hype and bioinformatics algorithms | BioMCMC

  3. csgillespie

    Nice article. I’m rapidly coming to the same conclusion as you. A few comments though:

    1. “…if one puts the same effort into parallelizing the CPU implementation…” Another key point is that it is very **much** easier to write parallel CPU code (say with OpenMP) than it is to write CUDA code.

    2. You also need a dedicated machine to put your graphics card in.

    3. My only hesitancy is that most of the current “super-computers” are GPU machines. But I’m not really sure what that implies for **my** scientific computing needs.

  4. Lars Juhl Jensen Post author

    Thanks a lot. One thing that I didn’t make sufficiently clear: when I talked about putting similar effort into parallelizing the CPU implementation, I was not thinking along the lines of OpenMP and such. I was thinking about vectorizing algorithms using SSE, which is really the CPU equivalent of the vectorization that is done on GPUs. In other words, I am talking about how to squeeze the most performance out of a single CPU (and thereby improve price performance), not how to parallelize over multiple CPUs or compute nodes.

    The reason the latter is – in my opinion – not as interesting is that most bioinformatics problems are trivial to parallelize. You don’t even need shared-memory machines or message-passing libraries; you can simply partition your input data into N chunks and process these as N independent jobs on a cluster.

    1. jcmoure

      Memory and I/O bandwidth can be a concern. Using more cores inside a processor die in the trivial way you propose multiplies memory and I/O requirements, which can become a bottleneck (that’s the reason why linear speedup is not easy to achieve).
      A CPU-intensive program may not saturate memory bandwidth, but as you optimize CPU usage (with vectorization (SSE), careful handling of conditional branches, and so on), memory and I/O bandwidth may become the critical issue. That’s the case, for example, for read-mapping applications.

    2. Lars Juhl Jensen Post author

      You may be right for some applications, but when it comes to NCBI BLAST, the memory certainly does not become a bottleneck. When I say that running 8 copies of BLAST on a quad-core hyperthreading Intel processor gives approximately 6x speedup over a single copy, it is not a hypothetical number. It is the actual real-world speedup that I observe when doing just that.

    3. jcmoure

      I am currently working on a degree project that evaluates the performance of NCBI BLASTp, and we have already identified some improvements that can increase speed at the expense of higher memory bandwidth demands. Many bioinformatics applications do not fully exploit CPU resources.
      I just want to point out that optimizing is not a trivial task. GPUs have many drawbacks, and you cannot expect 100x speedups from them. But hard work and balanced CPU+GPU collaboration is still a promising path for improvements.

    4. Lars Juhl Jensen Post author

      I think we completely agree. Many programs are not written to optimally utilize the CPU, and one can thus easily get misleading results by comparing a heavily optimized GPU implementation to a suboptimal CPU implementation. And no, one should not expect 100x speedup – the big question is just why people repeatedly claim such massive speedups when one cannot expect it and a fair comparison shows that they do not attain it.

  5. cmbergman

    Very interesting analysis. Shortly after reading your post I stumbled across the following article from researchers at Intel that supports your argument:

    Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU (http://portal.acm.org/citation.cfm?id=1816021)

    These authors “analyzed the performance of an important set of throughput computing kernels on Intel Core i7-960 and Nvidia GTX280. We show that CPUs and GPUs are much closer in performance (2.5X) than the previously reported orders of magnitude difference. We believe many factors contributed to the reported large gap in performance, such as which CPU and GPU are used and what optimizations are applied to the code.”

  6. Pingback: Commentary: Intel’s take on GPU computing « Buried Treasure

  7. stochasticfly

    Very interesting post and comments.

    Ling & Benkrid (ICCS 2010) doi:10.1016/j.procs.2010.04.053 report only about 2x speedup of gapped BLAST on GPU compared to an optimized CPU implementation. They note, however, that a GPU implementation is likely to be more advantageous in the case of Smith-Waterman. This is explained by the higher operations-to-transfer ratio of SW.

    It seems BLAST simply does not make such good use of a GPU implementation.

  8. Pingback: Massively-Parallel Computational Analysis of Brain Recordings :: Future Technology Trends

  9. Ag Wole

    i agree whole-heartedly with the author. a gpu is a massively parallel vector processor. as such, the types of data processing that can deliver the 100X speedups generally have the following characteristics:

    1) immutable input vectors (int, float, double)
    2) zero (or near zero) dependency arithmetic calculations that can be performed in parallel for each element in the output vector.

    the simplest example of this is vector addition vC = vA[1,1,1,1] + vB[2,2,2,2], in which four additions are performed in parallel.

    while sequence alignment can be thought of as an independent operation, it generally does not satisfy the criteria that i listed above and, thus, does not map as well to gpu hardware as, say, massive matrix multiplication, which will easily give you a 200X speedup over multithreaded and vectorized cpu code.

  10. José Gerontolous

    I can confirm this.
    I have a non-symmetric algorithm, and the multithreaded model is much more efficient than the GPU model because of the necessary independence between the threads.
    In this case, I use 1600 threads of different sizes, and the execution time varies for each one. In this scenario, the GPU is not adequate to solve the problem.
    GPUs work very well when all the threads are symmetrical, permitting real parallelism, but not for general-purpose workloads.

  11. Pingback: Commentary: GPU vs. CPU comparison done right | Buried Treasure
