A week or two ago, I published a post in which I argued that most papers reporting order-of-magnitude speedups of bioinformatics algorithms on graphics processors (GPUs) did so based on straw-man comparisons:
- Massively parallel GPU implementations were compared to CPU implementations that did not make full use of the multi-core and SIMD (Single Instruction, Multiple Data) features of modern CPUs (see the sketch after this list).
- The performance comparisons were done using very expensive GPU processing cards that cost as much as, if not more than, the host computers.
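To make the first point concrete, here is a minimal sketch of the difference between a straw-man CPU baseline and one that actually uses the CPU's cores and vector units. The SAXPY-style kernel and the OpenMP pragma are my own illustration, not code from any of the papers in question; Lee and coworkers' abstract only says that they applied "optimizations appropriate for both CPUs and GPUs".

```c
#include <stddef.h>

/* Straw-man baseline: one thread, no explicit use of SIMD. */
void saxpy_naive(size_t n, float a, const float *x, float *y) {
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* Fairer baseline: all cores plus the SIMD lanes of each core
 * (compile with something like: gcc -O3 -fopenmp). */
void saxpy_multicore_simd(size_t n, float a, const float *x, float *y) {
    #pragma omp parallel for simd
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

On a typical quad-core CPU with 4-wide or 8-wide SIMD units, the gap between these two versions alone can account for much of a reported "order of magnitude" GPU speedup.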
It turns out that Lee and coworkers from Intel Corporation have performed a comparison that addresses both of these issues (thanks to Casey Bergman for making me aware of this). It appeared in 2010 in the proceedings of the 37th International Symposium on Computer Architecture:
Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU and GPU
Recent advances in computing have led to an explosion in the amount of data being generated. Processing the ever-growing data in a timely manner has made throughput computing an important aspect for emerging applications. Our analysis of a set of important throughput computing kernels shows that there is an ample amount of parallelism in these kernels which makes them suitable for today’s multi-core CPUs and GPUs. In the past few years there have been many studies claiming GPUs deliver substantial speedups (between 10X and 1000X) over multi-core CPUs on these kernels. To understand where such large performance difference comes from, we perform a rigorous performance analysis and find that after applying optimizations appropriate for both CPUs and GPUs the performance gap between an Nvidia GTX280 processor and the Intel Core i7 960 processor narrows to only 2.5x on average. In this paper, we discuss optimization techniques for both CPU and GPU, analyze what architecture features contributed to performance differences between the two architectures, and recommend a set of architectural features which provide significant improvement in architectural efficiency for throughput kernels.
Without wanting to question the integrity of the researchers, I read the paper with a critical mind for an obvious reason: they work for Intel, which is a major player in the CPU market but not in the GPU market, and one always needs to be critical when economic interests are involved. However, it is my clear impression that the researchers did their very best to make each and every algorithm run as fast as possible on both the GPU and the CPU.
The only factor I found that may have tipped the balance a bit in favor of the CPU was the choice of CPU and GPU. Whereas the Intel Core i7 960 was launched in October 2009, the Nvidia GTX 280 was launched in June 2008. That is a difference of 16 months, which by application of Moore's law skews the results by almost 2x in favor of the CPU. The average speedup provided by a high-end gaming GPU over a high-end CPU on this selection of algorithms is thus likely to be 4-5x. However, this advantage drops to about 3-4x if one corrects for the additional cost of the GPU, and to around 2x if one corrects for energy efficiency rather than for the initial investment.
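For the record, here is the back-of-the-envelope arithmetic behind the 4-5x figure, as a small C snippet. The 18-month doubling period is my assumption; the paper itself makes no such correction and simply reports the 2.5x average. The further corrections for purchase price and energy efficiency are applied on top of this adjusted figure and are not recomputed here.

```c
#include <stdio.h>
#include <math.h>

int main(void) {
    const double measured_speedup  = 2.5;  /* GTX 280 over Core i7 960, paper's average */
    const double launch_gap_months = 16.0; /* June 2008 vs. October 2009 */
    const double doubling_months   = 18.0; /* assumed Moore's-law doubling period */

    /* Scale the measured speedup by how much the older GPU is handicapped. */
    double adjusted = measured_speedup * pow(2.0, launch_gap_months / doubling_months);
    printf("age-adjusted GPU speedup: about %.1fx\n", adjusted); /* ~4.6x */
    return 0;
}
```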
The findings of Lee and coworkers are consistent with my own conclusions, which were based on comparing two GPU-accelerated implementations of BLAST. In the case of BLAST, the price-performance of the GPU implementations ended up worse than that of the CPU implementation. Lee and coworkers found that for a wide variety of highly data-parallel algorithms (none of which are directly related to bioinformatics), only a modest speedup was attained. Not a single algorithm came anywhere close to the promised 100x or 1000x speedups, and a couple of algorithms ended up slower on the GPU than on the CPU. This confirms my view that GPUs are presently not an attractive alternative to CPUs for most scientific computing needs.