Category Archives: Commentary

Commentary: The sad tale of MutaDATABASE

The problem of bioinformatics web resources dying or moving is well known. It has been quantified in two interesting papers by Jonathan Wren entitled “404 not found: the stability and persistence of URLs published in MEDLINE” and “URL decay in MEDLINE — a 4-year follow-up study”. There is also a discussion on the topic at Biostar.

The resources discussed in these papers at least existed in an operational form at the time of publication, even if they have since perished. The same cannot be said about MutaDATABASE, which in 2011 was published in Nature Biotechnology as a correspondence entitled “MutaDATABASE: a centralized and standardized DNA variation database”. Fellow blogger Neil Saunders was quick to pick up on the fact that this database was an empty shell, but generously gave the authors the benefit of the doubt in his closing statement:

Who knows, MutaDatabase may turn out to be terrific. Right now though, it’s rather hard to tell. The database and web server issues of Nucleic Acids Research require that the tools described be functional for review and publication. Apparently, Nature Biotechnology does not.

Now, almost five years after the original publication, I think it is fair to follow up. Unfortunately, MutaDATABASE did not turn out to be terrific. Instead, it turned out just not to be. In March 2014, about three years after the publication, www.mutadatabase.org looked like this:
MutaDATABASE in 2014

By the end of 2015, the website had mutated into this:
MutaDATABASE in 2015

To quote Joel Spolsky: “Shipping is a feature. A really important feature. Your product must have it.” This also applies to biological databases and other bioinformatics resources, which is why journals would be wise never to publish any resource without this crucial feature.

Commentary: Every cloud does not have a silver lining

It is often said that every cloud has a silver lining. Whereas it may be true that most clouds have a silver lining, I can report observational evidence that this does not hold true for all clouds. Specifically, it is not true for clouds located between the observer and a solar eclipse.

Commentary: Does it even matter whether you use Microsoft Word or LaTeX?

Shortly before Christmas, PLOS ONE published a paper comparing the efficiency of using Microsoft Word and LaTeX for document preparation:

An Efficiency Comparison of Document Preparation Systems Used in Academic Research and Development

The choice of an efficient document preparation system is an important decision for any academic researcher. To assist the research community, we report a software usability study in which 40 researchers across different disciplines prepared scholarly texts with either Microsoft Word or LaTeX. The probe texts included simple continuous text, text with tables and subheadings, and complex text with several mathematical equations. We show that LaTeX users were slower than Word users, wrote less text in the same amount of time, and produced more typesetting, orthographical, grammatical, and formatting errors. On most measures, expert LaTeX users performed even worse than novice Word users. LaTeX users, however, more often report enjoying using their respective software. We conclude that even experienced LaTeX users may suffer a loss in productivity when LaTeX is used, relative to other document preparation systems. Individuals, institutions, and journals should carefully consider the ramifications of this finding when choosing document preparation strategies, or requiring them of authors.

This study has been criticized for being rigged in various ways to favor Word over LaTeX, which may or may not be the case. However, in my opinion, the much bigger question is this: does the efficiency of the document preparation system used by a researcher even matter?

Most readers of this blog are probably familiar with performance optimization of software. The crucial first step is to profile the program to identify the parts of the code in which most of the time is spent. The reason for profiling is that optimizing the other parts of the program will make hardly any difference to the overall runtime.
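
To make the analogy concrete, here is a minimal sketch of what such profiling might look like in Python, using the standard-library cProfile module; the functions are made-up stand-ins for a real program:

    import cProfile
    import pstats

    def parse_input():
        # Hypothetical input-parsing step
        return [str(i) for i in range(100000)]

    def analyze(records):
        # Hypothetical analysis step
        return sum(len(r) for r in records)

    def main():
        return analyze(parse_input())

    # Run the program under the profiler and list the five functions in which
    # most time is spent; only these hot spots are worth optimizing.
    profiler = cProfile.Profile()
    profiler.enable()
    main()
    profiler.disable()
    pstats.Stats(profiler).sort_stats("cumulative").print_stats(5)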

If we want to optimize the efficiency with which we publish research articles, I think it would be fruitful to adopt the same strategy. The first thing we need to do is thus to identify which parts of the process take the most time. In my experience, what takes by far the most time is the actual writing process, which includes reading related work that should be cited. The time spent on document preparation is insignificant compared to the time spent on authoring the text, and the efficiency of the software you use for this task is thus of little importance.

What, then, can you do to become more efficient at writing? My best advice is to start writing the manuscript as soon as you start on a project. Whenever you perform an analysis, document what you did in the Methods section. Whenever you read a paper that may be of relevance to the project, write a one- or two-sentence summary of it in the Introduction section and cite it. The text will look nothing like the final manuscript, but it will be an infinitely better starting point than that scary blank page.

Commentary: The 99% of scientific publishing

Last week, John P. A. Ioannidis from Stanford University and Kevin W. Boyack and Richard Klavans from SciTech Strategies, Inc. published an interesting analysis of scientific authorships. In the PLOS ONE paper “Estimates of the Continuously Publishing Core in the Scientific Workforce”, they describe a small, influential core of <1% of researchers who publish each and every year. This analysis appears to have caught the attention of many, including Erik Stokstad from Science Magazine, who wrote the short news story “The 1% of scientific publishing”.

You would be excused for thinking that I belong to the 1%. I published my first paper in 1998 and have published at least one paper every single year since then. However, it turns out that the 1% was defined as the researchers who had published at least one paper every year in the period 1996-2011. Since I published my first paper in 1998, I belong to the other 99%, together with everyone else who started their publishing career after 1996 or ended it before 2011.

Although the number 1% is making the headlines, the authors seem to be aware of the issue. Of the 15,153,100 researchers with publications in the period 1996-2011, only 150,608 published in all 16 years; however, the authors estimate that an additional 16,877 scientists published every year in the period 1997-2012. A similar number of continuously publishing scientists will have started their careers in each of the other years from 1998 to 2011. Similarly, the authors estimate that 9,673 researchers ended a long, unbroken publishing career in 2010, and presumably a similar number did so in each of the other years in the period 1996-2009. In my opinion, a better estimate is thus that 150,608 + 15*16,877 + 15*9,673 = 548,858 of the 15,153,100 authors have had or will have a 16-year unbroken chain of publications. That amounts to something in the 3-4% range.
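
For those who want to check the arithmetic, here is the back-of-the-envelope estimate as a minimal Python sketch; the input numbers are the ones quoted above, and the factor of 15 is my assumption that each of the other possible start and stop years contributes a similar cohort:

    total_authors = 15_153_100  # authors with publications in 1996-2011
    core_1996_2011 = 150_608    # published in every year 1996-2011
    per_start_year = 16_877     # estimated cohort publishing every year 1997-2012
    per_stop_year = 9_673       # estimated cohort whose unbroken run ended in 2010

    # 15 other possible start years (1997-2011) and 15 other stop years (1996-2010)
    estimate = core_1996_2011 + 15 * per_start_year + 15 * per_stop_year
    print(estimate)                                  # 548858
    print(round(100 * estimate / total_authors, 1))  # ~3.6%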

That number may still not sound impressive; however, this in no way implies that most researchers do not publish on a regular basis. To have a 16-year unbroken chain of publications, one almost has to stay in academia and become a principal investigator. Most people who publish at least one article and subsequently pursue a career in industry or teaching will count towards the 96-97%. And that is no matter how good a job they do, mind you.

Commentary: GPU vs. CPU comparison done right

I have in earlier posts complained about how some researchers, through unfair comparisons, make GPU computing look more attractive than it really is.

It is thus only appropriate to also commend those who do it right. As part of some ongoing research, I came across a paper published in Journal of Chemical Information and Modeling:

Anatomy of High-Performance 2D Similarity Calculations

Similarity measures based on the comparison of dense bit vectors of two-dimensional chemical features are a dominant method in chemical informatics. For large-scale problems, including compound selection and machine learning, computing the intersection between two dense bit vectors is the overwhelming bottleneck. We describe efficient implementations of this primitive as well as example applications using features of modern CPUs that allow 20–40× performance increases relative to typical code. Specifically, we describe fast methods for population count on modern x86 processors and cache-efficient matrix traversal and leader clustering algorithms that alleviate memory bandwidth bottlenecks in similarity matrix construction and clustering. The speed of our 2D comparison primitives is within a small factor of that obtained on GPUs and does not require specialized hardware.

Briefly, the authors compare the speed with which fingerprint-based chemical similarity searches can be performed on CPUs and GPUs. In contrast to so many others, they went to great lengths to give a fair picture of the relative performance:

  • Instead of using multiple very expensive Nvidia Tesla boards, they used an Nvidia GTX 480. This card cost roughly $500 when released and was the fastest gaming card available at the time.
  • For comparison, they used an Intel i7-920. This CPU cost approximately $300 when released and was a high-end consumer product.
  • They compared the GPU implementation of the algorithm to a highly optimized CPU implementation. The CPU implementation makes use of SSE4.2 instructions available on modern Intel CPUs and is multi-threaded to utilize all CPU cores.

The end result was that the GPU implementation gives a respectable but unexceptional 5x speedup over the pure CPU implementation. If one further takes into account that the GPU probably accounts for 40% of the cost of the whole computer, this reduces to roughly a 3x improvement in the price-performance ratio.
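
As a minimal sketch of that correction, assuming the GPU makes up roughly 40% of the total machine price:

    gpu_speedup = 5.0        # GPU vs. optimized multi-threaded CPU implementation
    gpu_cost_fraction = 0.4  # rough assumption: GPU is ~40% of the machine price

    # A machine with the GPU costs ~1.67x as much as the same machine without it
    cost_factor = 1.0 / (1.0 - gpu_cost_fraction)
    print(round(gpu_speedup / cost_factor, 1))  # ~3.0x better price-performance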

The authors conclude:

In summary: GPU coding requires one to think of the hardware, but high-speed CPU programming is the same; spending time optimizing CPU code at the same level of architectural complexity that would be used on the GPU often allows one to do quite well.

I can only agree wholeheartedly.

Commentary: Are other women a woman’s worst enemies in science?

It is clear that in science, we have a gender bias among leaders. It is my impression that most people think this is due to a combination of men and women having different priorities in life and high-ranking male professors favoring their own gender. Conversely, I have never heard anyone dare to suggest that women may be their own worst enemies in this context.

Benenson and coworkers from Emmanuel College have just published an interesting study in Current Biology on collaborations between full professors and assistant professors entitled “Rank influences human sex differences in dyadic cooperation”.

By tabulating the joint publications, they found 76 same-sex publications from male full professors, compared to a random expectation of 61 such publications. By contrast, they found only 14 same-sex publications from female full professors, against a random expectation of 29. In other words, whereas male full professors collaborated 25% more with male assistant professors than expected, female full professors collaborated more than 50% less with female assistant professors than expected. The authors conclude:

Our results are consistent with observations suggesting that social structure takes differing forms for human males and females. Males’ tendency to interact in same-gender groups makes them more prone to cooperation with asymmetrically ranked males. In contrast, females’ tendency to restrict their same-gender interactions to equally ranked individuals make them more reluctant to cooperate with asymmetrically ranked females.

In other words, high-ranking professors of both genders show a bias towards collaborating with lower-ranking male colleagues rather than lower-ranking female colleagues. If anything, that bias appears to be stronger for high-ranking female professors than for high-ranking male professors.

Commentary: Coffee, a prerequisite for research?

Yesterday, I stumbled upon two links that I found interesting. The first was the map-based data visualization blog post 40 Maps That Will Help You Make Sense of the World, in which maps 24 and 28 hint at a correlation (click for larger interactive versions):

Number of Researchers per million inhabitants by Country

Current Worldwide Annual Coffee Consumption per capita

The first map shows the number of researchers per million inhabitants in each country. The second map shows the number of kilograms of coffee consumed per capita per year. As ChartsBin allows you to download the data behind each map, I did so and produced a scatter plot that confirms the strong correlation (click for larger version):

coffee_vs_researchers
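
For anyone who wants to reproduce the plot, a minimal sketch along these lines should do; it assumes the two ChartsBin exports have been saved as researchers.csv and coffee.csv, each with a Country and a Value column (the file and column names are my assumptions, not necessarily what ChartsBin calls them):

    import pandas as pd
    import matplotlib.pyplot as plt

    # Hypothetical file names for the two ChartsBin data exports
    researchers = pd.read_csv("researchers.csv")  # researchers per million inhabitants
    coffee = pd.read_csv("coffee.csv")            # kg coffee per capita per year

    # Keep only the countries present in both datasets
    merged = researchers.merge(coffee, on="Country", suffixes=("_researchers", "_coffee"))

    # A rank-based correlation is robust to the skewed distributions
    print(merged[["Value_researchers", "Value_coffee"]].corr(method="spearman"))

    plt.scatter(merged["Value_coffee"], merged["Value_researchers"])
    plt.xlabel("Coffee consumption (kg per capita per year)")
    plt.ylabel("Researchers per million inhabitants")
    plt.savefig("coffee_vs_researchers.png")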

This confirms my view that the coffee machine is the most important piece of hardware in a bioinformatics group. Bioinformaticians with coffee can do work even without a computer, but bioinformaticians without coffee are unable to work, no matter how good their computers are.

One should of course be careful not to jump to conclusions about causality based on correlation. This leads me to the second link: a new study published in Nature Neuroscience, which shows that “Post-study caffeine administration enhances memory consolidation in humans”.

I optimistically await a similar study confirming the correlation described in “Chocolate Consumption, Cognitive Function, and Nobel Laureates”, published last year in the New England Journal of Medicine.

Commentary: Intel’s take on GPU computing

A week or two ago, I published a post in which I argued that most papers reporting order-of-magnitude speedups of bioinformatics algorithms through the use of graphics processors (GPUs) did so based on straw-man comparisons:

  • Massively parallel GPU implementations were compared to CPU implementations that did not make full use of the multi-core and SIMD (Single Instruction, Multiple Data) features.
  • The performance comparisons were done using very expensive GPU processing cards that cost as much, if not more, than the host computers.

It turns out that Lee and coworkers from Intel Corporation have performed a comparison that addresses both of these issues (thanks to Casey Bergman for making me aware of this). It appeared in 2010 in the proceedings of the 37th International Symposium on Computer Architecture:

Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU and GPU

Recent advances in computing have led to an explosion in the amount of data being generated. Processing the ever-growing data in a timely manner has made throughput computing an important aspect for emerging applications. Our analysis of a set of important throughput computing kernels shows that there is an ample amount of parallelism in these kernels which makes them suitable for today’s multi-core CPUs and GPUs. In the past few years there have been many studies claiming GPUs deliver substantial speedups (between 10X and 1000X) over multi-core CPUs on these kernels. To understand where such large performance difference comes from, we perform a rigorous performance analysis and find that after applying optimizations appropriate for both CPUs and GPUs the performance gap between an Nvidia GTX280 processor and the Intel Core i7 960 processor narrows to only 2.5x on average. In this paper, we discuss optimization techniques for both CPU and GPU, analyze what architecture features contributed to performance differences between the two architectures, and recommend a set of architectural features which provide significant improvement in architectural efficiency for throughput kernels.

Without wanting to question the integrity of the researchers, I read the paper with a critical mind for an obvious reason: they work for Intel, which is a major player in the CPU market but not in the GPU market. One always needs to keep a critical mind when economic interests are involved. However, it is my clear impression that the researchers did their very best to make each and every algorithm run as fast as possible on the GPU as well as on the CPU.

The only factor I found that may have tipped the balance a bit in favor of the CPU was the choice of CPU and GPU. Whereas the Intel Core i7 960 was launched in October 2009, the Nvidia GTX 280 was launched in June 2008. That is a difference of 16 months, which by application of Moore’s law skews the results by almost 2x in favor of the CPU. The average speedup provided by a high-end gaming GPU over a high-end CPU on this selection of algorithms is thus likely to be 4-5x. However, this advantage drops to about 3-4x if one corrects for the additional cost of the GPU, and to around 2x if one instead corrects for power consumption rather than for the initial investment.
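
As a minimal sketch of that launch-date correction, assuming performance roughly doubles every 18 months:

    months_between_launches = 16  # GTX 280 (June 2008) vs. Core i7 960 (October 2009)
    doubling_time = 18            # rough Moore's-law assumption, in months

    skew = 2 ** (months_between_launches / doubling_time)
    print(round(skew, 2))        # ~1.85x advantage from the CPU being newer
    print(round(2.5 * skew, 1))  # ~4.6x: the reported 2.5x gap corrected for launch date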

The findings of Lee and coworkers are consistent with my own conclusions, which were based on comparing two GPU-accelerated implementations of BLAST. In the case of BLAST, the price-performance ratio of the GPU implementations ended up worse than that of the CPU implementation. Lee and coworkers found that for a wide variety of highly data-parallel algorithms (none of which are directly related to bioinformatics), only a modest speedup was attained. Not a single algorithm got anywhere close to the promised 100x or 1000x speedups, and a couple of algorithms ended up being slower on the GPU than on the CPU. This confirms my view that GPUs are presently not an attractive alternative to CPUs for most scientific computing needs.

Commentary: The GPU computing fallacy

Modern graphics processors (GPUs) deliver considerably more brute force computational power than traditional processors (CPUs). With NVIDIA’s launch of CUDA, general purpose GPU computing has become greatly simplified, and many research groups around the world have consequently experimented with how one can harvest the power of GPUs to speed up scientific computing.

This is also the case for bioinformatics algorithms. NVIDIA advertises a number of applications that have been adapted to make use of GPUs, including several applications for bioinformatics and life sciences, which supposedly speed up bioinformatics algorithms by an order of magnitude or more.

In this commentary I will focus primarily on two GPU-accelerated versions of NCBI-BLAST, namely CUDA-BLAST and GPU-BLAST. I do so not to specifically criticize these two programs, but because BLAST is the single most commonly used bioinformatics tool and thus a prime example for illustrating whether GPU acceleration of bioinformatics algorithms pays off.

Whereas CUDA-BLAST to the best of my knowledge has not been published in a peer-reviewed journal, GPU-BLAST is described in the following Bioinformatics paper by Vouzis and Sahinidis:

GPU-BLAST: using graphics processors to accelerate protein sequence alignment

Motivation: The Basic Local Alignment Search Tool (BLAST) is one of the most widely used bioinformatics tools. The widespread impact of BLAST is reflected in over 53 000 citations that this software has received in the past two decades, and the use of the word ‘blast’ as a verb referring to biological sequence comparison. Any improvement in the execution speed of BLAST would be of great importance in the practice of bioinformatics, and facilitate coping with ever increasing sizes of biomolecular databases.

Results: Using a general-purpose graphics processing unit (GPU), we have developed GPU-BLAST, an accelerated version of the popular NCBI-BLAST. The implementation is based on the source code of NCBI-BLAST, thus maintaining the same input and output interface while producing identical results. In comparison to the sequential NCBI-BLAST, the speedups achieved by GPU-BLAST range mostly between 3 and 4.

It took me a while to figure out from where the 3-4x speedup came. I eventually found it in Figure 4B of the paper. GPU-BLAST achieves an approximately 3.3x speedup over NCBI-BLAST in only one situation, namely if it is used to perform ungapped sequence similarity searches and only one of six CPU cores is used:

Speedup of GPU-BLAST over NCBI-BLAST as function of number of CPU threads used. Figure by Vouzis and Sahinidis.

The vast majority of use cases for BLAST require gapped alignments, however, in which case GPU-BLAST never achieves even a 3x speedup on the hardware used by the authors. Moreover, nobody concerned about the speed of BLAST would buy a multi-core server and leave all but one core idle. The most relevant speedup is thus the speedup achieved by using all CPU cores and the GPU vs. only the CPU cores, in which case GPU-BLAST achieves only a 1.5x speedup over NCBI-BLAST.

The benchmark by NVIDIA does not fare much better. Their 10x speedup comes from comparing CUDA-BLAST to NCBI-BLAST using only a single CPU core. The moment one compares to NCBI-BLAST running with 4 threads on their quad-core Intel i7 CPU, the speedup drops to 3x. However, the CPU supports hyperthreading. To get the full performance out of it, one should thus presumably run NCBI-BLAST with 8 threads, which I estimate would reduce the speedup of CUDA-BLAST over NCBI-BLAST to 2.5x at best.

Even these numbers are not entirely fair. They are based on the 3.5x or 4x speedup that one gets by running a single instance of BLAST with 4 or 6 threads, respectively. The typical situation when the speed of BLAST becomes relevant, however, is when you have a large number of sequences that need to be searched against a database. This is an embarrassingly parallel problem; by partitioning the query sequences and running multiple single-threaded instances of BLAST, you can get a 6x speedup on either platform (personal experience shows that running 8 simultaneous BLAST searches on a quad-core CPU with hyperthreading gives approximately 6x speedup).
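
To illustrate how simple this is in practice, here is a minimal sketch that launches eight single-threaded searches in parallel; it assumes the NCBI BLAST+ blastp command is installed and that the queries have already been split into eight FASTA files, and the file and database names are made up:

    import subprocess

    # Hypothetical query chunks, e.g. produced by splitting one large FASTA file
    chunks = ["queries.part_%d.fasta" % i for i in range(8)]

    # One single-threaded BLAST search per chunk; the searches are completely
    # independent, which is what makes the problem embarrassingly parallel
    processes = [
        subprocess.Popen([
            "blastp",
            "-query", chunk,
            "-db", "nr",
            "-num_threads", "1",
            "-out", chunk + ".out",
        ])
        for chunk in chunks
    ]

    # Wait for all searches to finish
    for p in processes:
        p.wait()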

It is not just BLAST

Optimists could argue that perhaps BLAST is just one of few bioinformatics problems that do not benefit from GPU computing. However, reading the recent literature, I think that GPU-BLAST is a representative example. Most publications about GPU acceleration of algorithms relevant to bioinformatics report speedups of at most 10x. Typically, this performance number represents the speedup that can be attained relative to a single-threaded version of the program running on the CPU, hence leaving most of the CPU cores standing idle. Not exactly a fair comparison.

Davis et al. recently published a sobering paper in Bioinformatics in which they made exactly that point:

Real-world comparison of CPU and GPU implementations of SNPrank: a network analysis tool for GWAS

Motivation: Bioinformatics researchers have a variety of programming languages and architectures at their disposal, and recent advances in graphics processing unit (GPU) computing have added a promising new option. However, many performance comparisons inflate the actual advantages of GPU technology. In this study, we carry out a realistic performance evaluation of SNPrank, a network centrality algorithm that ranks single nucleotide polymorphisms (SNPs) based on their importance in the context of a phenotype-specific interaction network. Our goal is to identify the best computational engine for the SNPrank web application and to provide a variety of well-tested implementations of SNPrank for Bioinformaticists to integrate into their research.

Results: Using SNP data from the Wellcome Trust Case Control Consortium genome-wide association study of Bipolar Disorder, we compare multiple SNPrank implementations, including Python, Matlab and Java as well as CPU versus GPU implementations. When compared with naïve, single-threaded CPU implementations, the GPU yields a large improvement in the execution time. However, with comparable effort, multi-threaded CPU implementations negate the apparent advantage of GPU implementations.

Kudos for that. They could have published yet another paper with the title “N-fold speedup of algorithm X by GPU computing”. Instead they honestly reported that if one puts the same effort into parallelizing the CPU implementation as it takes to write a massively parallel GPU implementation, one gets about the same speedup.

GPUs cost money

It gets worse. Almost all papers on GPU computing ignore the detail that powerful GPU cards are expensive. It is not surprising that you can make an algorithm run faster by buying a piece of hardware that costs as much as, if not more than, the computer itself. You could have spent that money on a second computer instead. What matters is not the performance but the price/performance ratio. You do not see anyone publishing papers with titles like “N-fold speedup of algorithm X by using N computers”.

Let us have a quick look at the hardware platforms used for benchmarking the two GPU-accelerated implementations of BLAST. Vouzis and Sahinidis used a server with an Intel Xeon X5650 CPU, which I was able to find for under $3000. For acceleration they used a Tesla C2050 GPU card, which costs another $2500. The hardware necessary to make BLAST ~1.5x faster made the computer ~1.8x more expensive. NVIDIA used a different setup consisting of a server equipped with an Intel i7-920, which I could find for $1500, and two Tesla C1060 GPU cards costing $1300 each. In other words, they used a 2.7x more expensive computer to make BLAST 2.5x faster at best. The bottom line is that the increase in hardware costs outstripped the speed increase in both cases.
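
The cost arithmetic behind those two statements, as a minimal sketch using the approximate prices quoted above:

    # Vouzis and Sahinidis: Xeon X5650 server plus one Tesla C2050
    print(round((3000 + 2500) / 3000, 1))      # ~1.8x the price for a ~1.5x speedup

    # NVIDIA: Core i7-920 system plus two Tesla C1060 cards
    print(round((1500 + 2 * 1300) / 1500, 1))  # ~2.7x the price for at most a 2.5x speedup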

But what about the energy savings?

… I hear the die-hard GPU-computing enthusiasts cry. One of the selling arguments for GPU computing is that GPUs are much more energy efficient than CPUs. I will not question the fact that the peak Gflops delivered by a GPU exceeds that of CPUs using the same amount of energy. But does this theoretical number translate into higher energy efficiency when applied to a real-world problem such as BLAST?

The big fan on an NVIDIA Tesla GPU card is not there for show. (Picture from NVIDIA’s website.)

As anyone who has built a gaming computer in recent years can testify, modern-day GPUs use as much electrical power as a CPU, if not more. NVIDIA Tesla computing processors are no exception. The two Tesla C1060 cards in the machine used by NVIDIA to benchmark CUDA-BLAST use 187.8 Watts each, or 375.6 Watts in total. By comparison, a basic Intel i7 system like the one used by NVIDIA uses less than 200 Watts. The two Tesla C1060 cards thus triple the power consumption while delivering at most 2.5 times the speed. Similarly, the single Tesla C2050 card used by Vouzis and Sahinidis uses 238 Watts, which is around the same as the power requirement of their base hexa-core Intel Xeon system, thereby doubling the power consumption for less than a 1.5-fold speedup. In other words, using either of the two GPU-accelerated versions of BLAST appears to be less energy efficient than using NCBI-BLAST.

Conclusions

Many of the claims regarding speedup of various bioinformatics algorithms using GPU computing are based on faulty comparisons. Typically, the massively parallel GPU implementation of an algorithm is compared to a serial version that makes use of only a fraction of the CPU’s compute power. Also, the considerable costs associated with GPU computing processors, both in terms of initial investment and power consumption, are usually ignored. Once all of this has been corrected for, GPU computing presently looks like a very bad deal.

There is a silver lining, though. First, everyone uses very expensive Tesla boards in order to achieve the highest possible speedup over the CPU implementations, whereas high-end gaming graphics cards might provide better value for money. However, that remains to be demonstrated. Second, certain specific problems such as molecular dynamics probably benefit more from GPU acceleration than BLAST does. In that case, you should be aware that you are buying hardware to speed up one specific type of analysis rather than bioinformatics analyses in general. Third, it is difficult to make predictions – especially about the future. It is possible that future generations of GPUs will change the picture, but that is no reason for buying expensive GPU accelerators today.

The message then is clear. If you are a bioinformatician who likes to live on the bleeding edge while wasting money and electricity, get a GPU compute server. If, on the other hand, you want something generally useful, well tested, and quite a lot faster than a GPU compute server … get yourself some computers.

Commentary: When Open Access isn’t

This week, PLoS ONE published an interesting paper by Bo-Christer Björk and coworkers on the free global availability of articles from scientific journals. One of the principal findings in this study is that 20.4% of articles published in 2008 are now available as Open Access (OA):

Open Access to the Scientific Journal Literature: Situation 2009

Background: The Internet has recently made possible the free global availability of scientific journal articles. Open Access (OA) can occur either via OA scientific journals, or via authors posting manuscripts of articles published in subscription journals in open web repositories. So far there have been few systematic studies showing how big the extent of OA is, in particular studies covering all fields of science.

Methodology/Principal Findings: The proportion of peer reviewed scholarly journal articles, which are available openly in full text on the web, was studied using a random sample of 1837 titles and a web search engine. Of articles published in 2008, 8,5% were freely available at the publishers’ sites. For an additional 11,9% free manuscript versions could be found using search engines, making the overall OA percentage 20,4%. Chemistry (13%) had the lowest overall share of OA, Earth Sciences (33%) the highest. In medicine, biochemistry and chemistry publishing in OA journals was more common. In all other fields author-posted manuscript copies dominated the picture.

Conclusions/Significance: The results show that OA already has a significant positive impact on the availability of the scientific journal literature and that there are big differences between scientific disciplines in the uptake. Due to the lack of awareness of OA-publishing among scientists in most fields outside physics, the results should be of general interest to all scholars. The results should also interest academic publishers, who need to take into account OA in their business strategies and copyright policies, as well as research funders, who like the NIH are starting to require OA availability of results from research projects they fund. The method and search tools developed also offer a good basis for more in-depth studies as well as longitudinal studies.

Having just set up a mirror of the OA subset of PubMed Central, I know that it contains only ~10% of the articles deposited in PubMed Central and only ~1% of the articles indexed by PubMed. It was thus with equal doses of joy and scepticism that I read the numbers reported by Bo-Christer Björk and coworkers.

It soon became clear to me that the study did not adhere to the OA definition by the Budapest Open Access Initiative, which is as follows:

By ‘open access’ to this literature, we mean its free availability on the public internet, permitting any users to read, download, copy, distribute, print, search, or link to the full texts of these articles, crawl them for indexing, pass them as data to software, or use them for any other lawful purpose, without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself. The only constraint on reproduction and distribution, and the only role for copyright in this domain, should be to give authors control over the integrity of their work and the right to be properly acknowledged and cited.

Björk et al. do not define what exactly they mean by OA. However, from reading their paper it is pretty clear that any article for which they can get hold of free full text is counted as OA. The license under which the copy is distributed does not seem to matter, and they thus count the 90% of articles in PubMed Central that are published under non-OA licenses as OA. It does not even seem to matter whether the free full text is legal or not, implying that any article of which an illegal copy can be found somewhere on the web is counted as OA.

I have heard of Gold OA and Green OA. It is tempting to call this Black OA. But I won’t. Because it just isn’t OA.