In the previous blog post I described how the MCL algorithm can sometimes produce unnatural clusters with disconnected parts. The C implementation of MCL has an option to suppress this behavior (
--force-connected=y), but I suspect that it is rarely used. I have thus taken a closer look at some notable applications of MCL in bioinformatics to see if unnatural clusters arise in real data sets.
Here I will focus on OrthoMCL-DB, which is a database of orthologous groups of protein sequences. These were constructed by applying the MCL algorithm to the normalized results of an all-against-all BLAST search of the protein sequences.
To check the connectivity of the resulting orthologous groups, I downloaded OrthoMCL version 4 including the 13+ GB of gzipped BLAST results that formed the basis for the MCL clustering. I wish to thanks to the OrthoMCL-DB team for being very helpful and making this large data set available to me.
A few Perl scripts and CPU hours later, Albert Palleja and I had extracted the BLAST network for each of the 116,536 orthologous groups and performed single-linkage clustering to check if any of them contained disconnected parts. We found that this was the case for the following 28 orthologous groups:
For convenience, the orthologous groups are linked to the corresponding web pages in OrthoMCL-DB, which enable viewing of Pfam domain architectures and multiple sequence alignments. Cursory inspection suggests that the majority of the of the sequences listed in the table do not belong to the orthologous groups in question.
Of the 28 orthologous groups, 24 groups contain a single protein with no BLAST hits to other group members, 2 groups each contain 2 such singletons, and the remaining 2 groups each contain 2 proteins that show weak similarity to each other but not to any other group members. The latter proteins are highlighted in red.
In summary, this analysis shows that the unnatural clustering by MCL reported for a toy example in the previous post also affects the results of real-world bioinformatics applications of the algorithm.