The MCL (Markov CLustering) algorithm was invented/discovered by Stijn van Dongen and was published in 2000. It has since become highly popular in bioinformatics and has proven to perform well on a variety of different problems.
It was also the method of choice when my postdoc Albert Palleja needed to cluster the human interaction network from the STRING database. However, we got strange results. More specifically, we observed that some clusters contained proteins that had no interactions with any other proteins within the same cluster. I call these unnatural clusters; this should be seen as a contrast to natural clusters, which are characterized by the presence of many edges between the members of a cluster.
After we had spent a week unsuccessfully trying to find out what we were doing wrong, I finally asked myself if it could be that we were not doing anything wrong. Might it be that applying the MCL algorithm to a protein interaction network can result in clusters of non-interacting proteins?
To test this, I constructed the following toy network consisting of only 10 nodes and 12 edges:
Assigning a weight of 1 to all edges and running this network through MCL using an inflation factor (the key parameter in the MCL algorithm) between 1.734 and 3.418 yields five clusters. In the figure below, the nodes are colored according to which cluster they belong to:
Note the black cluster which consists of two proteins, X and Y, despite the two nodes only being connected via nodes that are not part of the same cluster. This example clearly shows that the MCL algorithm is indeed capable of producing unnatural clusters containing nodes with no direct edges to any other members in the cluster.
In my view this is not as such a error in the the MCL algorithm. The algorithm is based on simulation of flow in the graph. The nodes X and Y are clustered due to the strong flow between them via nodes A, C, E, and G. However, I think it is fair to say that this behavior will catch many users by surprise and that it can give rise to misleading results when applying MCL to certain types of networks.
Edit: I suspect that this is the same issue that was reported on the Mcl-users mailing list by Sungwon Jung. Using the
--force-connected=y option prevents the undesirable clustering of X and Y.