Analysis: Markov clustering and the case of the unnatural clusters

The MCL (Markov CLustering) algorithm was invented/discovered by Stijn van Dongen and was published in 2000. It has since become highly popular in bioinformatics and has proven to perform well on a variety of different problems.

It was also the method of choice when my postdoc Albert Palleja needed to cluster the human interaction network from the STRING database. However, we got strange results. More specifically, we observed that some clusters contained proteins that had no interactions with any other proteins within the same cluster. I call these unnatural clusters; this should be seen as a contrast to natural clusters, which are characterized by the presence of many edges between the members of a cluster.

After we had spent a week unsuccessfully trying to find out what we were doing wrong, I finally asked myself if it could be that we were not doing anything wrong. Might it be that applying the MCL algorithm to a protein interaction network can result in clusters of non-interacting proteins?

To test this, I constructed the following toy network consisting of only 10 nodes and 12 edges:

Assigning a weight of 1 to all edges and running this network through MCL using an inflation factor (the key parameter in the MCL algorithm) between 1.734 and 3.418 yields five clusters. In the figure below, the nodes are colored according to which cluster they belong to:

Note the black cluster which consists of two proteins, X and Y, despite the two nodes only being connected via nodes that are not part of the same cluster. This example clearly shows that the MCL algorithm is indeed capable of producing unnatural clusters containing nodes with no direct edges to any other members in the cluster.

In my view this is not as such a error in the the MCL algorithm. The algorithm is based on simulation of flow in the graph. The nodes X and Y are clustered due to the strong flow between them via nodes A, C, E, and G. However, I think it is fair to say that this behavior will catch many users by surprise and that it can give rise to misleading results when applying MCL to certain types of networks.

Edit: I suspect that this is the same issue that was reported on the Mcl-users mailing list by Sungwon Jung. Using the --force-connected=y option prevents the undesirable clustering of X and Y.

9 thoughts on “Analysis: Markov clustering and the case of the unnatural clusters

  1. betascience

    Great post. I have been using MCL quite a bit lately for protein family clustering and I wasn’t aware of this situation. I am going to try the –force-conected=y option to see if this an issue with my “real” data.

    Reply
  2. Pingback: Analysis: Markov clustering and the case of the nonhomologous orthologs « Buried Treasure

  3. Pingback: Analysis: Markov clustering and the case of the unsupported protein complexes « Buried Treasure

  4. Pingback: My Data Mining Weblog » Which Data Mining Algorithm Is Right For You?

  5. vonfrisch

    Dear all, very interesting post! I tried the toy example in different ways using an MCL implementation in matlab that can be found at the official mcl website. What I found is that the problem happens ONLY if the input matrix is symmetric. On the contrary, if the matrix is asymmetric you get the right clustering (all nodes in the same cluster). Concerning also the other post on orthoMCL, this means that when the -force–connected option has not been used results could be wrong?

    Reply
    1. Lars Juhl Jensen Post author

      I could be that it only can happen for symmetric cases, but I am no certain. I have not tried to see yet if it is possible to make asymmetric examples for which the odd behavior of MCL occurs.

      Regarding OrthoMCL, wrong examples can occur when -force-connected is not used, i.e. non-homologous sequences can end up in “artificial” clusters together. It should be said, though, that those cases are extremely rare in OrthoMCL.

  6. Marcelo Gomes

    Sorry to “revamp” an “old” post, but this tread on cluster is very informative to whomever comes across the MCL algorithm.
    I just wanted to stress a phrase by the end of the post: this is not an “error” of the algorithm per se. We always have to remember that any cluster/community detection algorithm has no idea of what you network “means”. So it tries to find patterns in the network and draw communities based on those patterns. Different algorithms look for different patterns.
    Whether the communities found make sense or not depends on the network “meaning”, but that is for the user to judge, not the algorithm. Since communities have different meanings (and patterns) depending on the network context.
    The example given (x,y), (a,b), (c,d), …, is perfect. In some cases, that separation actually makes perfect sense. For instance, imagine the network represents influence. Nodes are people and edges are communication or hierarchy in a company. X and Y are clear “leaders” while the other pairs are pools/chains. So, although the communities found have different characteristics, (X,Y) “leaders”, the others are “friends” or “followers”, it makes perfect sense to separate them in that way. If you think of them as power grid, again it makes sense since X and Y would be “centers of distribution” while the others are towers along different streets.

    On the other hand, if what you want is communities of “direct influence”, than this is structure is “wrong” since X and Y do not influence each other directly. But not because the algorithm is wrong, is just that this particular pattern — that exists in the network — is not what you want to use as proxy for your communities.

    So, in the end, the message here is: know that this is a possible output of the “default” use of the algorithm, and change it with the option –force-conected=y if this property does not make sense for your particular application.

    Cheers,
    Marcelo.

    Reply
    1. Lars Juhl Jensen Post author

      Thanks for your comment! We fully agree that this is not a bug – the MCL algorithm does what the MCL algorithm was designed to do. Also, the issue is easily fixed by using the option to force connected clusters. However, I think it is fair to say that the example I present here is probably not what most people would expect to obtain. This is supported by the two follow-up blog posts that show how people using MCL to define orthologous groups and protein complexes have been bitten by this. This makes me think that many, if not most, users of MCL are blissfully unaware of this issue and should be running MCL with –force-connected=y to avoid it.

Leave a Reply

Please log in using one of these methods to post your comment:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s