Monthly Archives: March 2010

Analysis: Markov clustering and the case of the unsupported protein complexes

In 2006, Krogan and coworkers published a paper in Nature describing a global analysis of protein complexes in budding yeast. This resulted in a network of 7,123 protein-protein interactions involving 2,708 proteins, which was organized into 547 protein complexes using the Markov clustering algorithm.

Considering my previous two posts, it probably comes as a surprise to nobody that I wanted to check if the issue of unnatural clusters also affected this study. Albert Palleja, a postdoc in my group, thus extracted the 547 sub-networks corresponding the protein complexes and applied single-linkage clustering to check if all clusters corresponded to connected sub-networks.

It turned out that 9 of the 547 protein complexes do not correspond to connected sub-networks in the protein interaction network that formed the basis for the clustering. Two complexes each contain two additional subunits that have no interactions with any of the other subunits of the proposed complex, five complexes contain one additional subunit with no interactions to other subunits, and two complexes are proposed hetero-dimers made up of subunits that do not interact according to the interaction network. These complexes are visualized in the figure below with the erroneous subunits highlighted in red:

To check if these additional subunits are in any way supported by the experimental data presented in the paper, I downloaded the set of raw purification from the Krogan Lab Interactome Database. For 4 of the 9 complexes, the additional subunits are weakly supported by at least one purification. It should be noted, however, that this evidence was not judged to be sufficiently reliable by the authors themselves to include the interaction in the core network based on which the complexes were derived.

To make a long story short, this analysis shows that 9 of the 547 protein complexes published by Krogan and coworkers contain one or more subunits that are not supported by the interaction network from which the complexes were derived. Of these, 5 complexes contain subunits that have no support in the underlying experimental data, and which are purely artifacts of using the MCL algorithm without without enforcing that clusters must correspond to connected sub-networks.

Analysis: Markov clustering and the case of the nonhomologous orthologs

In the previous blog post I described how the MCL algorithm can sometimes produce unnatural clusters with disconnected parts. The C implementation of MCL has an option to suppress this behavior (--force-connected=y), but I suspect that it is rarely used. I have thus taken a closer look at some notable applications of MCL in bioinformatics to see if unnatural clusters arise in real data sets.

Here I will focus on OrthoMCL-DB, which is a database of orthologous groups of protein sequences. These were constructed by applying the MCL algorithm to the normalized results of an all-against-all BLAST search of the protein sequences.

To check the connectivity of the resulting orthologous groups, I downloaded OrthoMCL version 4 including the 13+ GB of gzipped BLAST results that formed the basis for the MCL clustering. I wish to thanks to the OrthoMCL-DB team for being very helpful and making this large data set available to me.

A few Perl scripts and CPU hours later, Albert Palleja and I had extracted the BLAST network for each of the 116,536 orthologous groups and performed single-linkage clustering to check if any of them contained disconnected parts. We found that this was the case for the following 28 orthologous groups:

Orthologous group Protein
OG4_10123 tcru|Tc00.1047053448329.10
OG4_10133 cmer|CMS291C
OG4_11608 bmor|BGIBMGA011561
OG4_13082 lbic|eu2.Lbscf0004g03370
OG4_17434 cint|ENSCINP00000028818
nvec|e_gw.40.282.1
OG4_20715 mbre|fgenesh2_pg.scaffold_4000474
OG4_20953 tpal|NP_218832
OG4_21182 tvag|TVAG_333570
OG4_24433 tmar|NP_229533
OG4_29163 tcru|Tc00.1047053508221.76
OG4_32884 gzea|FGST_11535
OG4_36484 cbri|WBGene00088730
cjej|YP_002344482
OG4_39391 ddis|DDB_G0279421
OG4_43780 cpar|cgd3_1080
OG4_44179 atha|NP_177880
OG4_44684 bmal|YP_104794
rbal|NP_868387
OG4_45409 rcom|29647.m002000
rcom|29848.m004679
OG4_50671 pram|C_scaffold_62000023
OG4_50712 bpse|YP_331887.1
OG4_52326 bmaa|14961.m05365
OG4_52455 bmal|YP_338428
OG4_55725 apis|XP_001952076
OG4_57272 bbov|XP_001610684.1
OG4_58797 hwal|YP_659316
OG4_61264 crei|122343
OG4_68577 bmor|BGIBMGA000864
OG4_71107 cbur|NP_819756
OG4_84041 tcru|Tc00.1047053479883.10

For convenience, the orthologous groups are linked to the corresponding web pages in OrthoMCL-DB, which enable viewing of Pfam domain architectures and multiple sequence alignments. Cursory inspection suggests that the majority of the of the sequences listed in the table do not belong to the orthologous groups in question.

Of the 28 orthologous groups, 24 groups contain a single protein with no BLAST hits to other group members, 2 groups each contain 2 such singletons, and the remaining 2 groups each contain 2 proteins that show weak similarity to each other but not to any other group members. The latter proteins are highlighted in red.

In summary, this analysis shows that the unnatural clustering by MCL reported for a toy example in the previous post also affects the results of real-world bioinformatics applications of the algorithm.

Analysis: Markov clustering and the case of the unnatural clusters

The MCL (Markov CLustering) algorithm was invented/discovered by Stijn van Dongen and was published in 2000. It has since become highly popular in bioinformatics and has proven to perform well on a variety of different problems.

It was also the method of choice when my postdoc Albert Palleja needed to cluster the human interaction network from the STRING database. However, we got strange results. More specifically, we observed that some clusters contained proteins that had no interactions with any other proteins within the same cluster. I call these unnatural clusters; this should be seen as a contrast to natural clusters, which are characterized by the presence of many edges between the members of a cluster.

After we had spent a week unsuccessfully trying to find out what we were doing wrong, I finally asked myself if it could be that we were not doing anything wrong. Might it be that applying the MCL algorithm to a protein interaction network can result in clusters of non-interacting proteins?

To test this, I constructed the following toy network consisting of only 10 nodes and 12 edges:

Assigning a weight of 1 to all edges and running this network through MCL using an inflation factor (the key parameter in the MCL algorithm) between 1.734 and 3.418 yields five clusters. In the figure below, the nodes are colored according to which cluster they belong to:

Note the black cluster which consists of two proteins, X and Y, despite the two nodes only being connected via nodes that are not part of the same cluster. This example clearly shows that the MCL algorithm is indeed capable of producing unnatural clusters containing nodes with no direct edges to any other members in the cluster.

In my view this is not as such a error in the the MCL algorithm. The algorithm is based on simulation of flow in the graph. The nodes X and Y are clustered due to the strong flow between them via nodes A, C, E, and G. However, I think it is fair to say that this behavior will catch many users by surprise and that it can give rise to misleading results when applying MCL to certain types of networks.

Edit: I suspect that this is the same issue that was reported on the Mcl-users mailing list by Sungwon Jung. Using the --force-connected=y option prevents the undesirable clustering of X and Y.