Archive for the 'Analysis' Category

Analysis: Christmas no longer in vogue!

December 24, 2011

I have just made an alarming discovery: judging from the biomedical literature, researchers appear to increasingly ignore Christmas.

My plan was to make a funny Christmas post looking at trivialities such as when during the year Christmas-related papers are posted. To this end, I did a trivial text-mining analysis that pulled out all papers mentioning “Christmas”, “Xmas”, or “X-mas” in the title or abstract. As a first check of the data, I looked at how many papes were published each year and was surprised to find only 20-30 in a typical year. To eliminate random fluctuations due to the low counts, I thus binned the data into decades before plotting the temporal trend (black dots are actual data points, red curve is a quadratic trendline):

The shocking result is that the frequency of Christmas-related papers has steadily dropped to less than half of what it was in the 1950s!

How can this be? I can think of several possibilities, and you are welcome to come with more in the comments:

  • We are running out of new funny things to say about Christmas.
  • An increasing proportion of researchers come from countries, in which Christmas is not widely celebrated.
  • Researchers have collectively stopped believing in Santa, as funding has dried up.

Merry Christmas Everyone!

Analysis: Toward doing science

May 20, 2011

Yesterday, Rangarajan and coworkers published a paper in BMC Bioinformtatics entitled “Toward an interactive article: integrating journals and biological databases”. Not many hours later Neil Saunders made the following tweet commenting on it:

Can we ban use of "toward(s)" in article titles?

This reminded me of a draft blog post that I wrote in 2008 on the use of the word “toward(s)” in article titles, and I decided that it was time to update the plot and finally publish it. The background was that I had the gut feeling that there was a somewhat disturbing trend, namely that more and more papers use these words in the title. I thus went to Medline and counted the fraction of papers from each year having a title starting with “toward” or “towards” (I also included them if towards appeared inside the title following a colon, semicolon, or dash):

The plot shows that fraction of articles with “toward(s)” in the title is rapidly rising; it has more than tripled over the past two decades. There is thus no doubt that the use of “toward(s)” in article titles is a trend in biomedical publishing.

As is often the case with statistics, though, this analysis answers only one question but leads to several new ones. Are we increasingly selling our papers on what we hope to do soon rather than on what we have actually done? Or have we just become more honest by now adding the word “toward(s)” where we might have left it out in the past?

Analysis: 10butnotMe

May 5, 2011

About five years ago George Church announced the Personal Genome Project (PGP). A very interesting aspect of this project is that all data are released under the Creative Commons Zero waiver. This includes not only the genetic data, but also some medical information and even the identity of each individual.

Although PGP has enrolled more than a thousand individuals, it is presently only possible to download data on ten individuals. It is obviously pointless to attempt to link genotype to phenotype based on such a small number of individuals. However, I wondered if any meaningful structure would emerge if I calculated the Hamming distances for all pairs of individuals, that is the number of SNPs by which they differ (download).

Like said so done. I downloaded all available SNP data from PGP (including array and exome sequencing data), calculated all pairwise SNP distances, and visualized the results as a heatmap along with the faces of the individuals (click for a larger version of the figure):

Number of SNP differences between PGP10 individuals

Individual #10 stands out as being genetically most dissimilar from everyone else, which is unsurprising as he is the only African American in the study. I next tried to similarly define the genetically most average individual, that is the individual that is most similar to everyone else. If one defines this as the individual with the lowest sum of differences, the answer is individual #7. However, because the origins of his grandparents are unknown, it is difficult to conclude anything interesting based on this.

Analysis: Three-dimensional DNA structure

November 2, 2010

A few months ago Bill Noble’s lab at University of Washington published a letter in Nature on a three-dimensional model of the complete nuclear genome of budding yeast:

A three-dimensional model of the yeast genome

Layered on top of information conveyed by DNA sequence and chromatin are higher order structures that encompass portions of chromosomes, entire chromosomes, and even whole genomes. Interphase chromosomes are not positioned randomly within the nucleus, but instead adopt preferred conformations. Disparate DNA elements co-localize into functionally defined aggregates or ‘factories’ for transcription and DNA replication. In budding yeast, Drosophila and many other eukaryotes, chromosomes adopt a Rabl configuration, with arms extending from centromeres adjacent to the spindle pole body to telomeres that abut the nuclear envelope. Nonetheless, the topologies and spatial relationships of chromosomes remain poorly understood. Here we developed a method to globally capture intra- and inter-chromosomal interactions, and applied it to generate a map at kilobase resolution of the haploid genome of Saccharomyces cerevisiae. The map recapitulates known features of genome organization, thereby validating the method, and identifies new features. Extensive regional and higher order folding of individual chromosomes is observed. Chromosome XII exhibits a striking conformation that implicates the nucleolus as a formidable barrier to interaction between DNA sequences at either end. Inter-chromosomal contacts are anchored by centromeres and include interactions among transfer RNA genes, among origins of early DNA replication and among sites where chromosomal breakpoints occur. Finally, we constructed a three-dimensional model of the yeast genome. Our findings provide a glimpse of the interface between the form and function of a eukaryotic genome.

Having previously worked with predicted 3D structure of DNA, such as intrinsic curvature, I was intrigued by the availability of a 3D structure of a complete eukaryotic genome. Based on past analyses of 1D distances in DNA, I expected that the 3D distance between two genes in the genome would correlate with expression, protein interactions, and metabolic pathways.

To test if 3D neighborhood correlates with function and/or regulation, I collected three large sets of protein pairs, namely pairs of co-expressed genes from the STRING database (Pearson correlation coefficient >0.7), interacting protein pairs from the BioGRID database, and pairs of genes assigned to the same pathway by the KEGG database. I subsequently mapped these onto the set of 3D neighbors listed in the supplementary information of the paper, including only 3D neighbors on different chromosomes (in order to eliminate correlations caused by 1D rather than 3D distance). I also mapped the three sets of gene pairs onto a shuffled version of the 3D neighbors, in order to estimate the overlaps that can be expected at random. The results are summarized in the table below:

3D neighbors Shuffled neighbors
Coexpressed (STRING) 58 61
Interacting (BioGRID) 2151 2122
Same pathway (KEGG) 357 344

To make a long story short, the numbers show that 3D genomic neighbors appear to be no more likely to be coexpressed, to interact, or to be involved in the same pathway than random pairs. It could be that they way I perform the analysis is too simplistic or that the data are too noisy to show a signal. However, it is also possible that the 3D structural organization of the genome simply doesn’t have much impact on gene regulation and function.

Analysis: Half of published URLs are dysfunctional a decade later

August 22, 2010

As a small aside when setting up a local mirror of Medline, I extracted 15,915 URLs that were mentioned in the abstracts. Checking them revealed that 12,354 of them (78%) were functional, which may not seem that bad. However, plotting the percentage of dysfunctional URLs as a function of publication year reveals a less pleasant trend:

Dysfunctional URLs

After just 10 years, half of all published URLs are no longer functional, and do not redirect to the new location of the service (if one exists). The fairly high success rate overall is merely a consequence of most URLs having been published within the last few years. Unless the persistence of URLs is improving (which I see no sign of in the plot), we can thus expect to have thousands of URLs in the published literature that are no longer valid.

Edit: Andrew Lang pointed out a similar study of URLs cited in communications journals.

Edit: Duncan Hull pointed out a paper on URL decay in Medline by Jonathan Wren, which reminded me of an even earlier paper on the topic.

Analysis: Markov clustering and the case of the unsupported protein complexes

March 3, 2010

In 2006, Krogan and coworkers published a paper in Nature describing a global analysis of protein complexes in budding yeast. This resulted in a network of 7,123 protein-protein interactions involving 2,708 proteins, which was organized into 547 protein complexes using the Markov clustering algorithm.

Considering my previous two posts, it probably comes as a surprise to nobody that I wanted to check if the issue of unnatural clusters also affected this study. Albert Palleja, a postdoc in my group, thus extracted the 547 sub-networks corresponding the protein complexes and applied single-linkage clustering to check if all clusters corresponded to connected sub-networks.

It turned out that 9 of the 547 protein complexes do not correspond to connected sub-networks in the protein interaction network that formed the basis for the clustering. Two complexes each contain two additional subunits that have no interactions with any of the other subunits of the proposed complex, five complexes contain one additional subunit with no interactions to other subunits, and two complexes are proposed hetero-dimers made up of subunits that do not interact according to the interaction network. These complexes are visualized in the figure below with the erroneous subunits highlighted in red:

To check if these additional subunits are in any way supported by the experimental data presented in the paper, I downloaded the set of raw purification from the Krogan Lab Interactome Database. For 4 of the 9 complexes, the additional subunits are weakly supported by at least one purification. It should be noted, however, that this evidence was not judged to be sufficiently reliable by the authors themselves to include the interaction in the core network based on which the complexes were derived.

To make a long story short, this analysis shows that 9 of the 547 protein complexes published by Krogan and coworkers contain one or more subunits that are not supported by the interaction network from which the complexes were derived. Of these, 5 complexes contain subunits that have no support in the underlying experimental data, and which are purely artifacts of using the MCL algorithm without without enforcing that clusters must correspond to connected sub-networks.

Analysis: Markov clustering and the case of the nonhomologous orthologs

March 2, 2010

In the previous blog post I described how the MCL algorithm can sometimes produce unnatural clusters with disconnected parts. The C implementation of MCL has an option to suppress this behavior (--force-connected=y), but I suspect that it is rarely used. I have thus taken a closer look at some notable applications of MCL in bioinformatics to see if unnatural clusters arise in real data sets.

Here I will focus on OrthoMCL-DB, which is a database of orthologous groups of protein sequences. These were constructed by applying the MCL algorithm to the normalized results of an all-against-all BLAST search of the protein sequences.

To check the connectivity of the resulting orthologous groups, I downloaded OrthoMCL version 4 including the 13+ GB of gzipped BLAST results that formed the basis for the MCL clustering. I wish to thanks to the OrthoMCL-DB team for being very helpful and making this large data set available to me.

A few Perl scripts and CPU hours later, Albert Palleja and I had extracted the BLAST network for each of the 116,536 orthologous groups and performed single-linkage clustering to check if any of them contained disconnected parts. We found that this was the case for the following 28 orthologous groups:

Orthologous group Protein
OG4_10123 tcru|Tc00.1047053448329.10
OG4_10133 cmer|CMS291C
OG4_11608 bmor|BGIBMGA011561
OG4_13082 lbic|eu2.Lbscf0004g03370
OG4_17434 cint|ENSCINP00000028818
nvec|e_gw.40.282.1
OG4_20715 mbre|fgenesh2_pg.scaffold_4000474
OG4_20953 tpal|NP_218832
OG4_21182 tvag|TVAG_333570
OG4_24433 tmar|NP_229533
OG4_29163 tcru|Tc00.1047053508221.76
OG4_32884 gzea|FGST_11535
OG4_36484 cbri|WBGene00088730
cjej|YP_002344482
OG4_39391 ddis|DDB_G0279421
OG4_43780 cpar|cgd3_1080
OG4_44179 atha|NP_177880
OG4_44684 bmal|YP_104794
rbal|NP_868387
OG4_45409 rcom|29647.m002000
rcom|29848.m004679
OG4_50671 pram|C_scaffold_62000023
OG4_50712 bpse|YP_331887.1
OG4_52326 bmaa|14961.m05365
OG4_52455 bmal|YP_338428
OG4_55725 apis|XP_001952076
OG4_57272 bbov|XP_001610684.1
OG4_58797 hwal|YP_659316
OG4_61264 crei|122343
OG4_68577 bmor|BGIBMGA000864
OG4_71107 cbur|NP_819756
OG4_84041 tcru|Tc00.1047053479883.10

For convenience, the orthologous groups are linked to the corresponding web pages in OrthoMCL-DB, which enable viewing of Pfam domain architectures and multiple sequence alignments. Cursory inspection suggests that the majority of the of the sequences listed in the table do not belong to the orthologous groups in question.

Of the 28 orthologous groups, 24 groups contain a single protein with no BLAST hits to other group members, 2 groups each contain 2 such singletons, and the remaining 2 groups each contain 2 proteins that show weak similarity to each other but not to any other group members. The latter proteins are highlighted in red.

In summary, this analysis shows that the unnatural clustering by MCL reported for a toy example in the previous post also affects the results of real-world bioinformatics applications of the algorithm.

Analysis: Markov clustering and the case of the unnatural clusters

March 1, 2010

The MCL (Markov CLustering) algorithm was invented/discovered by Stijn van Dongen and was published in 2000. It has since become highly popular in bioinformatics and has proven to perform well on a variety of different problems.

It was also the method of choice when my postdoc Albert Palleja needed to cluster the human interaction network from the STRING database. However, we got strange results. More specifically, we observed that some clusters contained proteins that had no interactions with any other proteins within the same cluster. I call these unnatural clusters; this should be seen as a contrast to natural clusters, which are characterized by the presence of many edges between the members of a cluster.

After we had spent a week unsuccessfully trying to find out what we were doing wrong, I finally asked myself if it could be that we were not doing anything wrong. Might it be that applying the MCL algorithm to a protein interaction network can result in clusters of non-interacting proteins?

To test this, I constructed the following toy network consisting of only 10 nodes and 12 edges:

Assigning a weight of 1 to all edges and running this network through MCL using an inflation factor (the key parameter in the MCL algorithm) between 1.734 and 3.418 yields five clusters. In the figure below, the nodes are colored according to which cluster they belong to:

Note the black cluster which consists of two proteins, X and Y, despite the two nodes only being connected via nodes that are not part of the same cluster. This example clearly shows that the MCL algorithm is indeed capable of producing unnatural clusters containing nodes with no direct edges to any other members in the cluster.

In my view this is not as such a error in the the MCL algorithm. The algorithm is based on simulation of flow in the graph. The nodes X and Y are clustered due to the strong flow between them via nodes A, C, E, and G. However, I think it is fair to say that this behavior will catch many users by surprise and that it can give rise to misleading results when applying MCL to certain types of networks.

Edit: I suspect that this is the same issue that was reported on the Mcl-users mailing list by Sungwon Jung. Using the --force-connected=y option prevents the undesirable clustering of X and Y.

Analysis: Correlating the PLoS article level metrics

January 15, 2010

A few months ago, the Public Library of Science (PLoS) made available a spreadsheet with article level metrics. Although others have already analyzed these data (see posts by Mike Chelen), I decided to take a closer look at the PLoS article level metrics.

The data set consists of 20 different article level metrics. However, some of these are very sparse and some are partially redundant. I thus decided to filter/merge these to create a reduced set of only 6 metrics:

  1. Blog posts. This value is the sum of Blog Coverage – Postgenomic, Blog Coverage – Nature Blogs, and Blog Coverage – Bloglines. A single blog post may obviously be picked up by multiple of these resources and hence be counted more than once. Being unable to count unique blog posts referring to a publication, I decided to aim for maximal coverage by using the sum rather than using data for only a single resource.
  2. Bookmarks. This value is the sum of Social Bookmarking – CiteULike and Social Bookmarking – Connotea. One cannot rule out that a single user bookmarks the same publication in both CiteULike and Connotea, but I would assume that most people use one or the other for bookmarking.
  3. Citations. This value is the sum of Citations – CrossRef, Citations – PubMed Central, and Citations – Scopus. I decided to use the sum to be consistent with the other metrics, but a single citation may obviously be picked up by more than one of these resources.
  4. Downloads. This value is called Combined Usage (HTML + PDF + XML) in the original data set and is the sum of Total HTML Page Views, Total PDF Downloads, and Total XML Downloads. Again the sum is used to be consistent.
  5. Ratings. This value is called Number of Ratings in the original data set. Because of the small number of articles with rating, notes, and comments, I decided to discard the related values Average Rating, Number of Note threads, Number of replies to Notes, Number of Comment threads, Number of replies to Comments, and Number of ‘Star Ratings’ that also include a text comment.
  6. Trackbacks. This value is called Number of Trackbacks in the original data set. I was greatly in doubt whether to merge this into the blog post metric, but in the end decided against doing so because trackbacks do not necessarily originate from blog posts.

Calculating all pairwise correlations among these metrics is obviously trivial. However, one has to be careful when interpreting the correlations as there are at least two major confounding factors. First, it is important to keep in mind that the PLoS article level metrics have been collected across several journals. Some of these journals are high impact journals such as PLoS Biology and PLoS Medicine, whereas others are lower impact journals such as PLoS ONE. One would expect that papers published in the former two journals will on average have higher values for most metrics than the latter journal. Papers published in journals with a web-savvy readership, e.g. PLoS Computational Biology, are more likely to receive blog posts and social bookmarks. Second, the age of a paper matters. Both downloads and in particular citations accumulate over time. To correct for these confounding factors, I constructed a normalized set of article level metrics, in which each metric for a given article was divided by the average for articles published the same year in the same journal.

I next calculated all pairwise Pearson correlation coefficients among the reduced set of article level metrics. To see the effect of the normalization, I did this for both the raw and the normalized metrics. I visualized the correlation coefficients as a heat map, showing the results for the raw metrics above the diagonal and the results for the normalized metrics below the diagonal.

There are a several interesting observations to be made from this figure:

  • Downloads correlate strongly with all the other metrics. This is hardly surprising, but it is reassuring to see that these correlations are not trivially explained by age and journal effects.
  • Bookmarks is the metric that apart from number of downloads correlates most strongly with Citations. This makes good sense since CiteULike and Connotea are commonly used as reference managers. If you add a paper to you bibliography database, you will likely cite it at some point.
  • Blog posts and Trackbacks correlate well with Downloads but poorly with citations. This may reflect that blog posts about research papers are often targeted towards a broad audience; if most of the readers of the blog posts are laymen or researchers from other fields, they will be unlikely to cite the papers covered in the blog posts.
  • Ratings correlates fairly poorly with every other metric. Combined with the low number of ratings, this makes me wonder if the option to rate papers on the journal web sites is all that useful.

Finally, I will point out one additional metrics that I would very much like to see added in future versions of this data set, namely microblogging. I personally discover many papers through others mentioning them on Twitter or FriendFeed. Because of the much smaller the effort involved in microblogging a paper as opposed to writing a full blog post about it, I suspect that the number of tweets that link to a paper would be a very informative metric.

Edit: I made a mistake in the normalization program, which I have now corrected. I have updated the figure and the conclusions to reflect the changes. It should be noted that some comments to this post were made prior to this correction.

Analysis: Limited agreement among lists of Cdc28p substrates

November 3, 2009

A collaboration between the Morgan lab at UCSF and the Gygi lab at Harvard has resulted in a paper by Holt et al. in Science, which reports the identification of several hundred substrates of the central cell-cycle kinase Cdc28p (also known as Cdk1) in the budding yeast Saccharomyces cerevisiae:

Global analysis of Cdk1 substrate phosphorylation sites provides insights into evolution.

To explore the mechanisms and evolution of cell-cycle control, we analyzed the position and conservation of large numbers of phosphorylation sites for the cyclin-dependent kinase Cdk1 in the budding yeast Saccharomyces cerevisiae. We combined specific chemical inhibition of Cdk1 with quantitative mass spectrometry to identify the positions of 547 phosphorylation sites on 308 Cdk1 substrates in vivo. Comparisons of these substrates with orthologs throughout the ascomycete lineage revealed that the position of most phosphorylation sites is not conserved in evolution; instead, clusters of sites shift position in rapidly evolving disordered regions. We propose that the regulation of protein function by phosphorylation often depends on simple nonspecific mechanisms that disrupt or enhance protein-protein interactions. The gain or loss of phosphorylation sites in rapidly evolving regions could facilitate the evolution of kinase-signaling circuits.

The paper makes several interested in analyses and observations. However, I found the comparison to the previous study of Cdc28p substrates by Ubersax et al. from the Morgan lab to be less detailed than I had hoped for:

Phosphorylation of Cdk1 consensus sites was observed on 67% (122 of 181) of proteins previously identified as Cdk1 substrates in vitro (4). Sixty-six percent (80 of 122) of these proteins contained sites at which phosphorylation decreased (log2 H/L < –1) after inhibition of Cdk1 (only 45 of 122 are expected if there is no correlation between the experiments in vitro and in vivo; χ2 test, P < 10-10).

In other words, 44% (80 of 181) of Cdc28p substrates identified in the old study were confirmed by the new study, and only 26% (80 of 308) of the Cdc28p substrates identified in the new study are supported by the old study. There are many possible explanations for this discrepancy

Depth of the mass spectrometry

It is notoriously difficult to identify peptides from low-abundance proteins in mass spectrometry. In the new mass spectrometry study, the authors were able to map 8710 precise phosphorylation sites on 1957 proteins. However, budding yeast is estimated to express in the order of 4500 distinct proteins during exponential growth (Gavin et al., 2006). Assuming that the majority of these proteins contain sites that are phosphorylated during at least part of the mitotic cell cycle, it is likely that a considerable number of low-abundance Cdc28p substrates identified in the old study have been missed in the new study.

Biases in phosphopeptide enrichment

When doing phosphoproteomics, it is necessary to first enrich for phosphopeptides to improve the coverage. To this end, Holt et al. used immobilized metal affinity chromatography (IMAC). In 2007, the Aebersold group at ETH published a paper showing that different purification methods lead to isolation of different, partially overlapping segments of the phosphoproteome. Specifically, they showed that IMAC enrichment biases the data towards isolation of multiply phosphorylated peptides. Given that only a single purification method was used, it is likely that in vivo Cdc28p substrates may have been missed in the new study, in particular if the peptides contain only a single phosphorylation site.

In vitro vs. in vivo conditions

The old study by Ubersax et al. was done performed on cell lysate, which is an in vitro strategy (although all other proteins expressed during the cell cycle are present). It is thus likely that some of the proteins that are phosphorylated by Cdc28p under these conditions are nonetheless not in vivo Cdc28p substrates.

Can we do better?

As always, it is easy to point out potential flaws in other people’s data sets; however, it is much more constructive to do something about the problems. The challenge is thus to construct a larger and more reliable set of Cdc28p substrates by combining the data from the two studies.

To check the feasibility of assigning confidence scores to different putative Cdc28p substrates, I tested if the fold change observed in the new study correlates with the chance that the substrate was also identified in the old study. To this end, I divided the 308 Cdc28p substrates from the new studies into two groups and constructed histograms of the fold changes for each group:

Phosphorylation ratios from Holt et al.

The fold changes are clearly skewed towards larger negative values for the Cdc28p substrates also identified by the old study relative to the proteins that were not previously identified as Cdc28p substrates. This difference is statistically significant at P < 1% according to the Kolmogorov-Smirnov test. This suggests that the observed fold changes in the new mass spectrometry study correlates with the likelihood that the proteins are true Cdc28p substrates.

The old study gave rise to so-called P-score for the individual proteins (not to be confused with P-values). I decided to test if these too can be used as quality scores, I constructed an equivalent histogram in which the Cdc28p substrates found in the old study were divided into two groups based on whether or not they were also found in the new study:

P-scores from Ubersax et al.

In this case, no obvious trend is seen and a Kolmogorov-Smirnov test indeed reveals no statistically significant difference between the two distributions. Surprisingly, the P-scores do thus not appear to be useful quality scores for the putative Cdc28p substrates.

Given the two sets of putative Cdc28 substrates, only one of which can be ranked by reliability, how can we create a better combined set? If one aims for the high accuracy at the price of low coverage, one could obviously choose to trust only the substrates identified by both screens. However, given the caveats regarding depth of mass spectrometry and biases arising from the enrichment procedure, I would be hesitant to use this approach. Alternatively, one could aim for maximal coverage at the price of accuracy by trusting all sites identified by either study. However, seeing the large fraction of novel substrates identified by Holt et al. with a log2-ratio only slightly below -1, I would personally tend to apply a more stringent threshold to the data from the new study by Holt et al., for example requiring log2-ratio below -2, before merging the sets of substrates from the two studies.

WebCiteCite this post

Follow

Get every new post delivered to your Inbox.

Join 445 other followers