Exercise: Using the STITCH database

January 26, 2010

The STITCH database contains functional associations among proteins and small molecules.

Try searching STITCH with the human thymidylate synthase (TYMS) protein as input. The resulting network includes several small molecules.

Questions:

  • Can you identify the products of thymidylate synthase among them?
  • Are the reactants also present in the network?

Sometimes the proteins or small molecules that you search for may not be immediately shown by STITCH. To find what you are looking for you may have to extend the network.

Questions:

  • Do the small molecules that were missing in the questions above appear when clicking the Add nodes button?
  • Can you construct a clearer network with fewer interactions by changing the network parameters at the bottom of the page?

Thymidine is required for DNA replication and repair to take place, and inhibition of thymidylate synthase is thus harmful to proliferating cells. Indeed, most of the small molecules in the network are drugs used for chemotherapy.

Questions:

  • Are these drugs structurally similar to each other?
  • Are they similar to the substrate of thymidylate synthase?
  • Can you suggest a mechanism of action?

Analysis: Correlating the PLoS article level metrics

January 15, 2010

A few months ago, the Public Library of Science (PLoS) made available a spreadsheet with article level metrics. Although others have already analyzed these data (see posts by Mike Chelen), I decided to take a closer look at the PLoS article level metrics.

The data set consists of 20 different article level metrics. However, some of these are very sparse and some are partially redundant. I thus decided to filter/merge these to create a reduced set of only 6 metrics:

  1. Blog posts. This value is the sum of Blog Coverage – Postgenomic, Blog Coverage – Nature Blogs, and Blog Coverage – Bloglines. A single blog post may obviously be picked up by multiple of these resources and hence be counted more than once. Being unable to count unique blog posts referring to a publication, I decided to aim for maximal coverage by using the sum rather than using data for only a single resource.
  2. Bookmarks. This value is the sum of Social Bookmarking – CiteULike and Social Bookmarking – Connotea. One cannot rule out that a single user bookmarks the same publication in both CiteULike and Connotea, but I would assume that most people use one or the other for bookmarking.
  3. Citations. This value is the sum of Citations – CrossRef, Citations – PubMed Central, and Citations – Scopus. I decided to use the sum to be consistent with the other metrics, but a single citation may obviously be picked up by more than one of these resources.
  4. Downloads. This value is called Combined Usage (HTML + PDF + XML) in the original data set and is the sum of Total HTML Page Views, Total PDF Downloads, and Total XML Downloads. Again the sum is used to be consistent.
  5. Ratings. This value is called Number of Ratings in the original data set. Because of the small number of articles with ratings, notes, and comments, I decided to discard the related values Average Rating, Number of Note threads, Number of replies to Notes, Number of Comment threads, Number of replies to Comments, and Number of ‘Star Ratings’ that also include a text comment.
  6. Trackbacks. This value is called Number of Trackbacks in the original data set. I was greatly in doubt whether to merge this into the blog post metric, but in the end decided against doing so because trackbacks do not necessarily originate from blog posts.

Calculating all pairwise correlations among these metrics is obviously trivial. However, one has to be careful when interpreting the correlations as there are at least two major confounding factors. First, it is important to keep in mind that the PLoS article level metrics have been collected across several journals. Some of these journals are high impact journals such as PLoS Biology and PLoS Medicine, whereas others are lower impact journals such as PLoS ONE. One would expect that papers published in the former two journals will on average have higher values for most metrics than the latter journal. Papers published in journals with a web-savvy readership, e.g. PLoS Computational Biology, are more likely to receive blog posts and social bookmarks. Second, the age of a paper matters. Both downloads and in particular citations accumulate over time. To correct for these confounding factors, I constructed a normalized set of article level metrics, in which each metric for a given article was divided by the average for articles published the same year in the same journal.
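The normalization described above can be sketched in a few lines of Python. This is only an illustration and not the program used for the analysis: the record layout (dictionaries with journal, year, and metric keys), the function name normalize_by_group, and the toy numbers are all assumptions made for the example.

```python
from collections import defaultdict


def normalize_by_group(articles, metrics):
    """Divide each metric by its mean over articles from the same journal and year."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for article in articles:
        key = (article["journal"], article["year"])
        counts[key] += 1
        for metric in metrics:
            sums[key][metric] += article[metric]

    normalized = []
    for article in articles:
        key = (article["journal"], article["year"])
        row = {"journal": article["journal"], "year": article["year"]}
        for metric in metrics:
            mean = sums[key][metric] / counts[key]
            # Assumption: metrics are non-negative counts, so a zero mean means
            # every value in the group is zero; report 0.0 instead of dividing by zero.
            row[metric] = article[metric] / mean if mean > 0 else 0.0
        normalized.append(row)
    return normalized


# Toy records invented for illustration (not PLoS data).
articles = [
    {"journal": "PLoS Biology", "year": 2008, "downloads": 3000, "citations": 10},
    {"journal": "PLoS Biology", "year": 2008, "downloads": 1000, "citations": 30},
    {"journal": "PLoS ONE", "year": 2009, "downloads": 500, "citations": 0},
    {"journal": "PLoS ONE", "year": 2009, "downloads": 1500, "citations": 0},
]
normalized = normalize_by_group(articles, ["downloads", "citations"])
```

After normalization, a value of 1.0 means that the article is exactly average for its journal and year, which is what makes articles from different journals and years comparable.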

I next calculated all pairwise Pearson correlation coefficients among the reduced set of article level metrics. To see the effect of the normalization, I did this for both the raw and the normalized metrics. I visualized the correlation coefficients as a heat map, showing the results for the raw metrics above the diagonal and the results for the normalized metrics below the diagonal.
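As a rough sketch of this step, the code below computes Pearson correlation coefficients and arranges them in one matrix with the raw metrics above the diagonal and the normalized metrics below it. The function names, the column-oriented input format, and the toy values are assumptions for illustration; the plotting of the heat map itself is left out.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # The coefficient is undefined when one of the metrics is constant.
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


def combined_matrix(raw, normalized, names):
    """Raw correlations above the diagonal, normalized ones below it."""
    size = len(names)
    matrix = [[1.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            if i < j:
                matrix[i][j] = pearson(raw[names[i]], raw[names[j]])
            elif i > j:
                matrix[i][j] = pearson(normalized[names[i]], normalized[names[j]])
    return matrix


# Toy columns invented for illustration (not PLoS data).
names = ["downloads", "citations"]
raw = {"downloads": [1, 2, 3, 4], "citations": [2, 4, 6, 8]}
normalized = {"downloads": [1, 2, 3, 4], "citations": [4, 3, 2, 1]}
matrix = combined_matrix(raw, normalized, names)
```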

There are several interesting observations to be made from this figure:

  • Downloads correlate strongly with all the other metrics. This is hardly surprising, but it is reassuring to see that these correlations are not trivially explained by age and journal effects.
  • Bookmarks is the metric that apart from number of downloads correlates most strongly with Citations. This makes good sense since CiteULike and Connotea are commonly used as reference managers. If you add a paper to your bibliography database, you will likely cite it at some point.
  • Blog posts and Trackbacks correlate well with Downloads but poorly with citations. This may reflect that blog posts about research papers are often targeted towards a broad audience; if most of the readers of the blog posts are laymen or researchers from other fields, they will be unlikely to cite the papers covered in the blog posts.
  • Ratings correlates fairly poorly with every other metric. Combined with the low number of ratings, this makes me wonder if the option to rate papers on the journal web sites is all that useful.

Finally, I will point out one additional metric that I would very much like to see added in future versions of this data set, namely microblogging. I personally discover many papers through others mentioning them on Twitter or FriendFeed. Because of the much smaller effort involved in microblogging a paper as opposed to writing a full blog post about it, I suspect that the number of tweets that link to a paper would be a very informative metric.

Edit: I made a mistake in the normalization program, which I have now corrected. I have updated the figure and the conclusions to reflect the changes. It should be noted that some comments to this post were made prior to this correction.


Job: Bioinformatics scientist in Protein Production Unit of the NNF Center for Protein Research

January 7, 2010

At the Novo Nordisk Foundation Center for Protein Research we are looking for a scientist to provide bioinformatics support for the Protein Production Unit. For further details, please see the job advert below the fold.


Editorial: What is the difference between Twitter and Second Life?

January 6, 2010

I admit that this may seem a strange question. One is a microblogging platform that allows you to read and write messages of at most 140 characters. The other is a 3D virtual world. They are, however, both communication tools, and I think there is a completely different reason why Twitter is so much more useful to me than Second Life is.

If I go to the Twitter web site and log in, I see this:

It is Twitter. It immediately shows me the main content: tweets. It also allows me to create content, that is to tweet.

By contrast, if I go to the Second Life web site and log in, I see this:

It is not Second Life. It is a complex web interface that gives me access to account administration tools and shows me lists of blog posts by Linden Lab, comments from Second Life users, items for sale on Xstreet SL, and video tutorials. In the lower left corner it shows me the only really useful information, namely which of my friends are online. That is, the friends that I would have been able to chat with, had I been in Second Life and not on the web site, which does not allow you to read or write messages.

Imagine if the web interface of Second Life would instead show me this:

It would be Second Life. It would immediately show me the main content: the virtual world. It would also allow me to interact with the content, that is to move around, to chat with people, and even to create content. Do you think Second Life would have more users if it would run inside your web browser? I think so. Linden Lab is and has been focusing on improving the initial user experience in Second Life to improve the retention rate (i.e. the fraction of new users that continue to come back). I am not saying that this is not important, but I think that most of the potential users are lost long before they even get into the virtual world.

This is by no means a problem that is specific to Second Life. Today, asking users to install a piece of software on their computer will cause the majority of people to shy away before they have tried your product. Even just asking users to create an account will cause many to turn around and walk away. When it comes to social networks, the decisive factor is users. If your friends are not there, why should you be? Imagine a virtual world that would run in your web browser and which you could sign into using OpenID, Twitter Connect, or Facebook Connect. Would your friends be there? Would you?


Job: Postdoctoral position in RNA bioinformatics and systems biology

December 27, 2009

In collaboration with Jan Gorodkin at the Faculty for Life Sciences at University of Copenhagen, I will be starting up a project related to non-coding RNAs and their interactions with mRNAs. We have secured funding for the project and are thus now searching for the right person to fill a postdoc position. For further details, please read the job advert below the fold.


Update: The BuzzCloud for 2009

December 22, 2009

It is that time of the year again: NCBI has rolled out the new PubMed baseline, and it is my pleasure to present you with the latest and greatest of biomedical buzzwords. I present to you the BuzzCloud 2009 (click for a larger interactive version):

In case you have no idea what a BuzzCloud is, it is a visualization of some of the most trendy words in PubMed. To make a long story short, the size of the word represents how many times it was mentioned in the past, whereas the brightness represents how much it was mentioned in the year compared to the previous ten years. For more details, please refer to the original blog post.

The three largest words on the BuzzCloud 2009 are all reruns from earlier years: metagenomics and synthetic biology were both first seen on the BuzzCloud 2004, and click chemistry appeared in 2006. One can only conclude that these research areas continue to grow.

At the other end of the scale we have the small and bright words. These are the words that are rising most rapidly but have not appeared that many times in PubMed yet. Below are three selected examples that I think may be of particular interest to the readership of this blog.

  • Personal genomics. No surprise here except that I expected this word would have turned up much earlier considering the broad publicity of the 1000 Genomes Project and the Personal Genome Project.
  • Proteogenomics. Why we need a separate word for referring to the combination of proteomics and genomics is beyond me. There is even a paper on comparative proteogenomics published in Genome Research. One can only wonder when someone will compare metabolomics, proteomics, transcriptomics, and genomics data across environmental samples and coin the term comparative metametaboproteotranscriptogenomics.
  • Translational bioinformatics. Where bioinformatics meets clinical medicine (see blog post by Russ Altman). I think that bioinformaticians are indeed increasingly working on medically relevant data, which in my view is a good thing. It just makes me wonder what happened to medical informatics?

On a closing note, I am again pleasantly surprised how well the words picked up by a completely automated procedure fit with the ongoing activities in my lab. It is almost eerie.


Poll: Do you publish in the NAR database issue, and do you read what others publish there?

December 21, 2009

The annual database issue of Nucleic Acids Research is now online. This year it contains a staggering 135 papers, which should be enough to keep all bioinformaticians busy over Christmas.

This makes me wonder how many of the readers of this blog have published in the NAR database issue (not necessarily this year), and how many of you actually read what others publish there. I have thus set up a highly unscientific poll:

The terms many, some, and very few are obviously somewhat fuzzy. As a rough guideline, I would define many as >10 papers per issue, some as 5-10, and very few as <5.


Analysis: Limited agreement among lists of Cdc28p substrates

November 3, 2009

A collaboration between the Morgan lab at UCSF and the Gygi lab at Harvard has resulted in a paper by Holt et al. in Science, which reports the identification of several hundred substrates of the central cell-cycle kinase Cdc28p (also known as Cdk1) in the budding yeast Saccharomyces cerevisiae:

Global analysis of Cdk1 substrate phosphorylation sites provides insights into evolution.

To explore the mechanisms and evolution of cell-cycle control, we analyzed the position and conservation of large numbers of phosphorylation sites for the cyclin-dependent kinase Cdk1 in the budding yeast Saccharomyces cerevisiae. We combined specific chemical inhibition of Cdk1 with quantitative mass spectrometry to identify the positions of 547 phosphorylation sites on 308 Cdk1 substrates in vivo. Comparisons of these substrates with orthologs throughout the ascomycete lineage revealed that the position of most phosphorylation sites is not conserved in evolution; instead, clusters of sites shift position in rapidly evolving disordered regions. We propose that the regulation of protein function by phosphorylation often depends on simple nonspecific mechanisms that disrupt or enhance protein-protein interactions. The gain or loss of phosphorylation sites in rapidly evolving regions could facilitate the evolution of kinase-signaling circuits.

The paper makes several interesting analyses and observations. However, I found the comparison to the previous study of Cdc28p substrates by Ubersax et al. from the Morgan lab to be less detailed than I had hoped for:

Phosphorylation of Cdk1 consensus sites was observed on 67% (122 of 181) of proteins previously identified as Cdk1 substrates in vitro (4). Sixty-six percent (80 of 122) of these proteins contained sites at which phosphorylation decreased (log2 H/L < –1) after inhibition of Cdk1 (only 45 of 122 are expected if there is no correlation between the experiments in vitro and in vivo; χ² test, P < 10^-10).

In other words, 44% (80 of 181) of Cdc28p substrates identified in the old study were confirmed by the new study, and only 26% (80 of 308) of the Cdc28p substrates identified in the new study are supported by the old study. There are many possible explanations for this discrepancy.

Depth of the mass spectrometry

It is notoriously difficult to identify peptides from low-abundance proteins in mass spectrometry. In the new mass spectrometry study, the authors were able to map 8710 precise phosphorylation sites on 1957 proteins. However, budding yeast is estimated to express on the order of 4500 distinct proteins during exponential growth (Gavin et al., 2006). Assuming that the majority of these proteins contain sites that are phosphorylated during at least part of the mitotic cell cycle, it is likely that a considerable number of low-abundance Cdc28p substrates identified in the old study have been missed in the new study.

Biases in phosphopeptide enrichment

When doing phosphoproteomics, it is necessary to first enrich for phosphopeptides to improve the coverage. To this end, Holt et al. used immobilized metal affinity chromatography (IMAC). In 2007, the Aebersold group at ETH published a paper showing that different purification methods lead to isolation of different, partially overlapping segments of the phosphoproteome. Specifically, they showed that IMAC enrichment biases the data towards isolation of multiply phosphorylated peptides. Given that only a single purification method was used, it is likely that in vivo Cdc28p substrates may have been missed in the new study, in particular if the peptides contain only a single phosphorylation site.

In vitro vs. in vivo conditions

The old study by Ubersax et al. was performed on cell lysate, which is an in vitro strategy (although all other proteins expressed during the cell cycle are present). It is thus likely that some of the proteins that are phosphorylated by Cdc28p under these conditions are nonetheless not in vivo Cdc28p substrates.

Can we do better?

As always, it is easy to point out potential flaws in other people’s data sets; however, it is much more constructive to do something about the problems. The challenge is thus to construct a larger and more reliable set of Cdc28p substrates by combining the data from the two studies.

To check the feasibility of assigning confidence scores to different putative Cdc28p substrates, I tested if the fold change observed in the new study correlates with the chance that the substrate was also identified in the old study. To this end, I divided the 308 Cdc28p substrates from the new study into two groups, depending on whether or not they were also identified in the old study, and constructed histograms of the fold changes for each group:

Phosphorylation ratios from Holt et al.

The fold changes are clearly skewed towards larger negative values for the Cdc28p substrates also identified by the old study relative to the proteins that were not previously identified as Cdc28p substrates. This difference is statistically significant (P < 0.01) according to the Kolmogorov-Smirnov test. This suggests that the observed fold changes in the new mass spectrometry study correlate with the likelihood that the proteins are true Cdc28p substrates.
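For readers who want to try this kind of comparison themselves, here is a minimal sketch of a two-sample Kolmogorov-Smirnov test in plain Python. It is not the code used for the analysis above. The P-value comes from the standard asymptotic series for the Kolmogorov distribution, so it is only an approximation for small samples, and the log2 ratios in the example are made up.

```python
import math
from bisect import bisect_right


def ks_two_sample(x, y):
    """Return the KS statistic D and an asymptotic two-sided P-value."""
    xs, ys = sorted(x), sorted(y)
    n, m = len(xs), len(ys)
    d = 0.0
    # The largest gap between the two empirical CDFs occurs at an observed value.
    for value in xs + ys:
        cdf_x = bisect_right(xs, value) / n
        cdf_y = bisect_right(ys, value) / m
        d = max(d, abs(cdf_x - cdf_y))
    lam = d * math.sqrt(n * m / (n + m))
    if lam < 0.2:
        # The series below converges poorly here, and the P-value is 1 to
        # within about 1e-12 anyway.
        return d, 1.0
    p = 2.0 * sum((-1) ** (j - 1) * math.exp(-2.0 * j * j * lam * lam)
                  for j in range(1, 101))
    return d, min(max(p, 0.0), 1.0)


# Made-up log2 ratios for illustration only.
also_in_old_study = [-3.2, -2.9, -2.7, -2.4, -2.1]
new_study_only = [-1.9, -1.6, -1.4, -1.2, -1.1]
d_stat, p_value = ks_two_sample(also_in_old_study, new_study_only)
```

With real data, the two input lists would hold the log2 ratios of the substrates that were and were not found in the old study, respectively.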

The old study gave rise to so-called P-scores for the individual proteins (not to be confused with P-values). To test if these too can be used as quality scores, I constructed an equivalent histogram in which the Cdc28p substrates found in the old study were divided into two groups based on whether or not they were also found in the new study:

P-scores from Ubersax et al.

In this case, no obvious trend is seen and a Kolmogorov-Smirnov test indeed reveals no statistically significant difference between the two distributions. Surprisingly, the P-scores do thus not appear to be useful quality scores for the putative Cdc28p substrates.

Given the two sets of putative Cdc28p substrates, only one of which can be ranked by reliability, how can we create a better combined set? If one aims for high accuracy at the price of low coverage, one could obviously choose to trust only the substrates identified by both screens. However, given the caveats regarding depth of mass spectrometry and biases arising from the enrichment procedure, I would be hesitant to use this approach. Alternatively, one could aim for maximal coverage at the price of accuracy by trusting all substrates identified by either study. However, seeing the large fraction of novel substrates identified by Holt et al. with a log2-ratio only slightly below -1, I would personally tend to apply a more stringent threshold to the data from the new study by Holt et al., for example requiring a log2-ratio below -2, before merging the sets of substrates from the two studies.
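The last option can be sketched as follows. The function name, the data layout, and the protein names are hypothetical placeholders, and the default threshold of -2 is simply the example value mentioned above rather than a tuned cutoff.

```python
def combine_substrates(old_substrates, new_log2_ratios, threshold=-2.0):
    """Merge the old substrate set with the stringently filtered new one.

    old_substrates: iterable of protein names from the old study.
    new_log2_ratios: dict mapping protein name to its log2 ratio in the new study.
    """
    stringent_new = {
        protein for protein, ratio in new_log2_ratios.items() if ratio < threshold
    }
    # Substrates from the old study are kept regardless of their ratio in the new study.
    return set(old_substrates) | stringent_new


# Placeholder names and ratios, invented for illustration.
old_substrates = {"protA", "protB"}
new_log2_ratios = {"protB": -1.3, "protC": -2.6, "protD": -1.1}
combined = combine_substrates(old_substrates, new_log2_ratios)
```

Note that a protein such as protB in the example, which only barely passes the original cutoff of -1 in the new study, is still retained because it is supported by the old study.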



Editorial: Social network plumbing

August 12, 2009

I guess it is no secret to anyone that Facebook has agreed to acquire FriendFeed. Several people seem puzzled why I left FriendFeed only 3 hours after learning this news. I can understand that this may look like a knee-jerk reaction, but there is logic behind the madness.

The truth is that my existing setup of Web 2.0 services was not working nearly as well as I would like. The sheer amount of content being shared on FriendFeed meant that it was easy to overlook a blog post from one of my favorite bloggers, for which reason I still subscribed to their blogs as RSS feeds. This caused me to waste time because the same posts appeared in two places, and I could not filter out the blogs on FriendFeed because most comments would be posted there and not on the blogs. Receiving everyone’s tweets on FriendFeed tended to create a background noise that would drown all other conversation; however, I could also not filter out the Twitter streams on FriendFeed and follow people directly on Twitter instead because many cross-post all their FriendFeed “likes” and/or comments to Twitter!

Given the new situation, it was clear to me that the time had come to fix my broken social network setup and redo the plumbing in such a way that FriendFeed would no longer be responsible for gathering most of the content. Looking at FriendFeed, I discovered that most of the content of interest originated from just three sources: RSS feeds of blogs, Google Reader shared items, and Twitter. By following people directly on Google Reader and Twitter, both of which I was already using on a daily basis, I was thus able to relegate FriendFeed to a much less important role. I still feed my content from other sources into FriendFeed and I occasionally check for comments on my posts; however, it is no longer where I read content posted by others. Coincidentally, the new role of FriendFeed is almost identical to the role that Facebook has played all along.

To make a long story short, I’m not leaving the friendly community at FriendFeed in anger. I still read the content produced and shared by the same people as before. I have just fixed the plumbing.


Analysis: Results from thermal stability shift and competition binding assays correlate well

July 31, 2009

Several large kinase inhibitor screens have been published in recent years. Two of the largest come from Stefan Knapp’s lab and Ambit, respectively. The former group used a temperature shift assay to measure the change in thermal stability of 60 human serine/threonine kinases that is caused by the binding of each of 156 kinase inhibitors (Fedorov et al., 2007). The latter group used a competition binding assay to measure the dissociation constants (Kd) for 38 kinase inhibitors and 290 distinct kinases (Karaman et al., 2008).

The two screens are not directly comparable because one measures temperature shifts whereas the other measures dissociation constants. To see if it is possible to convert temperature shift values to Kd values, I asked Damian Szklarczyk (who is a Ph.D. student in my group) to map all data from both screens onto a common set of chemical and protein identifiers, extract all inhibitor-kinase pairs that were measured in both assays, and make a scatter plot of -log(Kd) as function of temperature shift. The result was a set of 704 pairs of temperature shift and Kd values. In the plot below, inhibitor-kinase pairs for which binding was not observed in the competition binding assay were defined to have a Kd of 10 µM, and negative values from the temperature shift assay were treated as zero temperature shift.

Correlation between temperature shift and -log(Kd)

The plot shows that the two assays are in very good agreement, which is surprising considering that the assays are fundamentally very different and were run using different expression constructs for several of the kinases. The linear Pearson correlation coefficient is 0.92 when excluding the one obvious outlier shown in red (BIRB796 vs. MAPK11; this appears to be a false negative in the competition binding assay).

The linear fit gives an intercept with the y-axis of 4.9223, which implies that a temperature shift of zero (i.e. no binding according to the temperature shift assay) does not translate precisely into a Kd of 10 µM (i.e. no binding according to the competition binding assay). We thus did a second linear regression in which we forced the intercept with the y-axis to 5 (red regression line in the plot). We thereby arrived at the calibration function -log(Kd) = 5 + 0.244*Ts, which allows us to convert temperature shifts to Kd values. We have thereby managed to put the measurements from the two kinase inhibitor screens onto a common basis that facilitates direct comparison and integration.
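A small sketch of both steps is given below: a least-squares fit with the intercept held fixed, and the resulting conversion from temperature shift to Kd. It assumes that the logarithm is base 10 and that Kd is expressed in molar, which is what makes 10 µM correspond to an intercept of 5. The function names and toy data points are invented for the example; only the intercept of 5 and the slope of 0.244 come from the fit described above.

```python
def fit_slope_fixed_intercept(ts_values, neg_log_kd_values, intercept=5.0):
    """Least-squares slope of y = intercept + slope * x with the intercept fixed."""
    numerator = sum(x * (y - intercept) for x, y in zip(ts_values, neg_log_kd_values))
    denominator = sum(x * x for x in ts_values)
    # Requires at least one non-zero temperature shift, otherwise the slope is undefined.
    return numerator / denominator


def ts_to_kd_molar(ts, intercept=5.0, slope=0.244):
    """Convert a temperature shift to a Kd in molar via -log10(Kd) = intercept + slope * Ts."""
    ts = max(ts, 0.0)  # negative shifts are treated as zero, as in the plot
    return 10.0 ** -(intercept + slope * ts)


# Toy points lying exactly on y = 5 + 0.25 * x, invented for illustration.
toy_slope = fit_slope_fixed_intercept([0.0, 2.0, 4.0], [5.0, 5.5, 6.0])

kd_no_shift = ts_to_kd_molar(0.0)   # corresponds to the 10 µM "no binding" value
kd_ten_degrees = ts_to_kd_molar(10.0)
```

Under these assumptions, a temperature shift of 10 degrees corresponds to a Kd of roughly 36 nM.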

Full disclosure: I have an on-going collaboration with Stefan Knapp’s lab related to screening of kinase inhibitors.
