A collaboration between the Morgan lab at UCSF and the Gygi lab at Harvard has resulted in a paper by Holt et al. in Science, which reports the identification of several hundred substrates of the central cell-cycle kinase Cdc28p (also known as Cdk1) in the budding yeast Saccharomyces cerevisiae:
Global analysis of Cdk1 substrate phosphorylation sites provides insights into evolution.
To explore the mechanisms and evolution of cell-cycle control, we analyzed the position and conservation of large numbers of phosphorylation sites for the cyclin-dependent kinase Cdk1 in the budding yeast Saccharomyces cerevisiae. We combined specific chemical inhibition of Cdk1 with quantitative mass spectrometry to identify the positions of 547 phosphorylation sites on 308 Cdk1 substrates in vivo. Comparisons of these substrates with orthologs throughout the ascomycete lineage revealed that the position of most phosphorylation sites is not conserved in evolution; instead, clusters of sites shift position in rapidly evolving disordered regions. We propose that the regulation of protein function by phosphorylation often depends on simple nonspecific mechanisms that disrupt or enhance protein-protein interactions. The gain or loss of phosphorylation sites in rapidly evolving regions could facilitate the evolution of kinase-signaling circuits.
The paper makes several interesting analyses and observations. However, I found the comparison to the previous study of Cdc28p substrates by Ubersax et al. from the Morgan lab to be less detailed than I had hoped for:
Phosphorylation of Cdk1 consensus sites was observed on 67% (122 of 181) of proteins previously identified as Cdk1 substrates in vitro (4). Sixty-six percent (80 of 122) of these proteins contained sites at which phosphorylation decreased (log2 H/L < –1) after inhibition of Cdk1 (only 45 of 122 are expected if there is no correlation between the experiments in vitro and in vivo; χ² test, P < 10⁻¹⁰).
In other words, 44% (80 of 181) of the Cdc28p substrates identified in the old study were confirmed by the new study, and only 26% (80 of 308) of the Cdc28p substrates identified in the new study are supported by the old study. There are many possible explanations for this discrepancy:
Depth of the mass spectrometry
It is notoriously difficult to identify peptides from low-abundance proteins by mass spectrometry. In the new mass spectrometry study, the authors were able to map 8710 precise phosphorylation sites on 1957 proteins. However, budding yeast is estimated to express on the order of 4500 distinct proteins during exponential growth (Gavin et al., 2006). Assuming that the majority of these proteins contain sites that are phosphorylated during at least part of the mitotic cell cycle, it is likely that a considerable number of low-abundance Cdc28p substrates identified in the old study have been missed in the new study.
Biases in phosphopeptide enrichment
When doing phosphoproteomics, it is necessary to first enrich for phosphopeptides to improve the coverage. To this end, Holt et al. used immobilized metal affinity chromatography (IMAC). In 2007, the Aebersold group at ETH published a paper showing that different purification methods lead to isolation of different, partially overlapping segments of the phosphoproteome. Specifically, they showed that IMAC enrichment biases the data towards isolation of multiply phosphorylated peptides. Given that only a single purification method was used, it is likely that some in vivo Cdc28p substrates have been missed in the new study, in particular those whose peptides contain only a single phosphorylation site.
In vitro vs. in vivo conditions
The old study by Ubersax et al. was performed on cell lysate, which is an in vitro strategy (although all other proteins expressed during the cell cycle are present). It is thus likely that some of the proteins that are phosphorylated by Cdc28p under these conditions are nonetheless not in vivo Cdc28p substrates.
Can we do better?
As always, it is easy to point out potential flaws in other people’s data sets; however, it is much more constructive to do something about the problems. The challenge is thus to construct a larger and more reliable set of Cdc28p substrates by combining the data from the two studies.
To check the feasibility of assigning confidence scores to different putative Cdc28p substrates, I tested if the fold change observed in the new study correlates with the chance that the substrate was also identified in the old study. To this end, I divided the 308 Cdc28p substrates from the new study into two groups, according to whether or not they were also identified in the old study, and constructed histograms of the fold changes for each group:
The fold changes are clearly skewed towards larger negative values for the Cdc28p substrates also identified by the old study relative to the proteins that were not previously identified as Cdc28p substrates. This difference is statistically significant (P < 0.01) according to the Kolmogorov-Smirnov test. This suggests that the fold changes observed in the new mass spectrometry study correlate with the likelihood that the proteins are true Cdc28p substrates.
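For readers who want to repeat this kind of comparison on their own data, here is a minimal sketch of a two-sample Kolmogorov-Smirnov test in plain Python. It is not the code used for the analysis above. The function name `ks_two_sample` is my own, the p-value uses the standard asymptotic approximation rather than an exact calculation, and the two short lists of log2 ratios are made-up placeholder values, not numbers from either study.

```python
import math
from bisect import bisect_right


def ks_two_sample(a, b):
    """Return (D, approximate p-value) for a two-sample Kolmogorov-Smirnov test."""
    a, b = sorted(a), sorted(b)
    n, m = len(a), len(b)
    # D is the largest vertical gap between the two empirical CDFs,
    # which can only occur at one of the observed values.
    d = max(abs(bisect_right(a, x) / n - bisect_right(b, x) / m) for x in a + b)
    # Asymptotic Kolmogorov distribution with a common small-sample correction.
    en = math.sqrt(n * m / (n + m))
    lam = (en + 0.12 + 0.11 / en) * d
    if lam < 0.3:
        # The series below converges slowly here; the p-value is essentially 1.
        return d, 1.0
    p = 2 * sum((-1) ** (k - 1) * math.exp(-2 * k * k * lam * lam)
                for k in range(1, 101))
    return d, min(max(p, 0.0), 1.0)


# Placeholder log2 ratios (invented for illustration only):
# substrates also found in the old study vs. substrates found only in the new one.
also_in_old = [-3.2, -2.8, -2.5, -2.1, -1.9, -1.7]
new_only = [-1.6, -1.4, -1.3, -1.2, -1.1, -1.05]

d_stat, p_value = ks_two_sample(also_in_old, new_only)
print(f"D = {d_stat:.2f}, approximate P = {p_value:.4f}")
```

With real data, the two lists would hold the log2 ratios of the two groups of substrates shown in the histograms. If SciPy is available, `scipy.stats.ks_2samp` performs the same test and is the more sensible choice for anything beyond a quick check.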
The old study gave rise to so-called P-scores for the individual proteins (not to be confused with P-values). To test if these too can be used as quality scores, I constructed an equivalent histogram in which the Cdc28p substrates found in the old study were divided into two groups based on whether or not they were also found in the new study:
In this case, no obvious trend is seen, and a Kolmogorov-Smirnov test indeed reveals no statistically significant difference between the two distributions. Surprisingly, the P-scores thus do not appear to be useful quality scores for the putative Cdc28p substrates.
Given the two sets of putative Cdc28p substrates, only one of which can be ranked by reliability, how can we create a better combined set? If one aims for high accuracy at the price of low coverage, one could obviously choose to trust only the substrates identified by both screens. However, given the caveats regarding depth of mass spectrometry and biases arising from the enrichment procedure, I would be hesitant to use this approach. Alternatively, one could aim for maximal coverage at the price of accuracy by trusting all substrates identified by either study. However, seeing the large fraction of novel substrates identified by Holt et al. with a log2-ratio only slightly below -1, I would personally tend to apply a more stringent threshold to the data from the new study, for example requiring a log2-ratio below -2, before merging the sets of substrates from the two studies.
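The three merging strategies discussed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `merge_substrates`, the protein identifiers (`ProtA` to `ProtF`) and the log2 ratios are all hypothetical placeholders, and I assume that each protein from the new study is summarized by the log2 ratio of its most strongly affected site.

```python
def merge_substrates(old_hits, new_log2, lenient=-1.0, stringent=-2.0):
    """Build three candidate substrate sets from the two screens.

    old_hits: set of protein IDs called substrates in the old (in vitro) study.
    new_log2: dict mapping protein ID to its log2 ratio in the new study
              (assumed here to be the most negative ratio among its sites).
    """
    # Substrates at the published cutoff (log2 ratio below -1).
    new_hits = {prot for prot, ratio in new_log2.items() if ratio < lenient}
    # Substrates passing the more stringent cutoff suggested in the text.
    strict_new = {prot for prot, ratio in new_log2.items() if ratio < stringent}
    return {
        "intersection": old_hits & new_hits,       # high accuracy, low coverage
        "union": old_hits | new_hits,              # high coverage, lower accuracy
        "stringent_union": old_hits | strict_new,  # compromise between the two
    }


# Hypothetical placeholder data, not taken from either study.
old_hits = {"ProtA", "ProtB", "ProtC"}
new_log2 = {"ProtA": -3.1, "ProtB": -1.2, "ProtD": -1.1, "ProtE": -2.6, "ProtF": -0.4}

merged = merge_substrates(old_hits, new_log2)
for name, proteins in merged.items():
    print(name, sorted(proteins))
```

In this toy example, `ProtD` is dropped from the stringent union because its log2 ratio is only slightly below -1 and it lacks support from the old study, whereas `ProtB` is kept because the old study supports it.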