
Analysis: Automatic recognition of Human Phenotype Ontology terms in text

This afternoon, an article entitled “Automatic concept recognition using the Human Phenotype Ontology reference and test suite corpora” showed up in my RSS reader. It describes a new gold-standard corpus for named entity recognition of Human Phenotype Ontology (HPO) terms. The article also presents results from evaluating three automatic HPO term recognizers, namely NCBO Annotator, OBO Annotator and Bio-LarK CR.

I thought it would be a fun challenge to see how good an HPO tagger I could produce in one afternoon. Long story short, here is what I did in five hours:

  • Downloaded the HPO ontology file and converted it to dictionary files for the tagger.
  • Generated orthographic variants of terms by changing the order of subterms, converting between Arabic and Roman numerals, and constructing plural forms.
  • Used the tagger to match the resulting dictionary against the entire Medline to identify frequently occurring matches.
  • Constructed a list of stop words by manually inspecting all matching strings with more than 25,000 occurrences in PubMed.
  • Tagged the gold-standard corpus making use of the dictionary and stop-words list and compared the results to the manual reference annotations.
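The variant-generation step can be sketched roughly as follows. This is a minimal illustration and not the code I actually ran: the function names, the small numeral table, the comma-based reordering rule and the naive plural rule are all assumptions made for the example.

```python
# Illustrative numeral table (an assumption); a real dictionary needs more entries.
ARABIC_TO_ROMAN = {"1": "I", "2": "II", "3": "III", "4": "IV", "5": "V"}
ROMAN_TO_ARABIC = {roman: arabic for arabic, roman in ARABIC_TO_ROMAN.items()}


def swap_numerals(term):
    """Return copies of the term with one standalone numeral converted."""
    tokens = term.split()
    result = set()
    for i, token in enumerate(tokens):
        replacement = ARABIC_TO_ROMAN.get(token) or ROMAN_TO_ARABIC.get(token)
        if replacement:
            result.add(" ".join(tokens[:i] + [replacement] + tokens[i + 1:]))
    return result


def reorder_subterms(term):
    """Turn 'B, A' into 'A B', assuming subterms are separated by one comma."""
    parts = [part.strip() for part in term.split(",")]
    if len(parts) == 2 and all(parts):
        return {f"{parts[1]} {parts[0]}"}
    return set()


def pluralize(term):
    """Naive English plural of the last word; irregular plurals are ignored."""
    last = term.split()[-1]
    if "," in term or not last.isalpha() or last in ROMAN_TO_ARABIC:
        return set()
    if last.endswith("s"):
        return set()
    if last.endswith("y") and len(last) > 1 and last[-2] not in "aeiou":
        return {term[:-1] + "ies"}
    return {term + "s"}


def variants(term):
    """Collect the term plus everything reachable by applying the three rules."""
    seen = {term}
    queue = [term]
    while queue:
        current = queue.pop()
        for rule in (swap_numerals, reorder_subterms, pluralize):
            for variant in rule(current):
                if variant not in seen:
                    seen.add(variant)
                    queue.append(variant)
    return seen


print(sorted(variants("hypoplasia, renal")))
```

Rules this simple overgenerate, which is one reason the stop-word list in the later steps is needed.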

My tagger produced 1183 annotations on the corpus, 786 of which correspond to the 1933 human annotations (requiring exact coordinate matches and HPO term normalization). This amounts to a precision of 66%, a recall of 41%, and an F1 score of 50%. This places my system right in the middle between NCBO Annotator (precision=54%, recall=39%, F1=45%) and the best performing system Bio-LarK CR (65% precision, 49% recall, F1=56%).
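For readers who want to check the arithmetic, here is a small sketch of how these scores follow from the three counts. The tuple layout used for matching (document, start, end, HPO identifier) and the example annotations are assumptions for illustration; only the counts 786, 1183 and 1933 come from the evaluation above.

```python
def count_exact_matches(predicted, gold):
    """Count annotations that agree on document, coordinates and HPO term."""
    return len(set(predicted) & set(gold))


def precision_recall_f1(true_positives, n_predicted, n_gold):
    """Standard definitions of precision, recall and their harmonic mean."""
    precision = true_positives / n_predicted
    recall = true_positives / n_gold
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical annotations: (document id, start, end, HPO identifier).
predicted = [("doc1", 10, 25, "HP:0000001"), ("doc1", 40, 52, "HP:0000002")]
gold = [("doc1", 10, 25, "HP:0000001"), ("doc1", 40, 55, "HP:0000002")]
matches = count_exact_matches(predicted, gold)  # second one differs in end offset

# Counts reported above for the full corpus.
p, r, f = precision_recall_f1(786, 1183, 1933)
print(f"precision={p:.0%} recall={r:.0%} F1={f:.0%}")
# prints "precision=66% recall=41% F1=50%"
```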

Not too shabby for five hours of work — if I may say so myself — and a good reminder of how much can be achieved in very limited time by taking a simple, pragmatic approach.

Announcement: From genomes to cells and systems

Later this year Peer Bork, Jeroen Raes, Roland Krause, David Torrents, and I will be organizing the EMBO practical course “Computational biology: From genomes to cells and systems”. It will take place October 14-20 in L’Escala, Girona, Catalonia.

In times when high-throughput data are the norm rather than the exception, computational skills to turn masses of data into tangible biological insights have become crucial. This course will teach advanced computational methods for analysis of high-throughput data in molecular biology, covering both inter-individual and inter-species variation in (meta-)genomes and linking it to clinical applications. The course will span protein and pathway level variation from single genomes to entire microbial communities.

To participate in this course, fill in the online application form by July 31, 2012 at the latest. The registration fee is 250 euros for participants from academia and 600 euros for participants from industry.

Analysis: Cell-cycle phenotypes and regulation, part 2

I have previously blogged about the relationship between cell-cycle phenotypes and regulation in human as well as budding yeast. I was thus excited to see the new RNAi study on cell-cycle phenotypes by Rines and coworkers that was published in Genome Biology two days ago. The title of their paper is “Whole genome functional analysis identifies novel components required for mitotic spindle integrity in human cells”, and the abstract reads as follows:


The mitotic spindle is a complex mechanical apparatus required for accurate segregation of sister chromosomes during mitosis. We designed a genetic screen using automated microscopy to discover factors essential for mitotic progression. Using a RNAi library of 49,164 double-stranded (ds)RNAs targeting 23,835 human genes, we performed a loss-of-function screen looking for siRNAs that arrest cells in metaphase.


Here we report the identification of genes that when suppressed result in structural defects in the mitotic spindle leading to bent, twisted, monopolar or multipolar spindles and cause a cell cycle arrest. We further described a novel analysis methodology for large-scale RNAi datasets which relies upon supervised clustering of these genes based on gene ontology (GO), protein families, tissue expression and protein-protein interactions.


This approach was utilized to functionally classify the identified genes in discrete mitotic processes. We confirmed the identity for a subset of these genes and examined more closely their mechanical role in spindle architecture.

The screen identified a set of 226 genes that when suppressed lead to spindle-related cell-cycle phenotypes. Using the name-mapping files from STRING, I was able to map 175 of them to the set of genes used in my other cell-cycle analyses. The results presented below are all based on this set of 175 genes.

To my surprise, Rines and coworkers did not compare their results to the earlier phenotypic screen published by Mukherji et al. in PNAS. Since I had already mapped this dataset onto the same gene set, it was easy to make a comparison of the new phenotype data from Rines et al. and the eight phenotypic categories defined by Mukherji et al.:

Category  Description              Overlap  Significance
1         G1 small nuclear area    2/116    n.s.
2         G1                       2/117    n.s.
3         S                        1/61     n.s.
4         S + G2/M                 4/59     P < 0.002; FDR < 1%
5         G2/M large nucleus       5/200    P < 0.019; FDR < 5%
6         G2/M                     4/259    n.s.
7         G2/M + endoduplication   1/52     n.s.
8         Cytokinesis              3/36     P < 0.003; FDR < 1%

The statistical significance of the overlap was assessed using Fisher’s exact test and the false discovery rate (FDR) was calculated using the Benjamini-Hochberg method. As can be seen, the agreement between the two studies is very poor. Nonetheless, it is reassuring that the largest overlap (>8%) is observed for category 8, since spindle defects should be expected to result in problems during cytokinesis.
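As a rough illustration of the multiple-testing step, the Benjamini-Hochberg adjustment can be written in a few lines. This is a generic sketch of the standard procedure, not the script used for the table; the p-values in the example are made up and do not correspond to the eight categories.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the adjusted
    # values never increase as the raw p-values decrease.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical p-values, for illustration only.
example_p = [0.01, 0.04, 0.03, 0.005]
example_q = benjamini_hochberg(example_p)
significant = [q < 0.05 for q in example_q]
```

A test is then called significant at a given FDR if its adjusted value falls below that threshold.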

I also looked into the transcriptional and post-translational regulation of the 175 genes. The cell-cycle microarray study by Whitfield and coworkers covered 124 of the genes, 15 of which are periodically expressed (P < 0.002; Fisher’s exact test). Plotting the distribution of peak times for these genes confirms the observation by Rines et al. that the genes tend to be expressed around the G2/M transition and during M phase:

Peak time distributions for human genes identified by Rines et al. and Mukherji et al.

As should be expected, the peak-time distribution for the genes identified by Rines et al. is in agreement with the corresponding distributions for categories 4, 5, and 8 from Mukherji and coworkers.

Comparison with a set of 985 phosphoproteins identified in low-throughput studies (obtained from Phospho.ELM) shows that the protein products encoded by the 175 genes are preferentially phosphorylated (P < 0.001; Fisher’s exact test). This result is confirmed by comparisons with large mass-spectrometry studies (P < 0.03; Fisher’s exact test) and CDK substrates predicted by NetPhosK (P < 0.05; Fisher’s exact test).

Finally, I analyzed the protein products encoded by the 175 genes for degradation signals. 22 of them contain a strong D-box motif (P < 0.03; Fisher’s exact test) and 28 contain a KEN-box motif (P < 0.002; Fisher’s exact test). By contrast, the gene products identified by Rines et al. display no overrepresentation of PEST degradation signals. This makes sense since proteins with D-box and/or KEN-box motifs are polyubiquitinated by the anaphase-promoting complex (APC) during late M phase, which targets them for degradation by the proteasome.

In summary, Rines and coworkers have identified a set of genes that show weak but significant overlap with some of the phenotypic categories defined by Mukherji et al., with periodically expressed genes identified based on microarray data from Whitfield et al., with known and predicted phosphoproteins, and with predicted degradation signals. All of the results are consistent with the majority of the 175 genes functioning during G2/M and early M phase.


Analysis: Cell-cycle phenotypes and regulation

In 2006 the Schultz lab at the Scripps Research Institute published a paper in PNAS called “Genome-wide functional analysis of human cell-cycle regulators”. The abstract reads:

Human cells have evolved complex signaling networks to coordinate the cell cycle. A detailed understanding of the global regulation of this fundamental process requires comprehensive identification of the genes and pathways involved in the various stages of cell-cycle progression. To this end, we report a genome-wide analysis of the human cell cycle, cell size, and proliferation by targeting >95% of the protein-coding genes in the human genome using small interfering RNAs (siRNAs). Analysis of >2 million images, acquired by quantitative fluorescence microscopy, showed that depletion of 1,152 genes strongly affected cell-cycle progression. These genes clustered into eight distinct phenotypic categories based on phase of arrest, nuclear area, and nuclear morphology. Phase-specific networks were built by interrogating knowledge-based and physical interaction databases with identified genes. Genome-wide analysis of cell-cycle regulators revealed a number of kinase, phosphatase, and proteolytic proteins and also suggests that processes thought to regulate G1-S phase progression like receptor-mediated signaling, nutrient status, and translation also play important roles in the regulation of G2/M phase transition. Moreover, 15 genes that are integral to TNF/NF-κB signaling were found to regulate G2/M, a previously unanticipated role for this pathway. These analyses provide systems-level insight into both known and novel genes as well as pathways that regulate cell-cycle progression, a number of which may provide new therapeutic approaches for the treatment of cancer.

I recently wrote a commentary about how phenotypes in yeast agree remarkably well with the just-in-time assembly hypothesis for cell-cycle regulation of protein complexes. I thus decided to also compare the dataset on cell-cycle phenotypes for human genes with the cell-cycle microarray expression data published in 2002 by Whitfield and coworkers.

Using the mapping files from the STRING database, I was able to automatically map 741 of the 1152 genes with cell-cycle phenotypes to the set of 12,097 genes for which we have cell-cycle microarray expression data. Of the 741 genes, 55 are among the set of 600 periodically expressed genes identified in a reanalysis of the data from Whitfield and coworkers. This is just shy of 50% more than what would be expected by random chance (P < 0.001; Fisher’s exact test).
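The expected overlap and an enrichment P-value can be recomputed from the four counts above. The sketch below uses the one-sided form of Fisher’s exact test, which is the upper tail of the hypergeometric distribution; whether the test reported above was one-sided or two-sided is an assumption on my part, so the exact value may differ somewhat. The function name is made up for the example.

```python
from math import comb


def enrichment_p_value(overlap, sample, successes, population):
    """P(X >= overlap) when drawing `sample` genes without replacement from
    `population` genes, of which `successes` belong to the set of interest."""
    total = comb(population, sample)
    tail = sum(
        comb(successes, k) * comb(population - successes, sample - k)
        for k in range(overlap, min(sample, successes) + 1)
    )
    return tail / total


population = 12097  # genes with cell-cycle expression data
periodic = 600      # periodically expressed genes
mapped = 741        # genes with a cell-cycle phenotype that could be mapped
observed = 55       # genes in both sets

expected = mapped * periodic / population
fold = observed / expected
p_value = enrichment_p_value(observed, mapped, periodic, population)
print(f"expected={expected:.1f} fold={fold:.2f} one-sided P={p_value:.2g}")
```

The expected overlap comes out at about 36.8 genes, so the observed 55 is roughly 1.5 times the chance expectation, in line with the "just shy of 50% more" stated above.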

The authors divided the cell-cycle mutants into eight classes. Repeating the above analysis for each of these categories separately revealed that genes with phenotypes related to S-phase and cytokinesis were significantly overrepresented among the 600 periodically expressed genes (FDR < 0.05; Fisher’s exact test and Benjamini-Hochberg correction for multiple testing). The other categories did not yield statistically significant results.

To look at the temporal regulation of transcription in more detail, I plotted the distribution of peak times (the point in the cell cycle when a gene is maximally expressed) for the periodically expressed genes from each of the eight phenotypic categories:

Peak time distributions for human genes with cell-cycle-related phenotypes

For the periodically expressed genes that display a cell-cycle phenotype in the screen by Schultz and coworkers, the observed phenotypes agree with the time of peak expression. In particular, the genes with cytokinesis-related phenotypes are all expressed shortly before the time of cell division (cytokinesis). Most of the periodically expressed genes with phenotypes related to S phase are similarly expressed during S phase (roughly 50-70% into the cell cycle), and genes with phenotypes related to the G2/M transition also tend to be expressed during the appropriate phase of the cell cycle.

In summary, these results support the view that cell-cycle-regulated genes are expressed shortly before their time of action, despite the fact that regulation also takes place at the protein level. They also confirm that many genes with cell-cycle function are not subject to transcriptional cell-cycle regulation.


Commentary: Does just-in-time assembly of protein complexes explain phenotypes?

At the beginning of this year, Ben Lehner’s lab published a beautiful study in BMC Systems Biology with the title “A simple principle concerning the robustness of protein complex activity to changes in gene expression”. The abstract reads:


The functions of a eukaryotic cell are largely performed by multi-subunit protein complexes that act as molecular machines or information processing modules in cellular networks. An important problem in systems biology is to understand how, in general, these molecular machines respond to perturbations.


In yeast, genes that inhibit growth when their expression is reduced are strongly enriched amongst the subunits of multi-subunit protein complexes. This applies to both the core and peripheral subunits of protein complexes, and the subunits of each complex normally have the same loss-of-function phenotypes. In contrast, genes that inhibit growth when their expression is increased are not enriched amongst the core or peripheral subunits of protein complexes, and the behaviour of one subunit of a complex is not predictive for the other subunits with respect to over-expression phenotypes.


We propose the principle that the overall activity of a protein complex is in general robust to an increase, but not to a decrease in the expression of its subunits. This means that whereas phenotypes resulting from a decrease in gene expression can be predicted because they cluster on networks of protein complexes, over-expression phenotypes cannot be predicted in this way. We discuss the implications of these findings for understanding how cells are regulated, how they evolve, and how genetic perturbations connect to disease in humans.

It struck me that these observations can all be explained by the just-in-time assembly model for temporal regulation of protein complex assembly, which I developed together with members of Søren Brunak’s group. For a long explanation and discussion of the model see our paper “Evolution of Cell Cycle Control: Same Molecular Machines, Different Regulation”. For the short version see the figure below, which shows how cell-cycle regulation of just a single subunit is sufficient to control when during the cell cycle a complex is active:

The just-in-time assembly hypothesis

What will happen if you knock down the expression of one subunit of a complex? The maximal number of complete complexes that can be assembled will be reduced, irrespective of whether the subunit is dynamic or static. Whether this results in a given phenotype depends on the function of the complex. However, the effect should in principle be the same for different subunits of the same complex, which is exactly what Lehner and coworkers observed.

What if you instead overexpress one subunit of a complex? For a static subunit it should not really matter; the maximal number of complete complexes that can be assembled is unchanged. On the other hand, overexpression of a dynamic subunit may cause the complex to become constitutively active, which could have disastrous consequences for the cell. Overexpression of dynamic and static subunits of the same complex should thus give rise to different phenotypic effects. This would explain the observation by Lehner and coworkers that subunits of the same complex often have different overexpression phenotypes.
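The argument in the last two paragraphs can be made concrete with a toy model in which the number of complete complexes at any time is limited by the scarcest subunit. This is a deliberately simplified sketch: the subunit names, the abundance values and the six time points are all hypothetical, and real complexes need not follow such a strict minimum rule.

```python
def complexes_over_time(levels):
    """Complete complexes per time point, limited by the scarcest subunit."""
    return [min(point) for point in zip(*levels.values())]


def perturb(levels, subunit, new_profile):
    """Return a copy of the abundance table with one subunit replaced."""
    changed = dict(levels)
    changed[subunit] = new_profile
    return changed


# Hypothetical abundances at six time points of one cell cycle.
wild_type = {
    "static_A": [10, 10, 10, 10, 10, 10],
    "static_B": [10, 10, 10, 10, 10, 10],
    "dynamic_C": [0, 0, 8, 8, 0, 0],  # periodically expressed subunit
}

baseline = complexes_over_time(wild_type)
knockdown_static = complexes_over_time(perturb(wild_type, "static_A", [4] * 6))
knockdown_dynamic = complexes_over_time(
    perturb(wild_type, "dynamic_C", [0, 0, 4, 4, 0, 0])
)
overexpress_static = complexes_over_time(perturb(wild_type, "static_A", [20] * 6))
overexpress_dynamic = complexes_over_time(perturb(wild_type, "dynamic_C", [20] * 6))

for name, profile in [
    ("wild type", baseline),
    ("knock-down, static subunit", knockdown_static),
    ("knock-down, dynamic subunit", knockdown_dynamic),
    ("overexpression, static subunit", overexpress_static),
    ("overexpression, dynamic subunit", overexpress_dynamic),
]:
    print(f"{name:32s} {profile}")
```

In this toy setting both knock-downs lower the peak number of complexes in the same way, overexpressing the static subunit changes nothing, and constitutive overexpression of the dynamic subunit makes the complex present throughout the cycle.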

If this hypothesis is true, genes that lead to phenotypic effects when overexpressed should preferentially encode dynamic proteins, i.e. many of the genes should be periodically expressed. In fact, this correlation between overexpression phenotype and cell-cycle regulation was already described by the Hughes, Boone and Andrews labs who originally published the dataset on overexpression phenotypes (for details see their paper in Molecular Cell):

Genes expressed periodically during the cell cycle (de Lichtenberg et al., 2005) were more likely to show an overexpression phenotype (p = 0.017), and in particular, this tended to cause abnormal morphology [p < 10^-13] or cell cycle arrest [p < 10^-14] (Table S3). When the analysis is limited to genes known to function in the mitotic cell cycle, we still find that overexpression of periodically expressed genes is more likely to cause cell cycle arrest (p = 0.008) or abnormal morphology (p = 0.006) than constitutively expressed cell cycle genes (Table S3), indicating that unscheduled expression of genes that are usually expressed periodically often leads to toxicity.

The results of the two papers thus suggest that the just-in-time assembly hypothesis can explain the qualitative differences between knock-down and overexpression phenotypes.
