Tag Archives: protein interactions

Announcement: EMBO practical course on computational biology in Heidelberg

June 2016 will likely be a highly productive month for people in my group, since I will not be there much to disturb them. Specifically, I will be involved in running two week-long EMBO practical courses.

One was announced on this blog just two days ago. The other is the equally long-running course “Computational biology: Genomes to systems”, which this year will take place on June 19–23 at the European Molecular Biology Laboratory in Heidelberg, Germany. The course will cover a wide range of advanced computational biology topics, including protein networks (taught by STRING collaborator Christian von Mering) and biomedical text mining (taught by me).

Please note that the application deadline is less than a month away, namely on January 31.

More details can be found on the course website.

Announcement: EMBO practical course on protein interaction analysis in Budapest

Later this year, I will once again be one of the teachers on the long-running EMBO practical course “Computational analysis of protein-protein interactions: Sequences, networks and diseases”. The 2016 version of the course will be taking place on May 30 – June 4 in Budapest, Hungary, and the application deadline is February 1.

For more details see the course website or the poster below.


Exercise: Web services

The aim of this practical is to introduce you to the concept of web services as well as to a few useful standard command-line tools and how one can pipe data from one tool into another. Web services are, simply put, websites that are meant to be used by computers rather than humans.

Fetching a URL from the command line

The previous exercises used this article to illustrate named entity recognition. If you want to work with it outside the web browser, you will want to change two things: 1) you will probably not want to work with an HTML web page, but rather retrieve it in XML format, and 2) you will want to retrieve the article with something other than a web browser:

curl 'http://journals.plos.org/plosone/article/asset?id=10.1371/journal.pone.0132736.XML'

Submitting text to the tagger

In the NER practical, you used a web service for NER; however, the complexity was hidden from you in the EXTRACT bookmarklet. The bookmarklet works by sending text from your web browser to a remote tagging web service and subsequently displaying the results.

Let us start by looking behind the curtain and seeing how an EXTRACT popup is produced. When you select the header of the article and click the bookmarklet, your browser retrieves the following page to show in the popup:


As you can see, the URL contains data, namely the text to be tagged as well as information on which types of named entities we want to have recognized in the text.

You can retrieve the same information in a tab-delimited format, which is far more useful for computational purposes:


If you want, you can use the curl command to retrieve the same data from the command line.

Retrieving a protein network

Bioinformatics web services are not limited to text mining. For example, the STRING database of protein interactions can also be accessed as a web service. The following URL gives you an interaction network for BCL11B as an image:


Modifying it just slightly allows you to retrieve the same interactions in PSI-MI-TAB format:

You can obtain the exact same data on the command line by running this command:

curl 'http://string-db.org/api/psi-mi-tab/interactions?identifier=ENSP00000349723'

Putting it all together

Using pipes, it is possible to put together multiple different web services and local programs to accomplish complex tasks. Here is an example that puts together everything you have learned above:

curl 'http://journals.plos.org/plosone/article/asset?id=10.1371/journal.pone.0132736.XML' | curl --data-urlencode 'document@-' --data-urlencode 'entity_types=9606' --data-urlencode 'format=tsv' 'http://tagger.jensenlab.org/GetEntities' | cut -f3 | sort -u | grep '^ENSP' | curl --data-urlencode 'identifiers@-' --data-urlencode 'limit=0' 'http://string-db.org/api/psi-mi-tab/interactionsList' > string_network.tsv

Let us pick apart this monstrosity of a command and see what it does:

  • The first curl command fetches a full-text article from PLOS ONE in XML format
  • The second curl command submits this document to the tagger REST web service, to perform named entity recognition of human genes/proteins
  • The cut command pulls out only column three from the resulting output, which contains the identifiers of the recognized entities
  • The sort -u command sorts these identifiers and removes duplicates
  • The grep command finds only the identifiers that start with “ENSP”, which are the proteins
  • The third curl command submits this list of protein identifiers to the STRING database to retrieve a protein interaction network of them in PSI-MI-TAB format
  • Finally, we put that network into a file called string_network.tsv on our server.

In other words, with a single pipe of commands that interacts with three different servers we manage to retrieve a full-text article, perform named entity recognition of human proteins and obtain protein interactions among them. Note that although this is possible, it will often be desirable to store some of the intermediate results in files instead of using pipes.

By slightly modifying the command, it is possible to instead retrieve this as an image:

curl 'http://journals.plos.org/plosone/article/asset?id=10.1371/journal.pone.0132736.XML' | curl --data-urlencode 'document@-' --data-urlencode 'entity_types=9606' --data-urlencode 'format=tsv' 'http://tagger.jensenlab.org/GetEntities' | cut -f3 | sort -u | grep '^ENSP' | curl --data-urlencode 'identifiers@-' --data-urlencode 'limit=0' --data-urlencode 'network_flavor=confidence' 'http://string-db.org/api/image/networkList' > string_network.png

STRING network
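As noted above, it is often preferable to keep intermediate results in files rather than piping everything. The following is a minimal sketch of how the PSI-MI-TAB pipeline could be split into stages. The function names (extract_ensp, run_pipeline) and the file names article.xml, entities.tsv and proteins.txt are made up for illustration; the URLs and parameters are copied from the commands above, and the sketch does not check whether those services still respond.

```shell
# Hypothetical helper: pull the unique Ensembl protein identifiers
# (column three, starting with "ENSP") out of a tagger TSV file.
extract_ensp() {
    cut -f3 "$1" | sort -u | grep '^ENSP'
}

# Hypothetical wrapper: the same steps as the one-line pipeline, but each
# intermediate result is written to a file (names are assumptions).
# The chain stops at the first failing step; for example, if no ENSP
# identifiers are found, grep fails and STRING is never queried.
run_pipeline() {
    curl 'http://journals.plos.org/plosone/article/asset?id=10.1371/journal.pone.0132736.XML' > article.xml &&
    curl --data-urlencode 'document@article.xml' \
         --data-urlencode 'entity_types=9606' \
         --data-urlencode 'format=tsv' \
         'http://tagger.jensenlab.org/GetEntities' > entities.tsv &&
    extract_ensp entities.tsv > proteins.txt &&
    curl --data-urlencode 'identifiers@proteins.txt' \
         --data-urlencode 'limit=0' \
         'http://string-db.org/api/psi-mi-tab/interactionsList' > string_network.tsv
}
```

Calling run_pipeline leaves article.xml, entities.tsv and proteins.txt behind, so a failing step can be inspected or rerun without repeating the earlier requests.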

Announcement: EMBO practical course on protein interaction analysis in South Africa

I very much look forward to once again being part of the team of teachers behind the EMBO practical course “Computational analysis of protein-protein interactions: From sequences to networks”. This time it will for the first time take place on the African continent, more specifically in Cape Town, South Africa. The course will take place from September 23 to October 3 and the application deadline is July 23.

Please check the course website or the poster below for details.

Course poster

Job: Ph.D. stipend in systems biology and bioinformatics

I currently have an open position for a Ph.D. student in my group Cellular Network Biology. My group is part of the Novo Nordisk Foundation Center for Protein Research (CPR) and is financially supported by the Faculty of Health and Medical Sciences, University of Copenhagen.

The project will primarily focus on developing new, improved methodologies for analysis of large-scale datasets, e.g. from mass spectrometry, in the context of protein interaction networks, protein localization, and expression. In doing so, the aim is both to test scientific hypotheses and to improve existing resources developed within the group, such as STRING and COMPARTMENTS. Candidates are thus expected to have experience with programming and statistics.

The closing date for applications is June 30, 2014. For further details refer to the job advert.

Resource: Antibodypedia bulk download file and STRING payload

Antibodypedia, developed by Antibodypedia AB and Nature Publishing Group, is a very useful resource for finding commercially available antibodies against human proteins.

The resource is made available under the Creative Commons Attribution-NonCommercial 3.0 license, which allows for reuse and redistribution of the data for non-commercial purposes. However, the data are only available for browsing through a web interface, which greatly limits systems biology uses of the resource. I thus wrote a robot to scrape all information from the web resource and convert it into a convenient tab-delimited file, which I have made available for download under the same license. This dataset covers a total of 579,038 antibodies against 16,827 human proteins.

To be able to use the dataset in conjunction with STRING and related resources, I next mapped the proteins to STRING protein identifiers. I was able to map 92% of all proteins in Antibodypedia. Having done this, I created the necessary files for the STRING payload mechanism to be able to show the information from Antibodypedia directly within STRING.

The end result looks like this when searching for the WNT7A protein:

Antibodypedia STRING network

The halos around the proteins encode the type and number of antibodies available. Red rings imply that at least one monoclonal antibody exists whereas gray rings imply that only polyclonal antibodies exist. The darker the ring (be it red or gray), the more different antibodies are available.

The STRING payload mechanism also extends the popups with additional information, here shown for LRP6:

Antibodypedia STRING popup

The popup shows the total number of antibodies available and how many of them are monoclonal. It also provides a direct linkout to the relevant protein page on Antibodypedia.

Please feel free to use this Antibodypedia-STRING mashup.

Announcement: Computational analysis of protein-protein interactions for bench biologists

Once again I will be one of the teachers on an EMBO Practical Course. This time we will be teaching wet-lab biologists about how to do computational analysis of protein-protein interactions. The course will take place September 2-8 at the Max Delbrück Center for Molecular Medicine in Berlin, Germany.

The course aims to help bench scientists become more effective at exploiting the wide range of commonly-used databases and bioinformatics tools that can be used to identify, understand, and predict protein interactions by analyzing their structure, sequences, and other features.

The target group for the course is experimental scientists who need to analyse interaction data in their work and who have limited experience using bioinformatics tools and resources. The course covers analyses and tools that are applied after potential interactions have been identified. It does not cover analysis of the raw data from, for example, mass spectrometry.

To apply for the course, fill in the online application form. The registration deadline is Friday June 15th 2012. The course fee is 200 euros for academics and 1000 euros for scientists from industry.

Analysis: Three-dimensional DNA structure

A few months ago Bill Noble’s lab at University of Washington published a letter in Nature on a three-dimensional model of the complete nuclear genome of budding yeast:

A three-dimensional model of the yeast genome

Layered on top of information conveyed by DNA sequence and chromatin are higher order structures that encompass portions of chromosomes, entire chromosomes, and even whole genomes. Interphase chromosomes are not positioned randomly within the nucleus, but instead adopt preferred conformations. Disparate DNA elements co-localize into functionally defined aggregates or ‘factories’ for transcription and DNA replication. In budding yeast, Drosophila and many other eukaryotes, chromosomes adopt a Rabl configuration, with arms extending from centromeres adjacent to the spindle pole body to telomeres that abut the nuclear envelope. Nonetheless, the topologies and spatial relationships of chromosomes remain poorly understood. Here we developed a method to globally capture intra- and inter-chromosomal interactions, and applied it to generate a map at kilobase resolution of the haploid genome of Saccharomyces cerevisiae. The map recapitulates known features of genome organization, thereby validating the method, and identifies new features. Extensive regional and higher order folding of individual chromosomes is observed. Chromosome XII exhibits a striking conformation that implicates the nucleolus as a formidable barrier to interaction between DNA sequences at either end. Inter-chromosomal contacts are anchored by centromeres and include interactions among transfer RNA genes, among origins of early DNA replication and among sites where chromosomal breakpoints occur. Finally, we constructed a three-dimensional model of the yeast genome. Our findings provide a glimpse of the interface between the form and function of a eukaryotic genome.

Having previously worked with predicted 3D structure of DNA, such as intrinsic curvature, I was intrigued by the availability of a 3D structure of a complete eukaryotic genome. Based on past analyses of 1D distances in DNA, I expected that the 3D distance between two genes in the genome would correlate with expression, protein interactions, and metabolic pathways.

To test if 3D neighborhood correlates with function and/or regulation, I collected three large sets of protein pairs, namely pairs of co-expressed genes from the STRING database (Pearson correlation coefficient >0.7), interacting protein pairs from the BioGRID database, and pairs of genes assigned to the same pathway by the KEGG database. I subsequently mapped these onto the set of 3D neighbors listed in the supplementary information of the paper, including only 3D neighbors on different chromosomes (in order to eliminate correlations caused by 1D rather than 3D distance). I also mapped the three sets of gene pairs onto a shuffled version of the 3D neighbors, in order to estimate the overlaps that can be expected at random. The results are summarized in the table below:

                         3D neighbors   Shuffled neighbors
Coexpressed (STRING)               58                   61
Interacting (BioGRID)            2151                 2122
Same pathway (KEGG)               357                  344

To make a long story short, the numbers show that 3D genomic neighbors appear to be no more likely to be coexpressed, to interact, or to be involved in the same pathway than random pairs. It could be that the way I performed the analysis is too simplistic or that the data are too noisy to show a signal. However, it is also possible that the 3D structural organization of the genome simply doesn’t have much impact on gene regulation and function.
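For readers who want to try a similar comparison, the counting step could be sketched roughly as follows. This is not the script used for the analysis above. The function name count_overlap and the input layout (two tab-delimited files with one gene pair per line) are assumptions made for illustration, and producing the shuffled neighbor file is left out.

```shell
# Hypothetical helper: count how many pairs in the second file also occur
# in the first file, treating "A<TAB>B" and "B<TAB>A" as the same pair.
# Assumes both files are tab-delimited with one pair per line; a pair that
# is listed twice in the second file is counted twice.
count_overlap() {
    awk -F '\t' '
        # Build an order-independent key for a pair of gene names.
        function key(a, b) { return (a < b) ? a SUBSEP b : b SUBSEP a }
        # First file: remember every pair.
        NR == FNR { pairs[key($1, $2)] = 1; next }
        # Second file: count the pairs seen in the first file.
        (key($1, $2) in pairs) { n++ }
        END { print n + 0 }
    ' "$1" "$2"
}
```

Running it once with the real 3D neighbors and once with a shuffled version, for example count_overlap neighbors.tsv coexpressed.tsv, would give the two numbers needed for one row of a table like the one above (both file names are placeholders).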

Analysis: Markov clustering and the case of the unsupported protein complexes

In 2006, Krogan and coworkers published a paper in Nature describing a global analysis of protein complexes in budding yeast. This resulted in a network of 7,123 protein-protein interactions involving 2,708 proteins, which was organized into 547 protein complexes using the Markov clustering algorithm.

Considering my previous two posts, it probably comes as a surprise to nobody that I wanted to check if the issue of unnatural clusters also affected this study. Albert Palleja, a postdoc in my group, thus extracted the 547 sub-networks corresponding to the protein complexes and applied single-linkage clustering to check if all clusters corresponded to connected sub-networks.

It turned out that 9 of the 547 protein complexes do not correspond to connected sub-networks in the protein interaction network that formed the basis for the clustering. Two complexes each contain two additional subunits that have no interactions with any of the other subunits of the proposed complex, five complexes contain one additional subunit with no interactions to other subunits, and two complexes are proposed hetero-dimers made up of subunits that do not interact according to the interaction network. These complexes are visualized in the figure below with the erroneous subunits highlighted in red:

To check if these additional subunits are in any way supported by the experimental data presented in the paper, I downloaded the set of raw purifications from the Krogan Lab Interactome Database. For 4 of the 9 complexes, the additional subunits are weakly supported by at least one purification. It should be noted, however, that this evidence was not judged to be sufficiently reliable by the authors themselves to include the interaction in the core network based on which the complexes were derived.

To make a long story short, this analysis shows that 9 of the 547 protein complexes published by Krogan and coworkers contain one or more subunits that are not supported by the interaction network from which the complexes were derived. Of these, 5 complexes contain subunits that have no support in the underlying experimental data, and which are purely artifacts of using the MCL algorithm without enforcing that clusters must correspond to connected sub-networks.
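A connectivity check of this kind can be sketched in a few lines of awk. This is only an illustration, not the code used for the analysis above: it uses a simple union-find to detect connected components, and the function name check_clusters as well as the two tab-delimited input layouts are assumptions. It also assumes that every protein belongs to exactly one cluster.

```shell
# Hypothetical helper: check_clusters EDGES CLUSTERS
#   EDGES:    tab-delimited lines "proteinA<TAB>proteinB"
#   CLUSTERS: tab-delimited lines "cluster_id<TAB>protein"
# Prints the ids of clusters whose members are not all connected to each
# other through edges between members of that same cluster.
check_clusters() {
    awk -F '\t' '
        # Follow parent links until the representative of x is reached.
        function find(x) {
            while (parent[x] != x) x = parent[x]
            return x
        }
        # First file (clusters): record membership; each protein starts
        # out as its own component.
        NR == FNR { cluster[$2] = $1; parent[$2] = $2; next }
        # Second file (edges): merge the two endpoints only if both
        # belong to the same cluster.
        ($1 in cluster) && ($2 in cluster) && cluster[$1] == cluster[$2] {
            a = find($1); b = find($2)
            if (a != b) parent[a] = b
        }
        END {
            # Count distinct components per cluster.
            for (p in cluster) {
                k = cluster[p] SUBSEP find(p)
                if (!(k in seen)) { seen[k] = 1; comps[cluster[p]]++ }
            }
            # Report clusters that fall apart into several components.
            for (c in comps) if (comps[c] > 1) print c
        }
    ' "$2" "$1" | sort
}
```

A cluster is reported both when it contains an extra subunit without interactions to the rest and when it is a proposed dimer whose two subunits do not interact, which are the two kinds of cases described above.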

Resource: STRING v8.1

After months of hard work from the entire STRING team – thanks everyone – I am pleased to be able to say that STRING v8.1 has now been put into production. Here is a screenshot of the start page:

STRING 8.1 start page

This is a minor release of STRING, which means that the imported databases of microarray expression data, protein interactions, genetic interactions, and pathways as well as text-mining evidence have all been updated. We have also fixed a bug that affected the minority of bacteria that have multiple chromosomes.

Another notable feature of STRING v8.1 is the new interactive network viewer that is implemented in Adobe Flash:

STRING 8.1 network viewer

For further details please see the post on the official STRING/STITCH blog.
