Resource: The COMPARTMENTS database on protein subcellular localization

March 24, 2014

Together with collaborators in the groups of Seán O’Donoghue and Reinhard Schneider, my group has recently launched a new web-accessible database named COMPARTMENTS.

COMPARTMENTS unifies subcellular localization evidence from many sources by mapping all proteins and compartments to their STRING identifiers and Gene Ontology terms, respectively. We import curated annotations from UniProtKB and model organism databases and assign confidence scores to them based on their evidence codes. For human proteins, we similarly import and score evidence from The Human Protein Atlas. COMPARTMENTS also uses text mining to derive subcellular localization evidence from co-occurrence of proteins and compartments in Medline abstracts. Finally, we precompute subcellular localization predictions with the sequence-based methods WoLF PSORT and YLoc. For further details, please refer to our recently published paper entitled “COMPARTMENTS: unification and visualization of protein subcellular localization evidence”.

To provide a simple overview of all this information, we project the combined localization evidence for each protein onto a schematic of an animal, fungal, or plant cell:

COMPARTMENTS NR3C1

COMPARTMENTS COX1

COMPARTMENTS PSAB

You can click any of the three images above to go to the COMPARTMENTS web resource. To facilitate use in large-scale analyses, the complete datasets for major eukaryotic model organisms are available for download.


Announcement: PTMs in Cell Signaling conference

March 11, 2014

Two years ago, I was one of the organizers of the 2nd Copenhagen Bioscience Conference entitled PTMs in Cell Signaling. I think it is fair to describe it as a highly successful meeting, and it is my great pleasure to announce that we will be organizing a second meeting on the topic September 14-18, 2014.

CBC6 poster

My co-chairs Jeremy Austin Daniel, Michael Lund Nielsen, and Amilcar Flores Morales have managed to put together the following excellent lineup of invited speakers:

Alfonso Valencia, Chris Sander, David Komander, Gary Nolan, Genevieve Almouzni, Guillermo Montoya, Hanno Steen, Henrik Daub, John Blenis, John Diffley, John Tainer, Karolin Luger, Marcus Bantscheff, Margaret Goodell, Matthias Mann, Michael Yaffe, Natalie Ahn, Pedro Beltrao, Stephen Elledge, Tanya Paull, Tony Hunter, Yang Shi, Yehudit Bergman, and Yosef Shiloh.

All conference expenses are covered, which means that there will be no registration fee and no expenses for accommodation or food. You will have to cover your own travel expenses, though.

Participants will be selected based on abstract submission, which is open until June 9, 2014. For more information please see the conference website.


Commentary: Are other women a woman’s worst enemies in science?

March 10, 2014

It is clear that in science, we have a gender bias among leaders. It is my impression that most people think this is due to a combination of men and women having different priorities in life and high-ranking male professors favoring their own gender. Conversely, I have never heard anyone dare to suggest that women may be their own worst enemies in this context.

Benenson and coworkers from Emmanuel College have just published an interesting study in Current Biology on collaborations between full professors and assistant professors entitled “Rank influences human sex differences in dyadic cooperation”.

By tabulating the joint publications, they found 76 same-sex publications from male full professors, which should be compared to a random expectation of 61 such publications. By contrast they found only 14 same-sex publications from female full professors with the random expectation being 29. In other words, whereas male full professors collaborated 25% more with male assistant professors than expected, female full professors collaborated more than 50% less with female assistant professors than expected. The authors conclude:
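The two percentages follow directly from the observed and expected counts quoted above. A quick check with awk (the variable names are mine, not the study's):

```shell
# Relative deviation of observed from expected same-sex publications, in percent.
male=$(awk 'BEGIN { printf "%.0f", (76 / 61 - 1) * 100 }')
female=$(awk 'BEGIN { printf "%.0f", (14 / 29 - 1) * 100 }')
echo "male full professors: ${male}%"      # 25, i.e. about 25% more than expected
echo "female full professors: ${female}%"  # -52, i.e. more than 50% less than expected
```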

Our results are consistent with observations suggesting that social structure takes differing forms for human males and females. Males’ tendency to interact in same-gender groups makes them more prone to cooperation with asymmetrically ranked males. In contrast, females’ tendency to restrict their same-gender interactions to equally ranked individuals make them more reluctant to cooperate with asymmetrically ranked females.

There is, in other words, a tendency for high-ranking professors of both genders to preferentially collaborate with lower-ranking male professors as opposed to lower-ranking female professors. If anything, that bias appears to be stronger in the case of high-ranking female professors than high-ranking male professors.


Commentary: Coffee, a prerequisite for research?

January 14, 2014

Yesterday, I stumbled upon two links that I found interesting. The first was the map-based data visualization blog post 40 Maps That Will Help You Make Sense of the World, in which maps 24 and 28 hint at a correlation (click for larger interactive versions):

Number of Researchers per million inhabitants by Country

Current Worldwide Annual Coffee Consumption per capita

The first map shows the number of researchers per million inhabitants in each country. The second map shows the kilograms of coffee consumed per capita per year. As ChartsBin allows you to download the data behind each map, I did so and produced a scatter plot that confirms the strong correlation (click for larger version):

coffee_vs_researchers

This confirms my view that the coffee machine is the most important piece of hardware in a bioinformatics group. Bioinformaticians with coffee can do work even without a computer, but bioinformaticians without coffee are unable to work, no matter how good their computers are.

One should of course be careful to not jump to conclusions about causality based on correlation. This leads me to the second link: a new study published in Nature Neuroscience, which shows that Post-study caffeine administration enhances memory consolidation in humans.

I optimistically await a similar study confirming the correlation between Chocolate Consumption, Cognitive Function, and Nobel Laureates published last year in New England Journal of Medicine.


Exercise: Hands-on text mining of protein-disease associations

October 21, 2013

Background

In this exercise you will be processing two sets of documents on prostate cancer and schizophrenia, respectively. You will use a program called tagdir along with a pre-made dictionary of human protein names to identify proteins relevant for each disease and to find a protein that links the two diseases to each other.

The software and data files required for this exercise are available as a tarball. To run this exercise, you will need to know how to compile a C++ program; the program also requires the Boost C++ Libraries.

How to run the tagger

To execute the tagger program, you need to run a command with the following format:

tagdir dictionary_file blacklist_file documents_directory > matches_file

The argument dictionary_file should be the name of a tab-delimited file containing the dictionary of names to be tagged in the text. For this exercise you will always want to use the file human_proteins.tsv.

The argument blacklist_file should be the name of a tab-delimited file containing the exact variants of names that should not be tagged despite the name being in the dictionary. To tag every variant of every name, you can simply use an empty file (empty.tsv), but you will also be making your own file later as part of the exercise.

The argument documents_directory should be the name of a directory containing text files with the documents to be processed. For this exercise two such directories will be used, namely prostate_cancer_documents and schizophrenia_documents.

Finally you give a name for where to put the tab-delimited output from the tagger (matches_file). You can call these files whatever you like, but descriptive names will make your life easier, for example, prostate_cancer_matches.tsv.

When running the tagger on one of the two directories of documents, you should expect the run to take approximately 2 minutes.

File formats

The tagger program requires the input to be in very specific formats. The dictionary_file must be a tab-delimited file that looks as follows:

263094 ABCA7
285238 ABCC3
354995 ABCG1
268129 ABHD2
233710 ACADL

The first column must be a number that uniquely identifies the protein in question. In human_proteins.tsv the numbers used are the ENSP identifiers from Ensembl, with the letters ENSP and any leading zeros removed. The second column is a name for the protein. If a protein has multiple names, there will be several lines, each listing the same number (i.e. the same protein) but a different name for it.
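If you want to go back and forth between the numbers in the dictionary and full Ensembl identifiers, a minimal sketch could look as follows. The two function names are made up for this illustration, and the sketch assumes that a full identifier is ENSP followed by 11 zero-padded digits, as in ENSP00000262120:

```shell
# Hypothetical helpers; not part of tagdir or the exercise tarball.
# Assumption: a full ID is "ENSP" followed by 11 zero-padded digits.

# Strip the ENSP prefix and any leading zeros.
ensp_to_number() {
    printf '%s\n' "$1" | sed 's/^ENSP0*//'
}

# Pad the number back to 11 digits and restore the prefix.
number_to_ensp() {
    awk -v n="$1" 'BEGIN { printf "ENSP%011d\n", n }'
}

ensp_to_number ENSP00000262120   # prints 262120
number_to_ensp 262120            # prints ENSP00000262120
```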

The blacklist_file must similarly be a tab-delimited file in which each line specifies a specific variant of a name and whether it should be blocked or not:

ACP t
ACS t
ACTH f
activin f
act t

The letter “t” in the second column means TRUE (i.e. that the variant should be blocked) and “f” means FALSE (i.e. that it should not be blocked). The file above would thus block ACP, ACS, and act but not ACTH and activin. Because variants are by default not blocked, you in principle only need to add lines with “t”; however, the lines with “f” can be very useful to keep track of which variants you have already looked at and actively decided not to block, as opposed to variants that are not blocked because you have not looked at them.
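To check what a blacklist file actually does, you can list the blocked and the deliberately kept variants separately. The sketch below recreates the five example lines in a temporary file; with your own file you would point awk at that instead:

```shell
# Write the example blacklist from the text to a temporary, tab-delimited file.
blacklist=$(mktemp)
printf 'ACP\tt\nACS\tt\nACTH\tf\nactivin\tf\nact\tt\n' > "$blacklist"

# Variants that will be blocked (second column is "t").
blocked=$(awk -F '\t' '$2 == "t" { print $1 }' "$blacklist")
# Variants that were inspected and deliberately kept (second column is "f").
kept=$(awk -F '\t' '$2 == "f" { print $1 }' "$blacklist")

echo "blocked:" $blocked
echo "kept:" $kept
rm -f "$blacklist"
```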

The output of the tagger is also tab-delimited, each line specifying a possible meaning of a match in a document:

7478532.txt 254 256 TSG 262120
7478532.txt 254 256 TSG 416324
7478532.txt 595 597 LPL 309757
7478532.txt 595 597 LPL 315757
7478532.txt 658 661 NEFL 221169
7478532.txt 736 738 LPL 309757
7478532.txt 736 738 LPL 315757

The first column gives the name of the file (in our case, the numbers are the PubMed IDs of the abstracts). The next two columns specify the first and last character positions of the match. The fourth column shows which string appeared at that place in the document. The fifth column specifies which protein this could be (i.e. the Ensembl protein number). The first line in the output above thus means that the document 7478532.txt from character 254 to character 256 says TSG, which could mean the protein ENSP00000262120. The second line shows that it alternatively could be ENSP00000416324.
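Because each line is one possible meaning, a single mention in the text can give rise to several lines. As a rough illustration, the sketch below recreates the seven example lines and counts how many distinct mentions they represent and how many of those have more than one candidate protein:

```shell
# Recreate the example output from the text in a temporary, tab-delimited file.
matches=$(mktemp)
printf '%s\t%s\t%s\t%s\t%s\n' \
    7478532.txt 254 256 TSG 262120 \
    7478532.txt 254 256 TSG 416324 \
    7478532.txt 595 597 LPL 309757 \
    7478532.txt 595 597 LPL 315757 \
    7478532.txt 658 661 NEFL 221169 \
    7478532.txt 736 738 LPL 309757 \
    7478532.txt 736 738 LPL 315757 > "$matches"

# A mention is identified by file, start, end, and string (columns 1-4).
mentions=$(cut -f 1-4 "$matches" | sort -u | wc -l | tr -d ' ')
# uniq -d prints one copy of every line that occurs more than once.
ambiguous=$(cut -f 1-4 "$matches" | sort | uniq -d | wc -l | tr -d ' ')

echo "$mentions mentions, $ambiguous with more than one candidate protein"
rm -f "$matches"
```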

Making a blacklist

As you will see if you run tagdir using an empty blacklist_file, a simple dictionary matching approach leads to very many wrong matches unless it is complemented by a good list of name variants to be blocked.

To identify potential name variants that you might want to put on the blacklist, you will want to count the number of occurrences of each and every name variant in an entire corpus of text. A very helpful UNIX command for doing this based on a matches_file (produced by tagdir) is:

cut -f 4 matches_file | sort | uniq -c | sort -nr | less

It first cuts out the exact name variants in column 4 of your matches file (cut -f 4), sorts them so that multiple copies of the same variant end up right after each other (sort), counts identical adjacent lines (uniq -c), sorts that list in reverse numerical order so that the highest counts come first (sort -nr), and shows the result in a pager so that you can scroll through it (less).
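To see what the pipeline produces without running the tagger first, you can try it on the seven example output lines shown earlier. This is only a sketch: the pager is left out, and the final awk step is my addition to tidy up the padded counts that uniq -c prints (it assumes single-word name variants):

```shell
# Recreate the example tagger output in a temporary, tab-delimited file.
matches=$(mktemp)
printf '%s\t%s\t%s\t%s\t%s\n' \
    7478532.txt 254 256 TSG 262120 \
    7478532.txt 254 256 TSG 416324 \
    7478532.txt 595 597 LPL 309757 \
    7478532.txt 595 597 LPL 315757 \
    7478532.txt 658 661 NEFL 221169 \
    7478532.txt 736 738 LPL 309757 \
    7478532.txt 736 738 LPL 315757 > "$matches"

# Count the lines per name variant, most frequent first, as "variant<TAB>count".
counts=$(cut -f 4 "$matches" | sort | uniq -c | sort -nr | awk '{ print $2 "\t" $1 }')
echo "$counts"
rm -f "$matches"
```

On this tiny sample LPL comes out on top with four lines, followed by TSG with two and NEFL with one.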

The next step is to simply go through the most frequently occurring name variants and manually decide which ones are indeed correct protein names that just occur very frequently, and which ones occur very frequently because they mean something entirely different than the protein. The latter should be put on your blacklist.

Once you have produced a blacklist, you can rerun the tagdir command and you will get much less output because vast numbers of false positive matches have been filtered away. You will find that adding just the worst few offenders to the blacklist helps a lot; however, as you go further down the list, the effort spent on inspecting more names gives diminishing returns.

Finding proteins associated to a disease

Making a very good blacklist would take too long for this exercise. For the following exercise it is thus recommended that you use the provided file blacklist_10.tsv.

The goal is now to produce a top-20 list of proteins that are most often mentioned in papers about prostate cancer. To achieve this, you need to rerun the tagging of human proteins in the prostate cancer documents, making use of the above-mentioned blacklist. To extract a top-20, you can use a command similar to the one used to create the list of most commonly appearing name variants.
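One possible way to do this, shown here on made-up data, is to count protein numbers (column 5) instead of name variants (column 4). The file names, positions, and counts in the sketch are invented for illustration; only the column layout is taken from the tagger output described above:

```shell
# Toy matches file (invented data) in the tagger's five-column format.
matches=$(mktemp)
printf '%s\t%s\t%s\t%s\t%s\n' \
    doc1.txt 10 12 LPL 309757 \
    doc1.txt 40 42 LPL 309757 \
    doc2.txt 5 7 LPL 309757 \
    doc2.txt 20 22 TSG 262120 \
    doc3.txt 8 10 TSG 262120 \
    doc3.txt 30 33 NEFL 221169 > "$matches"

# Count mentions per protein number and keep the 20 most frequent,
# printed as "protein<TAB>count".
top=$(cut -f 5 "$matches" | sort | uniq -c | sort -nr | head -n 20 |
    awk '{ print $2 "\t" $1 }')
echo "$top"
rm -f "$matches"
```

This counts every mention. If you would rather count the number of documents that mention each protein, you could put cut -f 1,5 and sort -u in front and then cut out the protein column before counting.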

Find a protein linked to both diseases

Starting from the top-20 list of prostate cancer proteins, find one or more proteins that are also frequently mentioned in papers on schizophrenia. To do so, you will also need to tag the set of documents on schizophrenia, count the number of mentions of every protein, and check the count of every protein on the top-20 list for prostate cancer in the schizophrenia set.
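The lookup can be done by hand, but it can also be sketched with join. The example below uses invented protein numbers in place of column 5 of the two real matches files (999999 is a placeholder, not a real identifier), so it only illustrates the mechanics:

```shell
workdir=$(mktemp -d)
TAB=$(printf '\t')

# Toy column-5 values standing in for `cut -f 5` of the two real matches files.
printf '%s\n' 309757 309757 309757 262120 262120 221169 > "$workdir/prostate_ids.txt"
printf '%s\n' 262120 262120 262120 262120 999999 999999 221169 > "$workdir/schizophrenia_ids.txt"

# Top-20 protein numbers for prostate cancer, sorted as join requires.
sort "$workdir/prostate_ids.txt" | uniq -c | sort -nr | head -n 20 |
    awk '{ print $2 }' | sort > "$workdir/top20.txt"

# Mention counts for every protein in the schizophrenia set: "protein<TAB>count".
sort "$workdir/schizophrenia_ids.txt" | uniq -c |
    awk '{ print $2 "\t" $1 }' | sort -t "$TAB" -k 1,1 > "$workdir/schizophrenia_counts.tsv"

# Keep only the top-20 proteins and order them by their schizophrenia count.
shared=$(join -t "$TAB" "$workdir/top20.txt" "$workdir/schizophrenia_counts.tsv" |
    sort -t "$TAB" -k 2,2nr)
echo "$shared"
rm -rf "$workdir"
```

Proteins from the top-20 list that never occur in the schizophrenia set simply drop out of the join.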


Editorial: Return from blog hiatus

October 21, 2013

For the past six months this blog has sadly been completely silent. The good news is that I plan to get back to blogging now. My reason for the blog hiatus has been exceptionally busy times in the group.

The Novo Nordisk Foundation Center for Protein Research has undergone its first five-year review. As I was the first group leader to start at the center, I have also just been through the first five-year evaluation of my group, which was concluded with a site visit three days ago by Janet Thornton, Alfonso Valencia, and Dmitrij Frishman. Finally, because I was employed as a professor on a five-year contract, I am in parallel being evaluated for contract renewal.

In parallel with these three big evaluations, my first two Ph.D. students Damian Szklarczyk and Heiko Horn have handed in and successfully defended their theses. Congratulations to both of them!

Enough bad excuses. Time to get back to blogging :-)


Resource: Antibodypedia bulk download file and STRING payload

April 28, 2013

Antibodypedia, developed by Antibodypedia AB and Nature Publishing Group, is a very useful resource for finding commercially available antibodies against human proteins.

The resource is made available under the Creative Commons Attribution-NonCommercial 3.0 license, which allows for reuse and redistribution of the data for non-commercial purposes. However, the data are available only for browsing through a web interface, which greatly limits systems biology uses of the resource. I thus wrote a robot to scrape all information from the web resource and convert it into a convenient tab-delimited file, which I have made available for download under the same license. This dataset covers a total of 579,038 antibodies against 16,827 human proteins.

To be able to use the dataset in conjunction with STRING and related resources, I next mapped the proteins to STRING protein identifiers. I was able to map 92% of all proteins in Antibodypedia. Having done this, I created the necessary files for the STRING payload mechanism to be able to show the information from Antibodypedia directly within STRING.

The end result looks like this when searching for the WNT7A protein:

Antibodypedia STRING network

The halos around the proteins encode the type and number of antibodies available. Red rings imply that at least one monoclonal antibody exists whereas gray rings imply that only polyclonal antibodies exist. The darker the ring (be it red or gray), the more different antibodies are available.

The STRING payload mechanism also extends the popups with additional information, here shown for LRP6:

Antibodypedia STRING popup

The popup shows the total number of antibodies available and how many of them are monoclonal. It also provides a direct linkout to the relevant protein page on Antibodypedia.

Please feel free to use this Antibodypedia-STRING mashup.


Editorial: Goodbye Google Reader – a reminder why open standards matter

March 14, 2013

This morning I woke up to the announcement that Google will be powering down Google Reader, which has long been my RSS reader of choice. RSS feeds are crucial to me because they are where I follow numerous science-related blogs, read automated PubMed searches, and receive tables of contents from selected journals.

When I recently bought an iPad Mini, however, I discovered to my surprise that there was no Google Reader app for iPad. This made me strongly suspect that Google had no plans to continue Google Reader. It also made me discover Feedly, which turned out to be so good that I preferred reading my RSS feeds on the iPad as opposed to on my computer. I have now installed Feedly for Chrome as well as the Android applet on my phone, so I consider myself fully migrated already. I think this is a lesson that shows the importance of open standards – whereas RSS feeds are crucial to me, replacing the viewer is no big deal.


Announcement: ICSB2013 in Copenhagen

March 8, 2013

It is my great pleasure to announce that I am co-organizing the 14th International Conference on Systems Biology, which will take place in Copenhagen, Denmark on August 30 – September 3, 2013.

ICSB2013

The conference will feature presentations on a wide spectrum of systems biology topics from a truly spectacular lineup of international, high-profile keynote and session speakers.

Confirmed keynote speakers:
Alexander van Oudenaarden, NL
Anne-Claude Gavin, DE
Ben Neel, CA
Bernhard Palsson, DK
Chris Voigt, USA
Dana Pe’er, USA
Doug Lauffenburger, USA
Elaine Mardis, USA
Gene Myers, DE
Jennifer Lippincott-Schwartz, USA
Kim Sneppen, DK
Lars Steinmetz, DE
Levi Garraway, USA
Marc Vidal, USA
Matthias Mann, DE
Peer Bork, DE
Philippe Bastiaens, DE
Rama Ranganathan, USA
Ruedi Aebersold, CH
Stuart Kauffman, USA
Wendell Lim, USA

Confirmed session speakers:
Bernd Bodenmiller, CH
Bob Murphy, USA
Chris Newgard, USA
Eske Willerslev, DK
Felix Naef, CH
Gerard Manning, USA
Giulio Superti-Furga, AT
Greg Stephanopoulos, USA
Haja Kadarmideen, DK
Hans Westerhoff, NL
James Faeder, USA
Janine Erler, DK
Jasmin Fisher, UK
Jens Nielsen, DK
Lucas Pelkmans, CH
Morten Sommer, DK
Neal Rosen, USA
Norbert Perrimon, USA
Rune Linding, DK
Søren Brunak, DK
Thomas Sicheritz Pontén, DK
Mikkel W. Pedersen, DK
Julio Saez-Rodriguez, UK
Michael Lee, USA
Luis Serrano, ES
Nevan Krogan, USA
Seán O’Donoghue, AU
Jonathan Karr, USA

Organizing committee
Niels-Henrik Holstein-Rathlou
Søren Brunak
Jens Christian Brings Jacobsen
Lars Juhl Jensen
Jens Christian Brasen
Rune Linding
Morten Sommer
Jørgen K Kanters
Olga Sosnovtseva

To find out more, please check out the conference web site.


Analysis: Science used to be simpler

January 5, 2013

I guess most people have a feeling that life used to be simpler in the past. The other day it occurred to me that we researchers very often talk about how advanced our methods are, although simple methods are in many cases preferable.

So this morning I resorted to my usual strategy for analyzing such things, namely counting in Medline. More specifically I calculated for each year the percentage of publication titles that contain the words “simple” and “advanced”, respectively. In the plot below, the dots show the values for each year and the lines show five-year running averages thereof (click for PDF version):

simple_or_advanced

As can be clearly seen, life as a researcher was indeed simpler in the 50s and 60s.

