Monthly Archives: December 2015

Job: Text mining position at Intomics A/S

At Intomics A/S, we are looking for a text-mining expert to perform contract research and develop tailor-made solutions. The job will primarily involve solving text-mining problems for clients in the pharmaceutical industry.

Note that the application deadline is January 15, just over two weeks from now. For further details, please read the job advert below the fold.


Resource: The TISSUES database on tissue expression of genes and proteins

As mentioned in the last entry, 2015 has been a year of publishing web resources for my group. The COMPARTMENTS and DISEASES databases have yet another sister resource, namely TISSUES.

This web resource allows users to easily obtain a color-coded schematic of the tissue expression of a protein of interest, providing an at-a-glance overview of evidence from database annotations, from proteomics and transcriptomics studies as well as from automatic text mining of the scientific literature:

[Figure: TISSUES body schematic]

Whereas the resource integrates all of the above-mentioned types of evidence, the focus in this work was primarily on combining data from systematic tissue expression atlases, produced using a variety of different high-throughput assays. This required extensive work on mapping, scoring, and benchmarking the different datasets to put them on a common confidence scale. The scientific results and details of all those analyses can be found in the article “Comprehensive comparison of large-scale tissue expression datasets”.
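To give a rough idea of what putting datasets on a common confidence scale can mean, here is a minimal Python sketch. It is not the procedure used in the article. It assumes a generic calibration approach: rank the gene–tissue pairs of one dataset by raw score, cut the ranking into bins, and report the fraction of pairs in each bin that agree with a gold standard. The function name `calibrate`, the toy scores, and the toy gold standard are all hypothetical.

```python
from typing import Dict, List, Set, Tuple

Pair = Tuple[str, str]  # (gene, tissue)


def calibrate(scores: Dict[Pair, float], gold: Set[Pair],
              bin_size: int) -> List[Tuple[float, float]]:
    """Return (lowest raw score in bin, fraction of bin found in gold)
    for each bin, going from the best-scoring pairs downwards."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    curve = []
    for start in range(0, len(ranked), bin_size):
        chunk = ranked[start:start + bin_size]
        hits = sum(1 for pair, _ in chunk if pair in gold)
        curve.append((chunk[-1][1], hits / len(chunk)))
    return curve


# Toy data, made up for illustration only.
toy_scores = {
    ("G1", "liver"): 9.0,
    ("G2", "liver"): 8.0,
    ("G3", "heart"): 3.0,
    ("G4", "heart"): 1.0,
}
toy_gold = {("G1", "liver"), ("G2", "liver"), ("G3", "heart")}

for min_score, precision in calibrate(toy_scores, toy_gold, bin_size=2):
    print(f"raw score >= {min_score}: agreement with gold standard {precision:.2f}")
```

Because the resulting fractions are measured against the same gold standard for every dataset, they can be compared across datasets even when the raw scores come from very different assays.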

Resource: The DISEASES database on disease–gene associations

2015 has been an exceptionally busy year in my group in terms of publishing databases and other web resources; so busy that I have failed to write blog posts describing several of them.

One of them is the DISEASES database, which is described in detail in an article with the informative, if not very inventive, title “DISEASES: Text mining and data integration of disease–gene associations”.

The DISEASES database can be viewed as a sister resource to the subcellular localization database COMPARTMENTS, which you can read more about in this blog post. Indeed, the two resources share much of their infrastructure, including the web framework, the backend database, and the text-mining pipeline.

The big difference between the two resources is the scope: whereas COMPARTMENTS links proteins to their subcellular localizations, DISEASES links them to the diseases in which they are implicated. To this end, we make use of the Disease Ontology, which turned out to be very well suited for text-mining purposes because it provides many synonyms for its terms. Text mining is the most important source of associations but is complemented by manually curated associations from Genetics Home Reference and UniProtKB as well as GWAS results imported from DistiLD.
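To illustrate why synonyms matter for text mining, here is a minimal Python sketch of dictionary-based matching. It is only a toy and not the text-mining pipeline behind DISEASES: the identifiers `TERM:1` and `TERM:2`, the synonym lists, and the function names `build_pattern` and `tag` are all made up for the example and are not taken from the Disease Ontology.

```python
import re
from typing import Dict, List, Pattern, Tuple

# Hypothetical toy dictionary: term identifier -> names and synonyms.
# Identifiers and synonym lists are invented for illustration.
DICTIONARY: Dict[str, List[str]] = {
    "TERM:1": ["breast cancer", "breast carcinoma", "mammary cancer"],
    "TERM:2": ["Alzheimer's disease", "Alzheimer disease"],
}


def build_pattern(dictionary: Dict[str, List[str]]) -> Tuple[Pattern[str], Dict[str, str]]:
    """Compile one case-insensitive regex covering every name and synonym."""
    lookup = {}
    for term_id, names in dictionary.items():
        for name in names:
            lookup[name.lower()] = term_id
    # Try longer names first so they are preferred over shorter ones.
    alternatives = sorted(lookup, key=len, reverse=True)
    regex = r"\b(?:" + "|".join(re.escape(name) for name in alternatives) + r")\b"
    return re.compile(regex, re.IGNORECASE), lookup


def tag(text: str, pattern: Pattern[str], lookup: Dict[str, str]) -> List[Tuple[int, int, str]]:
    """Return (start, end, term identifier) for every dictionary match."""
    return [(m.start(), m.end(), lookup[m.group(0).lower()])
            for m in pattern.finditer(text)]


pattern, lookup = build_pattern(DICTIONARY)
sentence = "BRCA1 is implicated in mammary cancer and breast carcinoma."
for start, end, term_id in tag(sentence, pattern, lookup):
    print(sentence[start:end], "->", term_id)
```

Both mentions in the toy sentence resolve to the same term even though neither uses its primary name, which is the benefit a synonym-rich ontology brings to this kind of matching.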

To facilitate usage in large-scale analysis and integration into other databases, all data in DISEASES are available for download. Indeed, the text-mined associations from DISEASES are already included in both GeneCards and Pharos.

Commentary: The sad tale of MutaDATABASE

The problem of bioinformatics web resources dying or moving is well known. It has been quantified in two interesting papers by Jonathan Wren entitled “404 not found: the stability and persistence of URLs published in MEDLINE” and “URL decay in MEDLINE — a 4-year follow-up study”. There is also a discussion on the topic at Biostar.

The resources discussed in these papers at least existed in an operational form at the time of publication, even if they have since perished. The same cannot be said about MutaDATABASE, which in 2011 was published in Nature Biotechnology as a correspondence entitled “MutaDATABASE: a centralized and standardized DNA variation database”. Fellow blogger Neil Saunders was quick to pick up on the fact that this database was an empty shell, but generously gave the authors the benefit of the doubt in his closing statement:

Who knows, MutaDatabase may turn out to be terrific. Right now though, it’s rather hard to tell. The database and web server issues of Nucleic Acids Research require that the tools described be functional for review and publication. Apparently, Nature Biotechnology does not.

Now, almost five years after the original publication, I think it is fair to follow up. Unfortunately, MutaDATABASE did not turn out to be terrific. Instead, it turned out just not to be. In March 2014, about three years after the publication, www.mutadatabase.org looked like this:
[Screenshot: MutaDATABASE in 2014]

By the end of 2015, the website had mutated into this:
[Screenshot: MutaDATABASE in 2015]

To quote Joel Spolsky: “Shipping is a feature. A really important feature. Your product must have it.” This also applies to biological databases and other bioinformatics resources, which is why journals would be wise never to publish any resource without this crucial feature.