Tag Archives: database

Resource: The TISSUES database on tissue expression of genes and proteins

As mentioned in the last entry, 2015 has been a year of publishing web resources for my group. The COMPARTMENTS and DISEASES databases have yet another sister resource, namely TISSUES.

This web resource allows users to easily obtain a color-coded schematic of the tissue expression of a protein of interest, providing an at-a-glance overview of evidence from database annotations, from proteomics and transcriptomics studies as well as from automatic text mining of the scientific literature:


Whereas the resource integrates all of the above-mentioned types of evidence, the focus in this work was primarily on combining data from systematic tissue expression atlases, produced using a variety of different high-throughput assays. This required extensive work on mapping, scoring, and benchmarking the different datasets to put them on a common confidence scale. The scientific results and details of all those analyses can be found in the article “Comprehensive comparison of large-scale tissue expression datasets”.

Resource: The DISEASES database on disease–gene associations

2015 has been an exceptionally busy year in my group in terms of publishing databases and other web resources; so busy that I have failed to write blog posts describing several of them.

One of them is the DISEASES database, which is described in detail in an article with the informative, if not very inventive title “DISEASES: Text mining and data integration of disease–gene associations”.

The DISEASES database can be viewed as a sister resource to the subcellular localization database COMPARTMENTS, which you can read more about in this blog post. Indeed, the two resources share much of their infrastructure, including the web framework, the backend database, and the text-mining pipeline.

The big difference between the two resources is the scope: whereas COMPARTMENTS links proteins to their subcellular localizations, DISEASES links them to the diseases in which they are implicated. To this end we make use of the Disease Ontology, which turned out to be very well suited for text-mining purposes due to its many synonyms for terms. Text mining is the most important source of associations but is complemented by manually curated associations from Genetics Home Reference and UniProtKB as well as GWAS results imported from DistiLD.

To facilitate usage in large-scale analysis and integration into other databases, all data in DISEASES are available for download. Indeed, the text-mined associations from DISEASES are already included in both GeneCards and Pharos.

Commentary: The sad tale of MutaDATABASE

The problem of bioinformatics web resources dying or moving is well known. It has been quantified in two interesting papers by Jonathan Wren entitled “404 not found: the stability and persistence of URLs published in MEDLINE” and “URL decay in MEDLINE — a 4-year follow-up study”. There is also a discussion on the topic at Biostar.

The resources discussed in these papers at least existed in an operational form at the time of publication, even if they have since perished. The same cannot be said about MutaDATABASE, which in 2011 was published in Nature Biotechnology as a correspondence entitled “MutaDATABASE: a centralized and standardized DNA variation database”. Fellow blogger Neil Saunders was quick to pick up on the fact that this database was an empty shell, but generously gave the authors the benefit of the doubt in his closing statement:

Who knows, MutaDatabase may turn out to be terrific. Right now though, it’s rather hard to tell. The database and web server issues of Nucleic Acids Research require that the tools described be functional for review and publication. Apparently, Nature Biotechnology does not.

Now, almost five years after the original publication, I think it is fair to follow up. Unfortunately, MutaDATABASE did not turn out to be terrific. Instead, it turned out just not to be. In March 2014, about three years after the publication, www.mutadatabase.org looked like this:
MutaDATABASE in 2014

By the end of 2015, the website had mutated into this:
MutaDATABASE in 2015

To quote Joel Spolsky: “Shipping is a feature. A really important feature. Your product must have it.” This also applies to biological databases and other bioinformatics resources, which is why journals would be wise never to publish any resource without this crucial feature.

Analysis: Does a publication matter?

This may seem a strange question for someone working in academia to ask – of course a publication matters, especially if it is cited a lot. However, when it comes to publications about web resources, publications and citations in my opinion mainly serve as somewhat odd proxies on my CV for what really matters: the web resources themselves and how much they are used.

Still, one could hope that a publication about a new web resource would make people aware of its existence and thus attract more users. To analyze this, I took a look at the user statistics of our recently developed resource COMPARTMENTS:

COMPARTMENTS user statistics

Before we published a paper about it, the web resource had fewer than 5 unique users per day. Our paper about the resource was accepted on January 26 in the journal Database, which increased the usage to about 10 unique users on a typical weekday. The spike of 41 unique users in a single day was due to me teaching a course.

So what happened at the end of June that gave a more than 10-fold increase in the number of users from one day to the next? A new version of GeneCards was released with links to COMPARTMENTS. It seems safe to conclude that the peer-reviewed literature is not where most researchers discover new tools.

Resource: The COMPARTMENTS database on protein subcellular localization

Together with collaborators in the groups of Seán O’Donoghue and Reinhard Schneider, my group has recently launched a new web-accessible database named COMPARTMENTS.

COMPARTMENTS unifies subcellular localization evidence from many sources by mapping all proteins and compartments to their STRING identifiers and Gene Ontology terms, respectively. We import curated annotations from UniProtKB and model organism databases and assign confidence scores to them based on their evidence codes. For human proteins, we similarly import and score evidence from The Human Protein Atlas. COMPARTMENTS also uses text mining to derive subcellular localization evidence from co-occurrence of proteins and compartments in Medline abstracts. Finally, we precompute subcellular localization predictions with the sequence-based methods WoLF PSORT and YLoc. For further details, please refer to our recently published paper entitled “COMPARTMENTS: unification and visualization of protein subcellular localization evidence”.
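As a toy illustration of what unifying evidence channels onto a common confidence scale amounts to, one might combine per-channel scores like this. The channel names and the simple max rule below are my own simplification for illustration, not the published scoring scheme:

```python
def unified_confidence(evidence):
    """Toy unification of localization evidence channels.

    `evidence` maps a channel name (e.g. "knowledge", "textmining")
    to a dict of {GO term: confidence score}. Returns, for each GO
    term, the highest confidence seen across channels -- a
    simplification, not the actual COMPARTMENTS scoring scheme.
    """
    combined = {}
    for channel_scores in evidence.values():
        for term, conf in channel_scores.items():
            combined[term] = max(conf, combined.get(term, 0))
    return combined
```

For example, a protein with curated nuclear annotation and weaker text-mining evidence for the cytoplasm would end up with a high score for the nucleus and a lower one for the cytoplasm.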

To provide a simple overview of all this information, we visualize the combined localization evidence for each protein onto a schematic of an animal, fungal, or plant cell:




You can click any of the three images above to go to the COMPARTMENTS web resource. To facilitate use in large-scale analyses, the complete datasets for major eukaryotic model organisms are available for download.

Resource: Antibodypedia bulk download file and STRING payload

Antibodypedia, developed by Antibodypedia AB and Nature Publishing Group, is a very useful resource for finding commercially available antibodies against human proteins.

The resource is made available under the Creative Commons Attribution-NonCommercial 3.0 license, which allows for reuse and redistribution of the data for non-commercial purposes. However, the data are available only for browsing through a web interface, which greatly limits systems biology uses of the resource. I thus wrote a robot to scrape all information from the web resource and convert it into a convenient tab-delimited file, which I have made available for download under the same license. This dataset covers a total of 579,038 antibodies against 16,827 human proteins.

To be able to use the dataset in conjunction with STRING and related resources, I next mapped the proteins to STRING protein identifiers. I was able to map 92% of all proteins in Antibodypedia. Having done this, I created the necessary files for the STRING payload mechanism to be able to show the information from Antibodypedia directly within STRING.

The end result looks like this when searching for the WNT7A protein:

Antibodypedia STRING network

The halos around the proteins encode the type and number of antibodies available. Red rings imply that at least one monoclonal antibody exists whereas gray rings imply that only polyclonal antibodies exist. The darker the ring (be it red or gray), the more different antibodies are available.
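The precise mapping from antibody counts to ring shades is not spelled out here; purely as an illustration, a scheme of this kind could be implemented as follows (the thresholds and exact shades are my own invention):

```python
def ring_color(n_monoclonal, n_polyclonal):
    """Map antibody counts to an illustrative halo color.

    Red if at least one monoclonal antibody exists, gray if only
    polyclonal ones do; more antibodies give a darker ring.
    Returns an (R, G, B) tuple, or None if there are no antibodies.
    Thresholds and shades are invented for illustration only.
    """
    total = n_monoclonal + n_polyclonal
    if total == 0:
        return None  # no antibodies, no halo
    # More antibodies -> smaller brightness factor -> darker ring
    factor = max(0.3, 1.0 - 0.1 * total)
    base = (200, 30, 30) if n_monoclonal > 0 else (130, 130, 130)
    return tuple(int(c * factor) for c in base)
```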

The STRING payload mechanism also extends the popups with additional information, here shown for LRP6:

Antibodypedia STRING popup

The popup shows the total number of antibodies available and how many of them are monoclonal. It also provides a direct linkout to the relevant protein page on Antibodypedia.

Please, feel free to use this Antibodypedia-STRING mashup.

Resource: Adding bells & whistles to GreenMamba

My latest blog post ended at the stage where we had combined the Instances database and the Motifs tool into a single metatool. In this post I will show how little it takes to add the bells and whistles that turn it into the complete, professional web resource that I showed as a teaser in the first blog post of this series.

You may not want green to be the design color used throughout your web interface. This is easily changed by adding a line like color : #083D65 to your inifile. You can use named colors instead of hex values if you prefer. Whichever color you pick will be used throughout the web interface to ensure a consistent design.

In the simple default design the frame changes size when changing between the Motifs and Instances input forms, because the forms are not equally wide. This can easily be fixed by adding a line such as width : 650px to set a fixed width for all pages. You do not necessarily have to specify the width in pixels; any unit permitted in cascading style sheets can be used.

Most bioinformatics web resources require one or more pages to explain what the resource is all about. Such pages can easily be provided within the GreenMamba framework by adding lines with the same syntax as page_home. If you add a page_about line, you will get an ABOUT menu item at the top right, which when clicked will show the provided HTML text wrapped within the GreenMamba layout to provide a consistent look. There is nothing magic about the word “about”; for example, if you write page_download you will get a page named DOWNLOAD.

You may also want to add a footer that is shown at the bottom of every page and that, for example, mentions who made the resource, whom to contact in case of scientific questions or technical problems, and possibly points to one or more papers that describe the tools and which the user is requested to cite. To insert a footer you simply add a line to the inifile with the keyword footer followed by the text you want shown; this text can contain HTML code.

If you set up a Mamba server to host a single resource, you will want the Mamba server to automatically direct users to the main input form in case they access the server without requesting a specific page. For example, we would want to redirect requests for localhost:8080 to localhost:8080/HTML/ELM. This can be done in the [REWRITE] section of the inifile, which allows you to specify simple URL rewrite rules similar to what can be done in Apache.
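As an illustration, such a section might look like the following. Note that this is only a sketch of the idea; the exact rule syntax is not documented in this post and may well differ in the actual Mamba distribution:

```ini
[REWRITE]
/ : /HTML/ELM
```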

Below is the inifile required to set up the complete ELM example resource as it was shown in the first blog post of this series:

[SERVER]
host : localhost
port : 8080
plugins : ./greenmamba

[Instances]
database : greenmamba/examples/instances.tsv

[Motifs]
command : greenmamba/examples/motifs.pl $motif @fasta
page_home : greenmamba/examples/motifs_home.html

[ELM]
tools : Motifs; Instances;
color : #083D65
width : 650px
footer : Disclaimer: This ELM mirror only serves as an example for the GreenMamba framework. For scientific purposes, please use the real ELM server instead.
page_about : greenmamba/examples/elm_about.html

Starting up the Mamba server with this inifile and accessing localhost:8080 yields this interface:

Clicking the ABOUT link brings up the contents of the file elm_about.html wrapped with the GreenMamba design elements:

In case you want to include pictures or other content on your pages, you do not need a separate web server to host this. Mamba implements a simple web server that you can use for this purpose; all you have to do is to add a www_dir : <directory> in the [SERVER] section of the inifile and place the files you want to serve within the specified directory.

Finally, the output pages of the metatool are also formatted to follow the design specified in the inifile. The header shows the name of the metatool, the color matches that of the other pages, the menu with links to the pages is shown, and the footer is included:

Resource: Combining tools and databases into a single GreenMamba web resource

In the four previous blog posts I introduced the GreenMamba framework (download) and showed how it can be used to turn simple tab-delimited files or command-line tools into web resources with a bare minimum of effort. In this post I will show how easy it is to configure multiple databases or tools to run under the same Mamba server and how to make them accessible as a single web resource.

To illustrate this, I will take the Instances database and the Motifs tool and turn them into a web resource called ELM (the name of the database from which the instance data and motifs were obtained in the first place). The following inifile is all it takes to do so:

[SERVER]
host : localhost
port : 8080
plugins : ./greenmamba

[Instances]
database : greenmamba/examples/instances.tsv

[Motifs]
command : greenmamba/examples/motifs.pl $motif @fasta
page_home : greenmamba/examples/motifs_home.html

[ELM]
tools : Motifs; Instances;

The [SERVER] section is exactly as in all the previous examples, instructing the Mamba server to run on localhost port 8080 and to import the GreenMamba plugin. The [Instances] section configures a simple database called Instances based on the tab-delimited file instances.tsv, and the [Motifs] section configures a web tool called Motifs that runs the Perl script motifs.pl. These two sections are unchanged compared to the previous blog posts and have here simply been put into one inifile, which is how one hosts multiple databases or tools under the same Mamba server. The last section, [ELM], is the only new part. It instructs GreenMamba to configure a metatool called ELM that combines the two tools Motifs and Instances.

Starting the Mamba server with this inifile and accessing http://localhost:8080/HTML/ELM yields the following web interface:

As you can see, what used to be a tool called Motifs has now become a tab within the resource ELM that shows the same (customized) input form. Similarly, the database Instances has become a tab within the same resource:

If you press the submit button for Motifs or Instances, you will get output that is formatted as it was when using Motifs and Instances as separate resources, the only change being that the header says ELM. In the next blog post, I will show how the design of GreenMamba web resources can be further customized and how design changes are consistently applied throughout all the individual tools that make up the metatool.

Resource: Turning an Excel sheet into a web-accessible database with GreenMamba

Anyone who has worked with computational biology for many years will be familiar with the following situation: from collaborators you have received an Excel spreadsheet, which is generously referred to as a “database”, and you now need to make the data accessible to the world. One could obviously simply provide the file for download; however, it would be much preferred if the data could be searched through a simple web interface.

This is not a particularly difficult job, but it is a fair amount of work. Typically you would need to set up a database (be that an SQL database or something else), write a CGI script that queries the database and formats the result as an HTML table, and spend some time on web design to make the input and output pages look aesthetically pleasing. It all takes time that you would probably rather spend on something more productive. Consequently this is often not done at all, and data sets that might be of value to others are thus never made available.

One of the key features of the GreenMamba project (see previous blog post on the topic) is to make it as easy as possible to turn any regular Excel spreadsheet into a web database with nearly no work involved. In fact, all it takes is the following four steps:

  1. Download and unpack Mamba.
  2. Save your spreadsheet in tab-delimited format with column names in the first line.
  3. Add the following two lines to your .ini file (the section name is your choice):
    [MyDatabase]
    database : my_spreadsheet.tsv
  4. Start the Mamba server (./mambasrv my_database.ini)

To exemplify this, we downloaded the complete list of 1743 known instances of Eukaryotic Linear Motifs from the ELM database. The following inifile is all it takes to turn the resulting tab-delimited file into a simple web-accessible database:

[SERVER]
host : localhost
port : 8080
plugins : ./greenmamba

[Instances]
database : greenmamba/examples/instances.tsv

The [SERVER] section specifies the host and port on which the Mamba web server runs, and the plugins variable specifies the directory from which to load the plugins that enable the GreenMamba framework; it should always be set as shown here. The [Instances] tag specifies the name of the database, and the database variable points to the tab-delimited version of the spreadsheet. After starting the Mamba server, you can access http://localhost:8080/HTML/Instances to see the following query interface (here shown with a query):

Upon submitting the query, GreenMamba retrieves all lines that match the search criteria and formats them as an output page:

One could set up a nicer and simpler version of the database by filtering the tab-delimited file a bit. For example, one might want to leave out the columns ELMType (which is redundant with ELMIdentifier), Accessions, InstanceLogic, Evidence, PDB, and Organism (which is redundant with ProteinName) and rename ELMIdentifier to ELM and ProteinName to Protein. This would result in a simpler query form and a more concise results table. Doing this is left as an exercise for the interested reader.

Resource: Turning databases and tools into web resources with GreenMamba

Today, the users of bioinformatics databases and tools increasingly rely on being able to access them through web interfaces. Almost all major databases and most of the commonly used tools can be accessed in this manner, which is mostly good news from the users' perspective. However, in my experience from teaching on numerous courses, many of these users have never worked with a command line and thus typically run into a wall the moment they have to do anything slightly more specialized than, for example, running a BLAST search or making a multiple alignment.

The reason for this is simple: specialist tools and databases are typically not made available through user-friendly web interfaces, because they have too few users to make it worthwhile to create such an interface. Worse yet, the tools are in many cases not even distributed, because the many dependencies and lack of documentation would result in too many questions if one were to distribute them. Consequently, almost every bioinformatician I have spoken to about this has one or more resources that they are currently not sharing – not because they are unwilling to share, but because sharing would imply too much extra work. To address this problem, we have developed a web server that allows you to easily wrap existing databases and tools with a web interface like the one shown below.

In my group we are involved in the development and maintenance of many bioinformatics web resources, and I have thus been pushing the development of a reusable infrastructure. The result of this is the Python framework Mamba, which has primarily been developed by Sune Frankild and myself. Briefly, Mamba is a network-centric, multi-threaded queuing system that deals with the many technical aspects related to network communication with the clients and server-side resource management. All the specific work pertaining to a resource is done by modules that run under the Mamba server. GreenMamba is one such Mamba module, which based on a simple configuration file can provide a complete web interface around a tab-delimited data file or a command-line tool.

It is thus with great pleasure that we can now release the first version of the Mamba queuing system and the GreenMamba wrapper under the BSD license. We hope that eliminating most of the work involved in setting up bioinformatics web resources will encourage people to make available data sets and tools that were hitherto not worth the time and effort to set up.

Over the next days and weeks, I plan to publish a series of blog posts that illustrate how one can use this framework to wrap a web interface around existing databases and command-line tools with practically no work. Impatient people are welcome to download the software and look in the greenmamba/examples directory.