Resource: Turning an Excel sheet into a web-accessible database with GreenMamba

Anyone who has worked with computational biology for many years will be familiar with the following situation: from collaborators you have received an Excel spreadsheet, which is generously referred to as a “database”, and you now need to make the data accessible to the world. One could obviously simply provide the file for download; however, it would be much preferred if the data could be searched through a simple web interface.

This is not a particularly difficult job, but it is a fair amount of work to do. Typically you would need set up a database (be that an SQL database or something else), write a CGI script that queries the database and formats the result as an HTML table, and spend some time on web design to make the input and output pages look aesthetically pleasing. It all takes a lot of time that you would probably rather spend on doing something more productive. Consequently this is often not done at all, and data sets that might be of value to others are thus never made available.

One of the key features of the GreenMamba project (see previous blog post on the topic) is to make it as easy as possible to turn any regular Excel spreadsheet into a web database with nearly no work involved. In fact, all it takes is the following four steps:

  1. Download and unpack Mamba.
  2. Save your spreadsheet in tab-delimited format with column names in the first line.
  3. Add the following two lines to your .ini file:
    [NameOfDatabase]
    database : my_spreadsheet.tsv
  4. Start the Mamba server (./mambasrv my_database.ini)

To exemplify this, we downloaded the complete list of 1743 known instances of Eukaryotic Linear Motifs from the ELM database. The following inifile is all it taks to turn the resulting tab-delimited file into a simple web-accessible database:

[SERVER]
host : localhost
port : 8080
plugins : ./greenmamba

[Instances]
database : greenmamba/examples/instances.tsv

The [SERVER] tag specifies the host of the computer where the mamba web server actually runs and the plugins variable specifies where to load the plugins that enable the whole green-mamba framework and should always be set to this to work. The [Instances] tag specifies the name of the database and the database points to the tab-delimited version of the spreadsheet. After starting the mamba server you can access http://localhost:8080/HTML/Instances and to see the following query interface (here shown with a query):

Upon submitting the query, GreenMamba retrieves all lines that match the search criteria and formats them as an output page:

One could set up a nicer and simpler version of the database by filtering the tab-delimited file a bit. For example, one might want to leave out the columns ELMType (which is redundant with ELMIdentifer), Accessions, InstanceLogic, Evidence, PDB, and Organism (which is redundant with ProteinName) and rename ELMIdentifier to ELM and ProteinName to Protein. This would result in a simpler query form and a more concise results table. Doing this is left as an exercise for the interested reader.

4 thoughts on “Resource: Turning an Excel sheet into a web-accessible database with GreenMamba

  1. Kay at Suicyte

    Looks easy enough, but I don’t get the idea what this should be good for. The same thing can be easily accomplished by Excel’s autofilter functionality. And if you don’t like excel, you can do the same with a single line of awk. Or am i missing something ?

    Reply
    1. Lars Juhl Jensen Post author

      I think what you might be missing is who the user of the resulting web database would be. It is not the people creating the data sets – they are typically happy filtering their data in Excel. It is also not the bioinformaticians mining the data, who would typically use AWK, Perl etc. to work on a tab-delimited file. Typically it would be the readers of the paper, who want to look up their favorite gene or protein.

      Also, I’m not sure if you have read the subsequent blog posts on this topic. The main purpose of the database functionality in GreenMamba is to be able to combine small tools and databases within a single web resource.

  2. Kay at Suicyte

    Also, I’m not sure if you have read the subsequent blog posts on this topic No, I hadn’t, but now I have :-) This green Mamba thing is clearly an interesting idea, but (being the usual pessimist) I am still skeptc that it will catch on. The reason for that is not the excel feature, though. Just in case you care, here is what I like and dislike (judging by your blog posts, I was too lazy to actually try it out)

    Excel display: It clearly looks nicer than csv and is easier to use than csv for the general public. I am not sure that it is more convenient than Excel auto filter. Moreover, I am afraid that more and more people will see a ‘single query- single result’ server as a replacement for making their large-scale data publicly available. See e.g. GGbase (https://gygi.med.harvard.edu/ggbase) as a data repository for the recent Gygi paper in Molecular Cell, 44:325-340. The is no downloadable version of the entire dataset, just this server.

    Web interface for command-like programs: This is an excellent idea but I consider it useful only for the most simple cases. There, the mamba pseudocode is much simpler than a full cgi interface. However, if you want to customize the server (using pick lists and such), the necessary efforts gets similar to that required for programming the ‘real thing’. In my time, I have done quite a few web wrappers for CLI programs, typically using perl with CGI.pm. Not the nicest tool, but it offers features like generating html pick lists or radio button sets dynamically based on the data found in a CSV file. I don’t know if this is possible with mamba as well.

    Reply
    1. Lars Juhl Jensen Post author

      I agree with you regarding not making data sets available. However, I should point out that since any GreenMamba resource has a REST interface, you would have been able to easily download the entire data set from GGbase had it been based on GreenMamba.

      I have to disagree with you regarding it being the same about of work to set up a customized resource as it would be to write the CGI script. I have also done that a number of times, and despite that it would still take me much more time to set up something that is even remotely comparable in terms of design.

      More importantly, though, there is more to GreenMamba than what a CGI script would do for you. One issue not discussed much in the blog posts yet is that of resource management. If you set up a web resource with a REST API, you will soon realize what happens when someone hits you hard with a robot (see for example http://string-stitch.blogspot.com/2009/07/100-api-accesses-per-minute-not-good.html). If your underlying script takes any significant amount compute time, you will quickly see the requests build up until your server comes to a grinding hold. In the Mamba framework everything runs in a controlled manner with queuing, maximal number of simultaneous threads, maximum number of simultaneous queued jobs from a single client IP, etc. I need to write up a more technical post about that, but I have not found the time.

Leave a Reply

Please log in using one of these methods to post your comment:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s