Monthly Archives: January 2010

Exercise: Using the STITCH database

The STITCH database contains functional associations among proteins and small molecules.

Try searching STITCH with the human thymidylate synthase (TYMS) protein as input. The resulting network includes several small molecules.

Questions:

  • Can you identify the products of thymidylate synthase among them?
  • Are the reactants also present in the network?

Sometimes the proteins or small molecules that you search for may not be immediately shown by STITCH. To find what you are looking for you may have to extend the network.

Questions:

  • Do the small molecules that were missing in the questions above appear when clicking the Add nodes button?
  • Can you construct a clearer network with fewer interactions by changing the network parameters at the bottom of the page?

Thymidine is required for DNA replication and repair to take place, and inhibition of thymidylate synthase is thus harmful to proliferating cells. Indeed, most of the small molecules in the network are drugs used for chemotherapy.

Questions:

  • Are these drugs structurally similar to each other?
  • Are they similar to the substrate of thymidylate synthase?
  • Can you suggest a mechanism of action?

Analysis: Correlating the PLoS article level metrics

A few months ago, the Public Library of Science (PLoS) made available a spreadsheet with article level metrics. Although others have already analyzed these data (see posts by Mike Chelen), I decided to take a closer look at the PLoS article level metrics.

The data set consists of 20 different article level metrics. However, some of these are very sparse and some are partially redundant. I thus decided to filter/merge these to create a reduced set of only 6 metrics:

  1. Blog posts. This value is the sum of Blog Coverage – Postgenomic, Blog Coverage – Nature Blogs, and Blog Coverage – Bloglines. A single blog post may obviously be picked up by several of these resources and hence be counted more than once. Being unable to count unique blog posts referring to a publication, I decided to aim for maximal coverage by using the sum rather than the data from only a single resource.
  2. Bookmarks. This value is the sum of Social Bookmarking – CiteULike and Social Bookmarking – Connotea. One cannot rule out that a single user bookmarks the same publication in both CiteULike and Connotea, but I would assume that most people use one or the other for bookmarking.
  3. Citations. This value is the sum of Citations – CrossRef, Citations – PubMed Central, and Citations – Scopus. I decided to use the sum to be consistent with the other metrics, but a single citation may obviously be picked up by more than one of these resources.
  4. Downloads. This value is called Combined Usage (HTML + PDF + XML) in the original data set and is the sum of Total HTML Page Views, Total PDF Downloads, and Total XML Downloads. Again the sum is used to be consistent.
  5. Ratings. This value is called Number of Ratings in the original data set. Because of the small number of articles with ratings, notes, and comments, I decided to discard the related values Average Rating, Number of Note threads, Number of replies to Notes, Number of Comment threads, Number of replies to Comments, and Number of ‘Star Ratings’ that also include a text comment.
  6. Trackbacks. This value is called Number of Trackbacks in the original data set. I was greatly in doubt whether to merge this into the blog post metric, but in the end decided against doing so because trackbacks do not necessarily originate from blog posts.
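
The merging described in the list above could be sketched roughly as follows. This is only an illustration, not the program used for the analysis: the column names are taken from the descriptions above (written here with plain hyphens), the exact headers in the PLoS spreadsheet may differ, and treating empty cells as zero is an assumption.

```python
# Mapping from each reduced metric to the raw columns it is built from.
# Column names follow the list above; the real spreadsheet headers are an assumption.
REDUCED_METRICS = {
    "Blog posts": [
        "Blog Coverage - Postgenomic",
        "Blog Coverage - Nature Blogs",
        "Blog Coverage - Bloglines",
    ],
    "Bookmarks": [
        "Social Bookmarking - CiteULike",
        "Social Bookmarking - Connotea",
    ],
    "Citations": [
        "Citations - CrossRef",
        "Citations - PubMed Central",
        "Citations - Scopus",
    ],
    "Downloads": ["Combined Usage (HTML + PDF + XML)"],
    "Ratings": ["Number of Ratings"],
    "Trackbacks": ["Number of Trackbacks"],
}


def reduce_metrics(row):
    """Collapse one article's raw metrics (a dict) into the six reduced ones.

    Missing or empty cells are counted as zero (an assumption).
    """
    reduced = {}
    for name, columns in REDUCED_METRICS.items():
        reduced[name] = sum(float(row.get(col) or 0) for col in columns)
    return reduced
```

Applied to every row of the spreadsheet, this would give one dictionary of six values per article.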

Calculating all pairwise correlations among these metrics is obviously trivial. However, one has to be careful when interpreting the correlations as there are at least two major confounding factors. First, it is important to keep in mind that the PLoS article level metrics have been collected across several journals. Some of these journals are high impact journals such as PLoS Biology and PLoS Medicine, whereas others are lower impact journals such as PLoS ONE. One would expect that papers published in the former two journals will on average have higher values for most metrics than papers published in the latter. Similarly, papers published in journals with a web-savvy readership, e.g. PLoS Computational Biology, are more likely to receive blog posts and social bookmarks. Second, the age of a paper matters. Both downloads and, in particular, citations accumulate over time. To correct for these confounding factors, I constructed a normalized set of article level metrics, in which each metric for a given article was divided by the average for articles published the same year in the same journal.
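
A minimal sketch of this normalization, assuming each article is a dictionary with hypothetical "journal" and "year" keys plus one key per metric. It is not the original program, and setting the value to 0.0 when the group average is zero is my own assumption:

```python
from collections import defaultdict


def normalize(articles, metrics):
    """Divide each metric by the mean over articles from the same journal and year.

    `articles` is a list of dicts with "journal", "year", and one key per metric.
    If a group mean is zero, the normalized value is set to 0.0 (an assumption).
    """
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for article in articles:
        key = (article["journal"], article["year"])
        counts[key] += 1
        for metric in metrics:
            sums[key][metric] += article[metric]

    normalized = []
    for article in articles:
        key = (article["journal"], article["year"])
        out = {"journal": article["journal"], "year": article["year"]}
        for metric in metrics:
            mean = sums[key][metric] / counts[key]
            out[metric] = article[metric] / mean if mean else 0.0
        normalized.append(out)
    return normalized
```

After this step, a value of 1.0 means that an article is exactly average for its journal and publication year.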

I next calculated all pairwise Pearson correlation coefficients among the reduced set of article level metrics. To see the effect of the normalization, I did this for both the raw and the normalized metrics. I visualized the correlation coefficients as a heat map, showing the results for the raw metrics above the diagonal and the results for the normalized metrics below the diagonal.
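
For readers who want to reproduce this step, the calculation could look roughly like the sketch below. The function names and the dictionary-based data layout are hypothetical, and returning NaN for a constant metric is an assumption.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return float("nan")  # undefined when one metric is constant
    return cov / (sd_x * sd_y)


def combined_matrix(raw, normalized, metrics):
    """Correlations for raw metrics above the diagonal, normalized ones below."""
    k = len(metrics)
    matrix = [[1.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1, k):
            mi, mj = metrics[i], metrics[j]
            matrix[i][j] = pearson([a[mi] for a in raw], [a[mj] for a in raw])
            matrix[j][i] = pearson(
                [a[mi] for a in normalized], [a[mj] for a in normalized]
            )
    return matrix
```

The resulting matrix has the same layout as the heat map described above and can be passed to any plotting tool.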

There are several interesting observations to be made from this figure:

  • Downloads correlate strongly with all the other metrics. This is hardly surprising, but it is reassuring to see that these correlations are not trivially explained by age and journal effects.
  • Bookmarks is the metric that, apart from Downloads, correlates most strongly with Citations. This makes good sense since CiteULike and Connotea are commonly used as reference managers. If you add a paper to your bibliography database, you will likely cite it at some point.
  • Blog posts and Trackbacks correlate well with Downloads but poorly with citations. This may reflect that blog posts about research papers are often targeted towards a broad audience; if most of the readers of the blog posts are laymen or researchers from other fields, they will be unlikely to cite the papers covered in the blog posts.
  • Ratings correlates fairly poorly with every other metric. Combined with the low number of ratings, this makes me wonder if the option to rate papers on the journal web sites is all that useful.

Finally, I will point out one additional metric that I would very much like to see added in future versions of this data set, namely microblogging. I personally discover many papers through others mentioning them on Twitter or FriendFeed. Because of the much smaller effort involved in microblogging a paper as opposed to writing a full blog post about it, I suspect that the number of tweets that link to a paper would be a very informative metric.

Edit: I made a mistake in the normalization program, which I have now corrected. I have updated the figure and the conclusions to reflect the changes. It should be noted that some comments to this post were made prior to this correction.

Editorial: What is the difference between Twitter and Second Life?

I admit that this may seem a strange question. One is a microblogging platform that allows you to read and write messages of at most 140 characters. The other is a 3D virtual world. They are, however, both communication tools, and I think there is a completely different reason why Twitter is so much more useful to me than Second Life is.

If I go to the Twitter web site and log in, I see this:

It is Twitter. It immediately shows me the main content: tweets. It also allows me to create content, that is to tweet.

By contrast, if I go to the Second Life web site and log in, I see this:

It is not Second Life. It is a complex web interface that gives me access to account administration tools and shows me lists of blog posts by Linden Lab, comments from Second Life users, items for sale on Xstreet SL, and video tutorials. In the lower left corner it shows me the only really useful information, namely which of my friends are online. That is, the friends that I would have been able to chat with, had I been in Second Life and not on the web site, which does not allow you to read or write messages.

Imagine if the web interface of Second Life instead showed me this:

It would be Second Life. It would immediately show me the main content: the virtual world. It would also allow me to interact with the content, that is to move around, to chat with people, and even to create content. Do you think Second Life would have more users if it ran inside your web browser? I think so. Linden Lab has been focusing on improving the initial user experience in Second Life to improve the retention rate (i.e. the fraction of new users that continue to come back). I am not saying that this is not important, but I think that most of the potential users are lost long before they even get into the virtual world.

This is by no means a problem that is specific to Second Life. Today, asking users to install a piece of software on their computer will cause the majority of people to shy away before they have tried your product. Even just asking users to create an account will cause many to turn around and walk away. When it comes to social networks, the decisive factor is users. If your friends are not there, why should you be? Imagine a virtual world that would run in your web browser and which you could sign into using OpenID, Twitter Connect, or Facebook Connect. Would your friends be there? Would you?