Analysis: Half of published URLs are dysfunctional a decade later

As a small aside when setting up a local mirror of Medline, I extracted 15,915 URLs that were mentioned in the abstracts. Checking them revealed that 12,354 of them (78%) were functional, which may not seem that bad. However, plotting the percentage of dysfunctional URLs as a function of publication year reveals a less pleasant trend:

[Figure: Dysfunctional URLs, percentage by publication year]

After just 10 years, half of all published URLs are no longer functional, and do not redirect to the new location of the service (if one exists). The fairly high success rate overall is merely a consequence of most URLs having been published within the last few years. Unless the persistence of URLs is improving (which I see no sign of in the plot), we can thus expect to have thousands of URLs in the published literature that are no longer valid.
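For readers who want to repeat this kind of check, the analysis above can be sketched roughly as follows. This is a minimal sketch, not the script used for the post: the URL pattern (http and https only), the definition of "functional" (an HTTP status below 400 after following redirects), and the names `extract_urls`, `is_functional` and `dysfunctional_by_year` are all assumptions made for illustration.

```python
import http.client
import re
import urllib.request
from collections import defaultdict

# Assumption: only http(s) URLs are considered; ftp:// and bare "www." links are ignored.
URL_PATTERN = re.compile(r"""https?://[^\s<>"')\]]+""")


def extract_urls(abstract):
    """Return the URLs found in an abstract, with trailing punctuation stripped."""
    return [url.rstrip(".,;:") for url in URL_PATTERN.findall(abstract)]


def is_functional(url, timeout=10.0):
    """Assumed definition: the URL answers with an HTTP status below 400."""
    request = urllib.request.Request(url, headers={"User-Agent": "url-decay-check"})
    try:
        # urlopen follows redirects and raises HTTPError (an OSError) for 4xx/5xx.
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status < 400
    except (OSError, ValueError, http.client.HTTPException):
        return False


def dysfunctional_by_year(records, check=is_functional):
    """Map publication year to the percentage of dysfunctional URLs.

    records is an iterable of (year, abstract_text) pairs; check is the
    function that decides whether a single URL is functional.
    """
    total = defaultdict(int)
    broken = defaultdict(int)
    for year, abstract in records:
        for url in extract_urls(abstract):
            total[year] += 1
            if not check(url):
                broken[year] += 1
    return {year: 100.0 * broken[year] / total[year] for year in sorted(total)}
```

Passing the checker as a parameter makes it possible to try the aggregation without touching the network, for example `dysfunctional_by_year([(2000, "See http://a.example/db.")], check=lambda url: False)`. Note that a URL that responds but now serves unrelated content would still count as functional under this definition.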

Edit: Andrew Lang pointed out a similar study of URLs cited in communications journals.

Edit: Duncan Hull pointed out a paper on URL decay in Medline by Jonathan Wren, which reminded me of an even earlier paper on the topic.

2 thoughts on “Analysis: Half of published URLs are dysfunctional a decade later”

  1. axfelix

    I can’t help but think that the real danger of this is overstated — in most cases, any URL which still contains useful information after a decade has been cached or mirrored elsewhere such that it’s usually a <30 second endeavour to locate the new address. It's certainly more troubling from a systematic data mining perspective, but how often does one perform empirical analysis on a collection of scattershot URLs rather than a single database? The only domain that really suffers is manual indexing, and we oughtn't be doing that except in select circumstances these days, anyway.

  2. Lars Juhl Jensen Post author

    Valid point. It is obviously difficult to quantify for how many of the broken URLs a human being would be able to find the new location of the content, and even harder to quantify how much of the content is no longer available anywhere on the web.

    I made a quick and very simple attempt at this by picking ten random broken URLs from 2000 and earlier. Starting from the PubMed abstracts, I was able to Google my way to the new location of six of them. My guess is thus that the content can be found for about half of the broken URLs with a bit of effort.

    One additional observation is that the broken URLs that I could manually “resolve” all pointed to online databases, whereas the ones I could not were mostly supplementary information. However, note that this is based on looking at only ten cases.
