
Announcement: From genomes to cells and systems

Later this year Peer Bork, Jeroen Raes, Roland Krause, David Torrents, and I will be organizing the EMBO practical course “Computational biology: From genomes to cells and systems”. It will take place October 14-20 in L’Escala, Girona, Catalonia.

In times when high-throughput data are the norm rather than the exception, computational skills to turn masses of data into tangible biological insights have become crucial. This course will teach advanced computational methods for analysis of high-throughput data in molecular biology, covering both inter-individual and inter-species variation in (meta-)genomes and linking it to clinical applications. The course will span protein and pathway level variation from single genomes to entire microbial communities.

To participate in this course, fill in the online application form by July 31, 2012 at the latest. The registration fee is 250 euros for participants from academia and 600 euros for participants from industry.

Editorial: Open KPGP – when open means closed

As a computational biologist, I can only be excited about the Personal Genome Project (PGP). What is especially exciting about this particular project is that they release all data under the Creative Commons Zero public domain dedication, which gives everyone complete freedom to use the data as they wish.

The bad news is that there is still only sequence data available for 16 individuals. I was thus thrilled to see the announcement late last year that the Open Korean Personal Genome Project (KPGP) had released data on another 32 individuals. I was a bit mystified why the data were not available for download from the main PGP web site, though.

When I went to the KPGP web site, which has the very promising URL opengenome.net, I was greeted with this message:

However, the moment I tried to download any data, I was faced with the following long legal agreement, which I had to accept to get any further:

1) Data Type
Every derived genetic information should be approved by relevant facility board.
1. Genetic information data- Individual’s sequenced DNA and analyzed data.
2. Clinical information data- Clinical information does not include family tree, Phenotype and family medical history.
2) The Commission Process
1. Bioethics committee of Genome Research Foundation will make a decision through policy reviews and case consultation.
2. The Standards Commission
This commission should be controlled by Korea National Institute for Bioethics Policy. Research, associate with potential social risks, eugenical problem and discrimination on the basis of genetic information when it comes to any aspects of physical looking, should be forbidden.
3. The Commission Process
Research project and IRB document, approved in each countries, will be required. If there is no provision for IRB approval, User must agree with additional consent documents that embodies the purpose of the data.
4. Evaluation (It will take at least one week)
3) Policy Agrement
Informed consent shall be documented by the use of a written consent form approved by the IRB, and signed by the subject or the subject’s legally authourized representative. And if necessary, the committee may request, require or otherwise obtain detailed investigation.
Data Release
Any additional costs to the subject that may result from participation in the research.
4) Data Source Agrement
The Genome Research Foundation should review and approve specifying the conditions under which data may be accepted, and ensuring adequate provisions to protect the privacy of subjects and maintain the confidentiality of data. To cite the data source in any publications or research based upon these data, and to provide a copy of any publications, the following citation should be included in any research reports, papers, or publications based on these data: Produced and distributed data should have references in Acknowledgement, Methods, Abstract.
5) Genetic Data Access Use Agreement
1. To use the data set solely for statistical reporting and analysis.
2. Not to share these data with, or provide copies of these data to, any other person or organization. Genetic data user will not use for commercial interests or potential commercialization of the results bring troubling ethical aspects the suggest greater potential abuses than clinical benefits.
3. To make no attempt to link this data set with individually identifiable records from any source, or in any other way attempt to identify the persons in this or other datasets.
4. Personal data will neither be disclosed to any exterior third parties nor be used for any other purposes.
5. That if the identity of any person or establishment in this data set is inadvertently discovered, then (a) no use will be made of this knowledge, (b) the Director of Genome Research Foundation will be advised of this incident immediately (c) the information that would identify any individual or establishment will be safeguarded or destroyed, as requested by Genome Research Foundation, and (d) no one else will be informed of the discovered identity.
6. To return or destroy the data set, and any derivative data files, upon request from Genome Research Foundation.
7. This agrement is contingent upon the approved Genome Research Foundation, and is subject to all the requirements of that agreement.

For those who cannot or do not want to read (poorly formatted and phrased) legalese, this is the polar opposite of open. It explicitly forbids redistribution, commercial use, and re-identification of individuals. It even goes as far as requiring that if I use the data in a publication, I must cite KPGP in the abstract. In other words, it is closed.

To add insult to injury, I subsequently filled in a form to request access and waited for weeks for someone to grant me an account, only to discover that I cannot download the data even when logged in with said account. Instead, the web site asks me to go through the same approval procedure again.

Analysis: 10butnotMe

About five years ago George Church announced the Personal Genome Project (PGP). A very interesting aspect of this project is that all data are released under the Creative Commons Zero waiver. This includes not only the genetic data, but also some medical information and even the identity of each individual.

Although PGP has enrolled more than a thousand individuals, it is presently only possible to download data on ten individuals. It is obviously pointless to attempt to link genotype to phenotype based on such a small number of individuals. However, I wondered if any meaningful structure would emerge if I calculated the Hamming distances for all pairs of individuals, that is, the number of SNPs by which they differ (download).

No sooner said than done. I downloaded all available SNP data from PGP (including array and exome sequencing data), calculated all pairwise SNP distances, and visualized the results as a heatmap along with the faces of the individuals:

Number of SNP differences between PGP10 individuals
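
The pairwise distance calculation described above can be sketched in Python roughly as follows. This is a minimal illustration and not the script used to produce the figure: the function names, the dictionary layout, and the toy genotypes are all assumptions, as is the choice to compare only SNPs typed in both individuals (one plausible way to handle array and exome data that cover different sites).

```python
from itertools import combinations


def snp_distance(a, b):
    """Count SNPs typed in both individuals whose genotypes differ.

    Assumption: SNPs missing from either individual are skipped.
    """
    shared = a.keys() & b.keys()
    return sum(1 for snp in shared if a[snp] != b[snp])


def pairwise_distances(genotypes):
    """Return {(id1, id2): distance} for every unordered pair of individuals."""
    return {
        (p, q): snp_distance(genotypes[p], genotypes[q])
        for p, q in combinations(sorted(genotypes), 2)
    }


# Toy genotypes invented for illustration only (not PGP data).
toy = {
    "A": {"rs1": "AA", "rs2": "AG", "rs3": "CC"},
    "B": {"rs1": "AA", "rs2": "GG", "rs3": "CT"},
    "C": {"rs1": "AG", "rs2": "GG"},
}
distances = pairwise_distances(toy)
for pair, d in distances.items():
    print(pair, d)
```

Skipping unshared SNPs keeps individuals with sparse data from looking artificially similar or dissimilar, but it also means the counts for different pairs are based on different numbers of sites, so a real analysis might prefer to normalize by the number of shared SNPs.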

Individual #10 stands out as being genetically most dissimilar from everyone else, which is unsurprising as he is the only African American in the study. I next tried to similarly define the genetically most average individual, that is, the individual who is most similar to everyone else. If one defines this as the individual with the lowest sum of differences, the answer is individual #7. However, because the origins of his grandparents are unknown, it is difficult to conclude anything interesting based on this.
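
The "lowest sum of differences" definition can be made concrete with a short Python sketch. The distance values and individual labels below are invented for illustration and are not the PGP results; `summed_distances` is a hypothetical helper name.

```python
def summed_distances(distances):
    """Sum, for each individual, the distances to all other individuals.

    `distances` maps unordered pairs (id1, id2) to a SNP difference count.
    """
    totals = {}
    for (p, q), d in distances.items():
        totals[p] = totals.get(p, 0) + d
        totals[q] = totals.get(q, 0) + d
    return totals


# Toy pairwise distances, made up for this example.
toy_distances = {("A", "B"): 5, ("A", "C"): 3, ("B", "C"): 4}

totals = summed_distances(toy_distances)
most_average = min(totals, key=totals.get)     # lowest summed distance
most_dissimilar = max(totals, key=totals.get)  # highest summed distance
print(totals, most_average, most_dissimilar)
```

With these toy numbers the totals are 8 for A, 9 for B, and 7 for C, so C is the most average individual and B the most dissimilar one. With only ten individuals, such a ranking depends heavily on who happens to be in the sample.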