In this exercise you will be processing two sets of documents on prostate cancer and schizophrenia, respectively. You will use a program called tagdir along with a pre-made dictionary of human protein names to identify proteins relevant for each disease and to find a protein that links the two diseases to each other.
The software and data files required for this exercise are available as a tarball. To run the exercise you will need to know how to compile a C++ program; the program also requires the Boost C++ Libraries.
How to run the tagger
To execute the tagger program, you need to run a command with the following format:
tagdir dictionary_file blacklist_file documents_directory > matches_file
The argument dictionary_file should be the name of a tab-delimited file containing the dictionary of names to be tagged in the text. For this exercise you will always want to use the file human_proteins.tsv.
The argument blacklist_file should be the name of a tab-delimited file containing the exact variants of names that should not be tagged despite the name being in the dictionary. To tag every variant of every name, you can simply use an empty file (empty.tsv), but you will also be making your own file later as part of the exercise.
The argument documents_directory should be the name of a directory containing text files with the documents to be processed. For this exercise two such directories will be used, namely prostate_cancer_documents and schizophrenia_documents.
Finally, you give the name of the file that should receive the tab-delimited output from the tagger (matches_file). You can call these files whatever you like, but descriptive names will make your life easier, for example prostate_cancer_matches.tsv.
When running the tagger on one of the two directories of documents, you should expect the run to take approximately 2 minutes.
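As a concrete illustration, the command format above can be wrapped in a small shell function. This is only a sketch: the function name tag_documents is made up for this example, and it assumes that tagdir is on your PATH (otherwise write ./tagdir) and that the dictionary, blacklist, and document directories sit in the current directory. The output name follows the naming pattern suggested above.

```shell
# Hypothetical helper (not part of the tarball): tag one set of documents.
#   $1 = disease prefix, e.g. prostate_cancer or schizophrenia
#   $2 = blacklist file, e.g. empty.tsv
tag_documents() {
    tagdir human_proteins.tsv "$2" "${1}_documents" > "${1}_matches.tsv"
}

# Example calls (each run takes roughly two minutes):
#   tag_documents prostate_cancer empty.tsv
#   tag_documents schizophrenia empty.tsv
```

Typing the tagdir command out by hand works just as well; the function only saves retyping when you rerun the tagger with different blacklists.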
The tagger program requires its input to be in very specific formats. The dictionary_file must be a tab-delimited file with two columns.
The first column must be a number that uniquely identifies the protein in question. In human_proteins.tsv the numbers used are the ENSP identifiers from Ensembl, with the letters ENSP and any leading zeros removed. The second column is a name for the protein. If a protein has multiple names, there will be several lines, each listing the same number (i.e. the same protein) but a different name for it.
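If you want to look a protein up in Ensembl, you need the full identifier rather than the bare number. The sketch below restores it, assuming the numeric part is zero-padded to 11 digits, which is what the identifier ENSP00000262120 used later in this exercise suggests. The variable names are made up for the example.

```shell
# Restore the ENSP prefix and zero-pad the number to 11 digits.
number=262120
ensp_id=$(printf 'ENSP%011d' "$number")
echo "$ensp_id"

# The reverse direction: strip the prefix and any leading zeros.
stripped=$(printf '%s\n' "$ensp_id" | sed 's/^ENSP0*//')
echo "$stripped"
```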
The blacklist_file must similarly be a tab-delimited file in which each line gives one exact name variant and whether it should be blocked or not:
ACP	t
ACS	t
act	t
ACTH	f
activin	f
The letter “t” in the second column means TRUE (i.e. that the variant should be blocked) and “f” means FALSE (i.e. that it should not be blocked). The file above would thus block ACP, ACS, and act but not ACTH and activin. Because variants are by default not blocked, you in principle only need to add lines with “t”; however, the lines with “f” can be very useful for keeping track of which variants you have already looked at and actively decided not to block, as opposed to variants that are unblocked only because you have not yet looked at them.
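Since the tagger is strict about its input format, it can help to check a blacklist before using it. The following sketch writes the five variants discussed above to a file with real TAB characters and then lists the blocked variants, the reviewed-but-kept variants, and any malformed lines. The file name my_blacklist.tsv and the variable names are arbitrary choices for this example.

```shell
workdir=$(mktemp -d)

# One line per variant: name variant, TAB, then t (block) or f (keep).
printf '%s\t%s\n' ACP t ACS t act t ACTH f activin f > "$workdir/my_blacklist.tsv"

# Variants that will be blocked (second column is "t").
blocked=$(awk -F'\t' '$2 == "t" { print $1 }' "$workdir/my_blacklist.tsv")

# Variants that were reviewed and deliberately kept (second column is "f").
kept=$(awk -F'\t' '$2 == "f" { print $1 }' "$workdir/my_blacklist.tsv")

# Lines without exactly two columns, or with something other than t or f.
malformed=$(awk -F'\t' 'NF != 2 || ($2 != "t" && $2 != "f")' "$workdir/my_blacklist.tsv")

echo "blocked:" $blocked
echo "kept:" $kept
echo "malformed lines: ${malformed:-none}"
```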
The output of the tagger is also tab-delimited, with each line specifying one possible meaning of a match in a document:
7478532.txt 254 256 TSG 262120
7478532.txt 254 256 TSG 416324
7478532.txt 595 597 LPL 309757
7478532.txt 595 597 LPL 315757
7478532.txt 658 661 NEFL 221169
7478532.txt 736 738 LPL 309757
7478532.txt 736 738 LPL 315757
The first column gives the name of the file (in our case the numbers are the PubMed IDs of the abstracts). The next two columns give the positions of the first and last characters of the match. The fourth column shows the string that appeared at that place in the document. The fifth column specifies which protein this could be (i.e. the Ensembl protein number). The first line in the output above thus means that document 7478532.txt says TSG from character 254 to character 256, which could mean the protein ENSP00000262120. The second line shows that it could alternatively be ENSP00000416324.
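Because one matched string can produce several output lines, it is sometimes useful to see how many candidate proteins each match has. The sketch below rebuilds the seven example lines with real TAB separators and counts the candidates per matched span. The file and variable names are made up for the example.

```shell
workdir=$(mktemp -d)
tab=$(printf '\t')

# The seven example lines from above: file, start, end, string, protein number.
printf '%s\t%s\t%s\t%s\t%s\n' \
    7478532.txt 254 256 TSG  262120 \
    7478532.txt 254 256 TSG  416324 \
    7478532.txt 595 597 LPL  309757 \
    7478532.txt 595 597 LPL  315757 \
    7478532.txt 658 661 NEFL 221169 \
    7478532.txt 736 738 LPL  309757 \
    7478532.txt 736 738 LPL  315757 \
    > "$workdir/example_matches.tsv"

# Group lines by (file, start, end, string) and count the candidate proteins.
# A fifth output column holds that count; rows are sorted by start position.
span_summary=$(awk -F'\t' '
    { candidates[$1 FS $2 FS $3 FS $4]++ }
    END { for (span in candidates) print span FS candidates[span] }
' "$workdir/example_matches.tsv" | sort -t "$tab" -k2,2n)

echo "$span_summary"
```

In this small example TSG and both occurrences of LPL have two candidate proteins each, while NEFL has only one.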
Making a blacklist
As you will see if you run tagdir with an empty blacklist_file, a simple dictionary-matching approach produces a great many wrong matches unless it is complemented by a good list of name variants to block.
To identify name variants that you might want to put on the blacklist, count the number of occurrences of each name variant in an entire corpus of text. A very helpful UNIX command for doing this based on a matches_file (produced by tagdir) is:
cut -f 4 matches_file | sort | uniq -c | sort -nr | less
It first cuts out the exact name variants in column 4 of your matches file (cut -f 4), sorts them so that multiple copies of the same variant end up right after each other (sort), counts identical adjacent lines (uniq -c), sorts that list in reverse numerical order so that the highest counts come first (sort -nr), and shows the result in a pager so that you can scroll through it (less).
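One thing to be aware of is that the command counts output lines, so a variant that can mean two different proteins is counted twice for every occurrence in the text. If you would rather count each occurrence once, one possible variation is to collapse lines that differ only in the protein number first. The sketch below compares both counts on the seven example lines from the previous section; the file and variable names are arbitrary.

```shell
workdir=$(mktemp -d)

# The seven example output lines, written with real TAB separators.
printf '%s\t%s\t%s\t%s\t%s\n' \
    7478532.txt 254 256 TSG  262120 \
    7478532.txt 254 256 TSG  416324 \
    7478532.txt 595 597 LPL  309757 \
    7478532.txt 595 597 LPL  315757 \
    7478532.txt 658 661 NEFL 221169 \
    7478532.txt 736 738 LPL  309757 \
    7478532.txt 736 738 LPL  315757 \
    > "$workdir/example_matches.tsv"

# Counting as in the command above: one count per output line.
per_line=$(cut -f 4 "$workdir/example_matches.tsv" | sort | uniq -c | sort -nr)

# Variation: keep columns 1-4 only and drop duplicate lines first, so that
# every occurrence in the text is counted exactly once.
per_occurrence=$(cut -f 1-4 "$workdir/example_matches.tsv" | sort -u |
    cut -f 4 | sort | uniq -c | sort -nr)

echo "Per output line:"
echo "$per_line"
echo "Per occurrence in the text:"
echo "$per_occurrence"
```

Here LPL drops from 4 to 2 and TSG from 2 to 1. For picking blacklist candidates either count works; the ranking just shifts slightly in favor of ambiguous names when counting per line.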
The next step is to simply go through the most frequently occurring name variants and manually decide which ones are indeed correct protein names that just occur very frequently, and which ones occur very frequently because they mean something entirely different than the protein. The latter should be put on your blacklist.
Once you have produced a blacklist, you can rerun the tagdir command, and you will get much less output because vast numbers of false-positive matches have been filtered away. You will find that adding just the worst few offenders to the blacklist helps a great deal; however, as you go further down the list, the effort spent on inspecting more names gives diminishing returns.
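When you extend a blacklist over several rounds, it saves time to list only the frequent variants you have not yet reviewed, that is, those that appear in the blacklist with neither “t” nor “f”. The sketch below shows one way to do this. It assumes the blacklist is matched on the exact variant string, and the single blacklist entry is a placeholder for illustration, not a recommendation.

```shell
workdir=$(mktemp -d)

# Placeholder blacklist with one already-reviewed variant.
printf '%s\t%s\n' LPL f > "$workdir/my_blacklist.tsv"

# Example matches in the tagger's five-column output format.
printf '%s\t%s\t%s\t%s\t%s\n' \
    7478532.txt 254 256 TSG  262120 \
    7478532.txt 254 256 TSG  416324 \
    7478532.txt 595 597 LPL  309757 \
    7478532.txt 595 597 LPL  315757 \
    7478532.txt 658 661 NEFL 221169 \
    > "$workdir/example_matches.tsv"

# Remember every variant listed in the blacklist (first file), then count
# only those matched variants (second file, column 4) that are not listed.
unreviewed=$(awk -F'\t' '
    FILENAME == ARGV[1] { reviewed[$1] = 1; next }
    !($4 in reviewed)   { print $4 }
' "$workdir/my_blacklist.tsv" "$workdir/example_matches.tsv" |
    sort | uniq -c | sort -nr)

echo "$unreviewed"
```

Only TSG and NEFL remain in the list, because LPL has already been looked at.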
Finding proteins associated with a disease
Making a very good blacklist would take too long for this exercise. For the following exercise it is thus recommended that you use the provided file blacklist_10.tsv.
The goal is now to produce a top-20 list of the proteins that are most often mentioned in papers about prostate cancer. To achieve this, you need to rerun the tagging of human proteins in the prostate cancer documents, making use of the above-mentioned blacklist. To extract a top-20 list, you can use a command similar to the one used to create the list of most commonly appearing name variants.
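One possible shape for such a command is sketched below on a handful of example lines rather than on the real matches file. It counts on column 5 (the protein number) instead of column 4 and keeps the first 20 lines. Since "most often mentioned" can be read in two ways, the sketch shows both a count of mentions and a count of documents; which one you prefer is a judgment call. File and variable names are made up.

```shell
workdir=$(mktemp -d)

# A few example lines in the tagger's five-column output format.
printf '%s\t%s\t%s\t%s\t%s\n' \
    7478532.txt 254 256 TSG  262120 \
    7478532.txt 254 256 TSG  416324 \
    7478532.txt 595 597 LPL  309757 \
    7478532.txt 595 597 LPL  315757 \
    7478532.txt 658 661 NEFL 221169 \
    7478532.txt 736 738 LPL  309757 \
    7478532.txt 736 738 LPL  315757 \
    > "$workdir/example_matches.tsv"

# Reading 1: total number of mentions per protein number.
top_by_mentions=$(cut -f 5 "$workdir/example_matches.tsv" |
    sort | uniq -c | sort -nr | head -n 20)

# Reading 2: number of documents mentioning each protein at least once
# (columns 1 and 5, with duplicate document/protein pairs removed).
top_by_documents=$(cut -f 1,5 "$workdir/example_matches.tsv" | sort -u |
    cut -f 2 | sort | uniq -c | sort -nr | head -n 20)

echo "$top_by_mentions"
echo "$top_by_documents"
```

On the real data you would replace the example file with your own matches file for the prostate cancer documents.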
Find a protein linked to both diseases
Starting from the top-20 list of prostate cancer proteins, find one or more proteins that are also frequently mentioned in papers on schizophrenia. To do so, you will also need to tag the set of documents on schizophrenia, count the number of mentions of every protein, and check the schizophrenia count of every protein on the prostate cancer top-20 list.
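The comparison can be done by eye, but a short script makes it less error-prone. The sketch below works on two tiny made-up matches files (the document names and match positions are invented and carry no biological meaning) and prints, for every protein in the first top-20 list, its count in the second set if it occurs there. The helper name count_proteins and all file names are hypothetical.

```shell
workdir=$(mktemp -d)

# Made-up toy matches in the tagger's five-column output format.
printf '%s\t%s\t%s\t%s\t%s\n' \
    a.txt 0  2  LPL  309757 \
    a.txt 10 12 LPL  309757 \
    b.txt 5  8  NEFL 221169 \
    > "$workdir/prostate_toy.tsv"
printf '%s\t%s\t%s\t%s\t%s\n' \
    c.txt 3  6  NEFL 221169 \
    c.txt 20 23 NEFL 221169 \
    d.txt 1  3  TSG  262120 \
    > "$workdir/schizophrenia_toy.tsv"

# Hypothetical helper: print "count<TAB>protein number", highest count first.
count_proteins() {
    cut -f 5 "$1" | sort | uniq -c | sort -nr | awk '{ print $1 "\t" $2 }'
}

count_proteins "$workdir/prostate_toy.tsv" | head -n 20 > "$workdir/top20.tsv"
count_proteins "$workdir/schizophrenia_toy.tsv" > "$workdir/schizophrenia_counts.tsv"

# For each top-20 protein that also occurs in the second set, print:
# protein number, count in the first set, count in the second set.
shared=$(awk -F'\t' '
    FILENAME == ARGV[1] { other[$2] = $1; next }
    ($2 in other)       { print $2 "\t" $1 "\t" other[$2] }
' "$workdir/schizophrenia_counts.tsv" "$workdir/top20.tsv")

echo "$shared"
```

With the toy data only protein number 221169 appears in both sets. On the real data, sorting the result by the last column brings the proteins most often mentioned in the schizophrenia papers to the top.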