The LabelHash algorithm consists of two stages: a preprocessing stage and a matching stage. During the preprocessing stage we build hash tables for a set of target PDB files. This has to be done only once for a given set of targets. During the matching stage we can quickly look up partial matches in these hash tables for almost any motif. The matching algorithm computes complete matches and their statistical significance.
The workflow for using the LabelHash command line programs consists of the following steps:
We will describe these steps in detail below. We have also included a section on batch processing the matching of motifs.
LabelHash relies on an external program, called msms, to compute molecular surfaces. Binaries of this program are available for most common platforms. Please download msms before you start using the LabelHash programs. The main binary is called “msms.<platform.version>.” Rename it to “msms” and put it somewhere in your $PATH. The LabelHash matching program produces files that can be opened in Chimera using our ViewMatch plugin.
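Before running the LabelHash programs, it can be useful to confirm that the renamed msms binary is actually reachable through your $PATH. The following is a minimal sketch using only the Python standard library; the function name find_program is our own and is not part of the LabelHash distribution.

```python
import shutil


def find_program(name="msms"):
    """Return the full path of an executable called `name` if it is on PATH, else None."""
    return shutil.which(name)


if __name__ == "__main__":
    path = find_program("msms")
    if path is None:
        print("msms was not found on PATH; rename the downloaded binary "
              "to 'msms' and put it in a directory listed in PATH.")
    else:
        print("msms found at", path)
```

If the check fails even though the file is in place, verify that the binary has its executable bit set.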
LabelHash works on (gzipped) PDB files, so you need to download some PDB files for motifs, a background data set, and homologs. You need to set the environment variable PDBPATH to the directories where you store the PDB files. If bash is your shell, add something like this to your .bash_profile:
export PDBPATH=${HOME}/labelhash/motifs/pdbfiles:${HOME}/nrpdb
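To illustrate how a colon-separated variable such as PDBPATH can be interpreted, here is a small sketch that splits the value into directories and looks for a file in each of them in turn. The helper names and the "first directory wins" search order are assumptions made for illustration; the LabelHash programs may resolve files differently.

```python
import os


def pdb_search_dirs(pdbpath=None):
    """Split a PDBPATH-style string (colon-separated) into a list of directories."""
    if pdbpath is None:
        pdbpath = os.environ.get("PDBPATH", "")
    # Skip empty entries, which appear if the value has a leading or trailing colon.
    return [os.path.expanduser(d) for d in pdbpath.split(":") if d]


def find_pdb_file(filename, pdbpath=None):
    """Return the first existing path to `filename` in the search directories, or None."""
    for directory in pdb_search_dirs(pdbpath):
        candidate = os.path.join(directory, filename)
        if os.path.isfile(candidate):
            return candidate
    return None
```

Calling pdb_search_dirs() without arguments reads the current value of PDBPATH, which is a quick way to check that the export line in your .bash_profile took effect.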
If you would like to store or update a local copy of the PDB on your machine, then the mknrpdb.py script can help with that. The main purpose of this script is to create a directory with a nonredundant subset of the PDB. The files in the directory it creates are symbolic links, so the additional storage needed is minimal if you already have a local copy of the PDB.
To match a motif, we first need to create LabelHash tables of the PDB files that are to be used as matching targets. First, create a text file with all the PDB files you want to build tables for:
cd ~/nrpdb; \ls -1 *.ent.gz > pdblist.txt
For EC/GO classes, you can copy and paste the filenames from the http://www.pdb.org web site, just before you start downloading them all. The next step is to create XML input files with the create_input.py script:
./create_input.py num_xml_files=10 nrpdb_ ~/nrpdb/pdblist.txt
The option "num_xml_files=10" says that we will build 10 LabelHash tables, each with one tenth of the files listed in pdblist.txt. With the default settings, it is a good idea to keep the number of PDB files per XML file to 600 or fewer. Otherwise, the LabelHash tables become rather large and you may run out of memory. The script will create nrpdb_0.xml, ... , nrpdb_9.xml, which will be used in the last step. There are many other parameters to this script with which you can change the default settings. Run the script without parameters to see what they are. The final step is creating the LabelHash tables:
./create nrpdb_*.xml
After "create" is finished you should now have the files nrpdb_0.lhash3, ... , nrpdb_9.lhash3.
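The guideline of keeping 600 or fewer PDB files per XML file gives a simple way to pick a value for num_xml_files. The sketch below computes the smallest number of tables that respects the limit and shows one way to divide a file list evenly. The function names are hypothetical, and the way create_input.py actually distributes files over the XML files may differ.

```python
import math


def num_xml_files_needed(num_pdb_files, max_per_file=600):
    """Smallest number of XML files such that none holds more than max_per_file entries."""
    return max(1, math.ceil(num_pdb_files / max_per_file))


def split_evenly(items, num_chunks):
    """Split items into num_chunks contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(items), num_chunks)
    chunks, start = [], 0
    for i in range(num_chunks):
        # The first `extra` chunks each take one additional item.
        size = base + (1 if i < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks
```

For example, a pdblist.txt with 6000 entries would need at least 10 XML files under the default limit, while 6001 entries would need 11.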
To match a motif, we first need to create a motif XML file and a matching options XML file. Both can be created with the match_input.py script like so:
./match_input.py prefix=1ady 1ady.pdb 81:ED 83:T 112:RS 130:ED 264:YL 311:RNKQ
This will create the files 1ady-motif.xml and 1ady-options.xml. The script expects the name of a PDB file followed by a number of residues, specified by the residue sequence ID. Each residue sequence ID can optionally be followed by a colon and a number of one-letter residue names, which are taken to be all allowed residue labels for that motif point. If the residue names are omitted, the residue label in the PDB file is used. The values for all the options in 1ady-options.xml can be changed with optional parameters to the match_input.py script. Call the script without parameters to see all options. The motif can be matched against the targets in the LabelHash tables we computed before:
./match 1ady-options.xml 1ady-motif.xml nrpdb_*.lhash3
The matches will by default be saved in 1ady-matches.xml. You can specify a different output file like so:
./match -o myfile.xml 1ady-options.xml 1ady-motif.xml nrpdb_*.lhash3
The resulting file can be opened with the ViewMatch plugin. If you want to look at any of the XML files used or produced by LabelHash from the command line, use the "xm" script as your viewer.
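The residue specification format accepted by match_input.py (a residue sequence ID, optionally followed by a colon and one-letter residue names) can be illustrated with a small parser. This is only a sketch of the format as described above, not the code used by the script. It assumes that residue sequence IDs are plain integers, and the function name parse_motif_point is our own.

```python
def parse_motif_point(spec):
    """Parse '81:ED' into (81, 'ED') and a bare '83' into (83, None).

    None means that no labels were given, in which case the residue label
    from the PDB file would be used.
    """
    resid, sep, labels = spec.partition(":")
    number = int(resid)  # raises ValueError if the ID is not an integer
    if not sep:
        return number, None
    labels = labels.upper()
    if not labels or not labels.isalpha():
        raise ValueError("expected one-letter residue names after ':' in %r" % spec)
    return number, labels


if __name__ == "__main__":
    for spec in "81:ED 83:T 112:RS 130:ED 264:YL 311:RNKQ".split():
        print(parse_motif_point(spec))
```

Checking the arguments this way before a long matching run can catch typing mistakes such as a missing residue number.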
It is often convenient to match the same motif several times to LabelHash tables with different sets of options, or to match a number of motifs with the same options. Typically, we want to match a motif against a set of homologs and a nonredundant version of the PDB. From these matches, we can then compute the true and false positive rates. There are several scripts that can help you organize your experiments. They are tailored to our workflow, but it should not be too hard to customize them for your needs. The example input files in the “motifs” subdirectory, included in the LabelHash distribution, are used by these scripts. Let us consider the files specific to one motif, 1ady:
We also have a match options file called 7A.xml, which we will use for matching. To create the 1ady motif files and compute the homolog matches, we simply type:
./preprocess.py ./motifs ./motifs/7A.xml 1ady 81:ED 83:T 112:RS 130:ED 264:YL 311:RNKQ
The first argument tells the script where to find input files and store the output. The second argument is the match options file, and the third argument is the name of the motif. The remaining arguments specify the motif. The file 1ADY.pdb.gz and the PDB files for all the homologs are assumed to be stored in ./motifs/pdbfiles. After the script is finished, we should now have the following files:
If you call the preprocess script with only the first argument, it will create the corresponding files for all the example motifs. This can take a while.
Each motif can be matched against a number of LabelHash tables for your background data set. If your cluster uses PBS for job scheduling, then you could use the gencreatejobs.py script to generate and submit the PBS scripts for creating the LabelHash tables for your background data set. You can then use genmatchjobs.py to submit jobs that match a number of motifs against your background data set.
After you have produced matches of your motif against the set of homologs and the background data set, you are often interested in finding the true and false positive rates. For this, you can use the postprocess.py script. The postprocess script assumes that matches against the background data set are stored in the file motifname_optionname.xml. So in our 1ady example, the script expects to find a file 1ady_7A.xml. The postprocess.py script is called like so:
./postprocess.py ./motifs 7A .001 1ady
The first argument is the directory where the script can find all the necessary files. The second argument is the name of the options file used for matching. The third argument is the p-value. The last argument is the name of the motif. The script will print the true positive rate and false positive rate for EC/GO homologs (it will print 0's if EC/GO homolog information is not found). It will actually print two tables, one with a pointweight correction (a statistical correction for bias introduced by matching parameters) and one without.
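To make the quantities above concrete, the following sketch shows the file naming convention for background matches and one common way to compute true and false positive rates from p-values. It is not the code of postprocess.py. The definitions used here are assumptions: a homolog counts as a true positive if its match has a p-value at or below the threshold, and a non-homolog background target with such a match counts as a false positive. The pointweight correction is not modeled, and all function names are hypothetical.

```python
def background_matches_filename(motif_name, options_name):
    """File name expected for background matches, e.g. '1ady_7A.xml'."""
    return f"{motif_name}_{options_name}.xml"


def fraction_below(pvalues, total, threshold):
    """Fraction of `total` targets that have a match with p-value <= threshold."""
    if total == 0:
        return 0.0  # mirrors printing 0's when no homolog information is found
    hits = sum(1 for p in pvalues if p <= threshold)
    return hits / total


def positive_rates(homolog_pvalues, num_homologs,
                   background_pvalues, num_background, threshold):
    """Return (true positive rate, false positive rate) at the given p-value threshold.

    The p-value lists contain one entry per matched target. Targets without
    any match are counted only in the totals.
    """
    tpr = fraction_below(homolog_pvalues, num_homologs, threshold)
    fpr = fraction_below(background_pvalues, num_background, threshold)
    return tpr, fpr
```

With a threshold of .001, as in the example command, two homologs with match p-values 0.0001 and 0.5 would give a true positive rate of 0.5 under these definitions.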