Functional Annotation of Proteins

Introduction

Broad knowledge of the biological function of proteins would have immense practical impact on identifying novel drug targets, reducing potential side effects, and finding the molecular causes of disease. Unfortunately, determining protein function experimentally is expensive and time-consuming. To accelerate and guide the experimental process, computational techniques have been developed that use functional information about well-studied proteins to annotate less-studied proteins that are predicted to be similar.

The Problems

We are currently investigating the following problems:

Substructure Matching

We have developed an algorithm called LabelHash to quickly identify substructure matches in a large set of target structures. The algorithm preprocesses the targets to create hash tables that allow partial matches to be looked up in constant time. Complete matches are then computed from these partial matches with a variant of our previous substructure matching algorithm, MASH. Structures are currently represented by C-alpha coordinates labeled with the corresponding residue types. Our nonparametric statistical model assigns a p-value to each match.

The LabelHash algorithm is accessible through a web server and can also be downloaded. The algorithm runs in parallel. Computing the matches for a 5–10 residue motif in the entire non-redundant PDB, filtered at 95% sequence identity, typically takes only a couple of minutes on our 8-core web server.

The matches can be visualized and further analyzed in Chimera with our ViewMatch plugin. Below, a match (in green) is shown superimposed with a motif (in white), while the rest of the matching protein is shown in ribbon representation.
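
The idea of preprocessing the targets into hash tables can be illustrated with a minimal Python sketch. It is not the LabelHash implementation: the key scheme (an unordered pair of residue labels plus a binned C-alpha distance), the function names, and the parameters max_dist and bin_width are assumptions chosen for illustration only. The sketch indexes every residue pair of every target once, after which all target pairs compatible with a motif pair are retrieved with a single dictionary lookup.

```python
from collections import defaultdict
from itertools import combinations
from math import dist

# Assumed input format: a structure is a list of
# (residue_type, (x, y, z)) entries, one per C-alpha atom.


def pair_key(label_a, label_b, distance, bin_width=1.0):
    """Hashable key: unordered label pair plus a quantized distance."""
    first, second = sorted((label_a, label_b))
    return (first, second, int(distance // bin_width))


def build_index(targets, max_dist=12.0, bin_width=1.0):
    """Preprocess all targets into one hash table of residue pairs.

    targets maps a structure id to its list of labeled C-alpha atoms.
    Each table entry records (structure id, residue index i, residue index j).
    """
    index = defaultdict(list)
    for target_id, residues in targets.items():
        for (i, (la, ca)), (j, (lb, cb)) in combinations(enumerate(residues), 2):
            d = dist(ca, cb)
            if d <= max_dist:
                index[pair_key(la, lb, d, bin_width)].append((target_id, i, j))
    return index


def lookup_partial_matches(index, motif_pair, bin_width=1.0):
    """Return all indexed residue pairs that share the motif pair's key.

    This is a single dictionary lookup (expected constant time).
    Pairs whose distance falls just across a bin boundary are missed;
    a real tool would need to handle that, for example by probing
    neighboring bins.
    """
    (la, ca), (lb, cb) = motif_pair
    return index.get(pair_key(la, lb, dist(ca, cb), bin_width), [])


if __name__ == "__main__":
    # Toy coordinates, invented for the example.
    targets = {
        "t1": [("HIS", (0, 0, 0)), ("ASP", (3.5, 0, 0)), ("SER", (0, 7.2, 0))],
        "t2": [("ASP", (1, 1, 1)), ("HIS", (1, 1, 5.8))],
    }
    index = build_index(targets)
    motif_pair = (("HIS", (0, 0, 0)), ("ASP", (0, 3.6, 0)))
    print(lookup_partial_matches(index, motif_pair))
```

In this sketch the partial matches are only candidate residue pairs. Extending them to complete matches of the whole motif, and scoring those matches statistically, are separate steps that the sketch does not attempt.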

Related Publications