Functional Annotation of Proteins

Introduction

Broad and extensive knowledge of the biological function of proteins would have immense practical impact on the identification of novel drug targets, the reduction of potential side effects, and the discovery of the molecular causes of disease. Unfortunately, the experimental determination of protein function is an expensive and time-consuming process. To accelerate and guide the experimental process, computational techniques have been developed that transfer functional annotations from well-studied proteins to less-studied proteins that are predicted to be similar.

The Problems

We are currently investigating two problems:

  • Given a substructure that contains information related to the function of a protein (we call such a substructure a motif), can this substructure be matched efficiently to the structure of proteins? If yes, how statistically significant is the match?
  • Given a substructure that contains information related to the function of a protein, can it be refined so that the matches it retrieves are more sensitive and specific?

The MASH pipeline

    To solve the first problem outlined above, we have designed the MASH (Match Augmentation and Statistical Hypothesis Testing) pipeline. MASH can be used to search functionally uncharacterized protein structures (targets) for substructures (matches) that have geometric and chemical similarity to known active sites (motifs). Full details about Match Augmentation can be found in our publications.

    MASH first uses Match Augmentation, a geometric comparison algorithm for identifying substructural geometric and chemical similarity between a protein and a motif. It then employs a hypothesis testing framework to assess the statistical significance of the match.

    Match Augmentation

    Match Augmentation (MA) takes as input a motif S and a target T. MA outputs the match with the smallest LRMSD among all matches that fulfill the criteria. Partial matches correlating subsets of S to T are rejected. By establishing a threshold epsilon for acceptable geometric similarity, the second criterion causes MA to return only matches whose LRMSD is bounded above by epsilon.
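
The selection rule described above can be sketched in a few lines. This is only an illustration of the final filtering step, not the Match Augmentation algorithm itself: it assumes that candidate matches and their LRMSD values have already been computed elsewhere, and the names `Match`, `select_best_match`, `motif_size`, and `epsilon` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Match:
    """Hypothetical record for one candidate correspondence between S and T."""
    matched_points: int  # number of motif points placed in correspondence
    lrmsd: float         # LRMSD of the correspondence (assumed precomputed)


def select_best_match(
    candidates: Sequence[Match], motif_size: int, epsilon: float
) -> Optional[Match]:
    """Return the complete match with the smallest LRMSD that is <= epsilon."""
    accepted = [
        m for m in candidates
        if m.matched_points == motif_size  # reject partial matches
        and m.lrmsd <= epsilon             # enforce the geometric threshold
    ]
    # No acceptable match is a valid outcome, so return None in that case.
    return min(accepted, key=lambda m: m.lrmsd) if accepted else None


# Toy data: one partial match and one match above the threshold are discarded.
candidates = [Match(6, 1.8), Match(5, 0.4), Match(6, 0.9), Match(6, 3.2)]
best = select_best_match(candidates, motif_size=6, epsilon=2.0)
```

In this toy example the partial match has the lowest LRMSD but is rejected because it covers only five of the six motif points.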

    Hypothesis Testing

    Our approach employs a hypothesis testing framework with a nonparametric statistical model to detect matches with statistically significant geometric and chemical similarity. Match significance is assessed by comparing the match LRMSD to a baseline degree of geometric and chemical similarity, which is established with a reference set of protein structures. We compute the baseline set of matches independently for each motif S by finding the set of matches between S and a subset of the proteins in the Protein Data Bank.
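
One simple way to picture a nonparametric comparison against a baseline is an empirical p-value: the fraction of baseline matches whose LRMSD is at least as small as the observed one. The sketch below assumes that interpretation; it is not necessarily the statistical model used in MASH, and the function name, the toy baseline values, and the significance level `alpha` are all hypothetical.

```python
from bisect import bisect_right
from typing import Sequence


def empirical_p_value(match_lrmsd: float, baseline_lrmsds: Sequence[float]) -> float:
    """Fraction of baseline matches whose LRMSD is <= the observed LRMSD."""
    if not baseline_lrmsds:
        raise ValueError("baseline set must not be empty")
    ordered = sorted(baseline_lrmsds)
    # bisect_right counts how many baseline values are <= match_lrmsd.
    at_least_as_similar = bisect_right(ordered, match_lrmsd)
    return at_least_as_similar / len(ordered)


# Toy baseline for one motif S (assumed LRMSDs against a reference set).
baseline = [1.2, 1.5, 1.9, 2.1, 2.4, 2.8, 3.0, 3.3, 3.7, 4.0]
p = empirical_p_value(1.3, baseline)

alpha = 0.2  # hypothetical significance level, chosen only for the example
is_significant = p <= alpha
```

Because the baseline is computed separately for each motif, the same LRMSD can be significant for one motif and unremarkable for another.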

    Geometric Sieving

    We have also looked at computational methods to characterize and systematize the definition of good motifs. One such technique is Geometric Sieving.

    The goal of Geometric Sieving is to propose geometric criteria for optimizing known motifs and making them more effective. An effective motif retains geometric and chemical similarity to substructures in functional homologs, while maintaining geometric and chemical dissimilarity to all substructures of functionally unrelated proteins.

    Geometric Sieving operates by measuring Geometric Uniqueness, a property which reflects the median geometric similarity between a motif and a reference set of protein structures. Different motifs have different degrees of Geometric Uniqueness, but we found that geometrically unique motifs tend to be effective as well.
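
As a rough illustration of the idea, Geometric Uniqueness can be pictured as a median over the LRMSDs of a motif's matches against the reference set. The sketch below makes two assumptions that are not stated in the text: that similarity is measured by LRMSD, and that a larger median LRMSD (less similarity to the reference set) means a more unique motif. The function name and the toy values are hypothetical.

```python
from statistics import median
from typing import Sequence


def geometric_uniqueness(reference_lrmsds: Sequence[float]) -> float:
    """Median LRMSD of a motif's matches against a reference set (assumed measure)."""
    if not reference_lrmsds:
        raise ValueError("reference set must not be empty")
    return median(reference_lrmsds)


# Toy LRMSDs of two motifs matched against the same five reference structures.
motif_a = [2.9, 3.4, 3.1, 3.8, 2.7]
motif_b = [1.1, 0.9, 1.6, 1.3, 1.2]

score_a = geometric_uniqueness(motif_a)
score_b = geometric_uniqueness(motif_b)

# Under the stated assumption, the motif with the larger median is more unique.
more_unique = "A" if score_a > score_b else "B"
```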

    Geometric Sieving measures Geometric Uniqueness for every subset of a given input motif, but the most direct method involves computing millions of matches. In recent papers we describe one method for identifying the most geometrically unique subset without exhaustive match computation.
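
To see why the direct method becomes expensive, the sketch below enumerates motif subsets and counts the matches an exhaustive approach would need. It illustrates only the brute-force cost, not the more efficient method described in the papers. The minimum subset size, the motif size, and the reference set size are hypothetical values chosen for the example.

```python
from itertools import combinations
from math import comb
from typing import Iterator, Sequence, Tuple


def enumerate_subsets(motif: Sequence[str], min_size: int) -> Iterator[Tuple[str, ...]]:
    """Yield every subset of the motif with at least min_size points."""
    for size in range(min_size, len(motif) + 1):
        yield from combinations(motif, size)


def exhaustive_match_count(motif_size: int, min_size: int, reference_size: int) -> int:
    """Matches needed if every subset is matched against every reference structure."""
    subsets = sum(comb(motif_size, k) for k in range(min_size, motif_size + 1))
    return subsets * reference_size


# Hypothetical scenario: a 10-point motif, subsets of 4 or more points,
# and a reference set of 5000 structures.
total_matches = exhaustive_match_count(motif_size=10, min_size=4, reference_size=5000)
small_example = list(enumerate_subsets(["a", "b", "c"], min_size=2))
```

With these example numbers there are 848 subsets and 4,240,000 matches to compute, and the count grows quickly as motifs get larger.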


    Collaborations: This work has been conducted in collaboration with Dr. Olivier Lichtarge. It is part of a larger project whose overall goal is the automated functional annotation of proteins.

    Related Publications