B. Y. Chen, “Geometry-based methods for protein function prediction,” PhD thesis, Rice University, Department of Computer Science, Houston, Texas, 2008.
The development of new and effective drugs is strongly affected by the need to identify drug targets and to reduce side effects. Unfortunately, resolving these issues depends partially on a broad and thorough understanding of the biological function of many proteins, and the experimental determination of protein function is expensive and time consuming. In response to this problem, algorithms for computational function prediction have been designed to expand experimental impact by finding proteins with predictably similar function, mapping experimental knowledge onto very similar, unstudied proteins. This thesis seeks to develop a method that can identify useful geometric and chemical similarities between well-studied and unstudied proteins. Our approach is to identify matches of geometric and chemical similarity between motifs, representing known functional sites, and substructures of functionally uncharacterized proteins (targets). It is commonly hypothesized that the existence of a match could imply that the target contains an active site similar to the motif. We have designed the MASH (Match Augmentation with Statistical Hypothesis Testing) pipeline, a software tool for computing matches. MASH is the first method to match point-based motifs, developed in earlier work, that represent functional sites as points in space with ranked priorities and alternative chemical labels. MASH is also the first to match cavity-aware motifs, a novel contribution of this work, that extend point-based motifs with volumetric information describing active clefts critical to protein function. Controlled experiments demonstrate that matches for both types of motifs can identify cognate active sites. However, motifs can also identify matches to functionally unrelated proteins. For this reason, we developed Motif Profiling (MP), the first method for motif refinement that reduces geometric similarity to functionally unrelated proteins.
MP is implemented in two forms: Geometric Sieving (GS) refines point-based motifs and Cavity Scaling (CS) refines cavity-aware motifs. Controlled experimentation demonstrates that GS and CS identify motif refinements that have more matches to functionally related proteins and fewer matches to functionally unrelated proteins. This thesis demonstrates the importance of computational tools for matching and refining motifs, emphasizing the applicability of large-scale geometric and statistical analysis for functional annotation.
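
As a rough illustration of what matching a point-based motif against a target can mean, the following minimal sketch searches a target for points that are chemically compatible with a motif and geometrically similar to it. It is not the MASH algorithm: the data layout, the function name `find_matches`, the tolerance `tol`, and the coordinates are all hypothetical, the search is brute force, ranked priorities are not modeled, and geometric similarity is approximated by comparing pairwise distances instead of superimposing the structures (so mirror images are not distinguished).

```python
from itertools import permutations
from math import dist  # Python 3.8+

def find_matches(motif, target, tol=1.0):
    """Return tuples of target indices, one index per motif point.

    motif  : list of (xyz, set_of_allowed_labels)  -- hypothetical layout
    target : list of (xyz, label)                  -- hypothetical layout
    tol    : allowed difference between corresponding pairwise distances
    """
    k = len(motif)
    matches = []
    for combo in permutations(range(len(target)), k):
        # Chemical check: each target label must be one of the motif
        # point's alternative labels.
        if any(target[t][1] not in motif[m][1] for m, t in enumerate(combo)):
            continue
        # Geometric check: every pairwise distance in the motif must agree
        # with the corresponding distance in the target, within tol.
        if all(
            abs(dist(motif[i][0], motif[j][0])
                - dist(target[combo[i]][0], target[combo[j]][0])) <= tol
            for i in range(k)
            for j in range(i + 1, k)
        ):
            matches.append(combo)
    return matches

# Made-up example: a three-point motif whose third point accepts ASP or GLU.
motif = [
    ((0.0, 0.0, 0.0), {"SER"}),
    ((3.0, 0.0, 0.0), {"HIS"}),
    ((0.0, 4.0, 0.0), {"ASP", "GLU"}),
]
target = [
    ((10.0, 10.0, 10.0), "GLY"),
    ((1.0, 1.0, 1.0), "SER"),
    ((4.0, 1.0, 1.0), "HIS"),
    ((1.0, 5.0, 1.0), "GLU"),
    ((20.0, 0.0, 0.0), "HIS"),
]
result = find_matches(motif, target)
```

Here the target points at indices 1, 2, and 3 form a translated copy of the motif with compatible labels, while the distant HIS at index 4 fails the distance check. The exhaustive search grows rapidly with target size, so it is suitable only for illustration.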