We have also looked at computational methods to characterize and systematize the definition of good motifs. One such technique is Geometric Sieving. The goal of Geometric Sieving is to propose geometric criteria for refining known motifs so that they become more effective. An effective motif retains geometric and chemical similarity to substructures in functional homologs, while maintaining geometric and chemical dissimilarity to all substructures of functionally unrelated proteins.
Geometric Sieving operates by measuring "geometric uniqueness," a property that reflects the median geometric similarity between a motif and a reference set of protein structures. Different motifs have different degrees of geometric uniqueness, and we found that geometrically unique motifs tend to be effective as well. Geometric Sieving measures geometric uniqueness for every subset of a given input motif, but the most direct method, matching every subset against every reference structure, involves computing millions of matches. In recent papers we describe one method for identifying the most geometrically unique subset without exhaustive match computation.
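To give a sense of why the direct approach is expensive, the following sketch counts the matches required to profile every subset of a motif. The motif size, minimum subset size, and background set size are hypothetical values chosen for illustration, and the function name is ours; none of them comes from the Geometric Sieving papers.

```python
from math import comb


def exhaustive_match_count(motif_size, min_subset_size, background_size):
    """Count motif-vs-structure matches needed to profile every subset.

    Each subset of the input motif (with at least min_subset_size residues)
    is matched once against every structure in the background set.
    """
    n_subsets = sum(
        comb(motif_size, k) for k in range(min_subset_size, motif_size + 1)
    )
    return n_subsets * background_size


# Hypothetical sizes: a 10-residue motif, subsets of at least 4 residues,
# and a background set of 5,000 structures.
print(exhaustive_match_count(10, 4, 5000))  # 848 subsets x 5000 = 4240000
```

Even with these modest assumed sizes the count exceeds four million matches, and it grows exponentially with the size of the input motif.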
Designing high-quality motifs that accurately represent the most significant and unique structural features of the protein being modeled is critical to the function prediction performance of motif matching algorithms (such as LabelHash/MASH). Amino acids likely to have functional importance have often been found to reside within large cavities and clefts on protein surfaces, but these clefts can grow quite large, making them difficult to model. Often, in these large clefts or binding site regions, a much smaller subset of the amino acids present is more directly involved in the functional activity of the cavity. Given a larger input motif, Geometric Sieving identifies a smaller subset of amino acids that still uniquely represents the function of the modeled protein by statistically comparing the "geometric uniqueness" of each amino acid subset.
For any given motif, we can match it against either the entire PDB or a representative subset, such as the non-redundant PDB (nrPDB) or curated subsets derived from CATH and SCOP. Matching against one of these background sets provides us with a "profile" for a given motif, such as that shown below.
A desirable motif is geometrically similar (low least root mean square deviation, or LRMSD) to functionally homologous proteins and dissimilar (high LRMSD) to functionally unrelated proteins. By finding the motifs that maximize the separation between related and unrelated proteins, we can identify high-performance motifs. For each subset motif, a profile is calculated by matching against the given background set as shown below.
After computing these profiles (which are actually probability density functions) for each subset motif, we then select the subset motif that pushes the median of the profile farthest to the right (to higher LRMSD values). This selected subset motif is then returned as a geometrically unique subset of the original input motif. Examples of the subset motif profiles computed for two different input motifs are shown below. Each blue curve is a subset motif profile; the two black curves are the least and most unique subsets.
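The selection step described above can be sketched as follows. This is a minimal illustration and not the published implementation: it assumes each subset's profile is already available as a plain list of LRMSD values, the function names are ours, and the residue labels and LRMSD numbers are made up purely to show the mechanics.

```python
from statistics import median


def geometric_uniqueness(lrmsd_values):
    """Median LRMSD of one subset motif's matches against the background set."""
    return median(lrmsd_values)


def most_unique_subset(profiles):
    """Return the subset whose profile has the largest median LRMSD."""
    return max(
        profiles, key=lambda subset: geometric_uniqueness(profiles[subset])
    )


# Made-up example: three subsets of a hypothetical motif, each with the
# LRMSD values (in angstroms) of its matches against a tiny background set.
profiles = {
    ("r1", "r2", "r3"): [0.9, 1.1, 1.3, 1.2],
    ("r1", "r2", "r3", "r4"): [1.6, 1.9, 2.1, 1.8],
    ("r1", "r3", "r4"): [1.0, 1.4, 1.5, 1.2],
}

best = most_unique_subset(profiles)
print(best, geometric_uniqueness(profiles[best]))
```

In this toy data the four-residue subset has the highest median and is therefore the one returned. A real profile would contain one LRMSD value per background structure, so each list would have thousands of entries rather than four.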