Statistical models in protein structural alignments
Doctor of Philosophy
Recent advances in protein structure determination techniques have greatly increased the rate at which high-resolution protein structures become available. However, experimental techniques for identifying protein function remain expensive and time-consuming. This makes computational approaches to function determination, which fall into two major groups, sequence-based and structure-based, a particularly attractive alternative. One limitation of traditional sequence-based methods is their need for functional landmarks: when a protein has no sequence orthologues of known function, these methods become unreliable. A protein's structure, on the other hand, contains functional information that is not recoverable from sequence alone, making local structural comparison methods a possible computational alternative, particularly when little or nothing is known about the protein's potential function. Many computational tools have been developed to search for structural similarity, but they alone are not enough: effective statistical models are necessary for accurate predictions. To assess the quality of a structural match, a successful statistical model must take into account individual properties of the search query (the motif), such as geometry, size, and residue composition. Statistical significance scores should also be independent of the parameters used by the structural comparison algorithm, such as heuristics used to speed up the search process. Finally, the model needs to take into account properties of the match itself, such as surface accessibility and structural conservation of the matched residues. Herein, we develop a statistical framework that is not affected by the parameters used in the structural comparison process and that takes into account the individual properties of the query motif.
We test our statistical model, coupled with a successful structural search and comparison algorithm (Match Augmentation), on a dataset of 20 structural motifs representing a range of distinct enzymatic active sites in unmutated protein structures. We find that our approach exhibits high sensitivity and reasonable specificity. We also apply the approach to a real biological problem: predicting a possible function of Rad21, a protein involved in chromosome segregation.