RUPEE: A Big Data Approach to Indexing and Searching Protein Structures
Date
2021Metadata
[+] Show full item recordAbstract
Given the close relationship between protein structure and function, protein structure searches have long played an established role in bioinformatics. Despite their maturity, existing protein structure searches either compromise the quality of results to obtain faster response times or suffer from longer response times to provide better quality results. Existing protein structure searches that focus on faster response times often use sequence clustering or depend on other simplifying assumptions not based on structure alone. In the case of sequence clustering, strong structure similarities are often hidden behind cluster representatives. Existing protein structure searches that focus on better quality results often perform full pairwise protein structure alignments with the query structure against every available structure in the searched database, which can take as long as a full day to complete. The poor response times of these protein structure searches prevent the easy and efficient exploration of relationships between protein structures, which is the norm in other areas of inquiry. To address these trade-offs between faster response times and quality results, we have developed RUPEE, a fast and accurate purely geometric protein structure search combining a novel approach to encoding sequences of torsion angles with established techniques from information retrieval and big data. RUPEE can compare the query structure to every available structure in the searched database with fast response times.
To accomplish this, first, we introduce a new polar plot of torsion angles to help identify separable regions of torsion angles and derive a simple encoding of torsion angles based on the identified regions. Then, we introduce a heuristic to encode sequences of torsion angles called Run Position Encoding to increase the specificity of our encoding within regular secondary structures, alpha-helices and beta-strands. Once we have a linear encoding of protein structures based on their torsion angles, we use min-hashing and locality sensitive hashing, established techniques from information retrieval and big data, to compare the query structure to every available structure in the searched database with fast response times. Moreover, because RUPEE is a purely geometric protein structure search, it does not depend on protein sequences. RUPEE also does not depend on other simplifying assumptions not based on structure alone. As such, RUPEE can be used effectively to search on protein structures with low sequence and structure similarity to known structures, such as predicted structures that results from protein structure prediction algorithms. Comparing our results to the mTM-align, SSM, CATHEDRAL, and VAST protein structure searches, RUPEE has set a new bar for protein structure searches. RUPEE produces better quality results than the best available protein structure searches and does so with the fastest response times.
Table of Contents
Introduction -- Encoding Torsion Angles -- Indexing Protein Structures -- Searching Protein Structures -- Results and Evaluation -- Using RUPEE -- Conclusion -- Appendix A. Benchmarks of Known Protein Structures -- Appendix B. Benchmarks of Protein Structure Predictions
Degree
Ph.D. (Doctor of Philosophy)