
dc.contributor.advisor: Lee, Yugyung, 1960-
dc.contributor.author: Ayoub, Ronald
dc.date.issued: 2021
dc.date.submitted: 2021 Spring
dc.description: Title from PDF of title page viewed July 7, 2021
dc.description: Yugyung Lee
dc.description: Vita
dc.description: Includes bibliographical references (pages 149-158)
dc.description: Thesis (Ph.D.)--School of Computing and Engineering and Department of Mathematics and Statistics. University of Missouri--Kansas City, 2021
dc.description.abstract: Given the close relationship between protein structure and function, protein structure searches have long played an established role in bioinformatics. Despite their maturity, existing protein structure searches either compromise the quality of results to obtain faster response times or suffer from longer response times to provide better quality results. Existing protein structure searches that focus on faster response times often use sequence clustering or depend on other simplifying assumptions not based on structure alone. In the case of sequence clustering, strong structure similarities are often hidden behind cluster representatives. Existing protein structure searches that focus on better quality results often perform full pairwise protein structure alignments with the query structure against every available structure in the searched database, which can take as long as a full day to complete. The poor response times of these protein structure searches prevent the easy and efficient exploration of relationships between protein structures, which is the norm in other areas of inquiry. To address these trade-offs between faster response times and quality results, we have developed RUPEE, a fast and accurate purely geometric protein structure search combining a novel approach to encoding sequences of torsion angles with established techniques from information retrieval and big data. RUPEE can compare the query structure to every available structure in the searched database with fast response times. To accomplish this, first, we introduce a new polar plot of torsion angles to help identify separable regions of torsion angles and derive a simple encoding of torsion angles based on the identified regions. Then, we introduce a heuristic to encode sequences of torsion angles called Run Position Encoding to increase the specificity of our encoding within regular secondary structures, alpha-helices and beta-strands.
Once we have a linear encoding of protein structures based on their torsion angles, we use min-hashing and locality-sensitive hashing, established techniques from information retrieval and big data, to compare the query structure to every available structure in the searched database with fast response times. Moreover, because RUPEE is a purely geometric protein structure search, it does not depend on protein sequences. RUPEE also does not depend on other simplifying assumptions not based on structure alone. As such, RUPEE can be used effectively to search on protein structures with low sequence and structure similarity to known structures, such as predicted structures that result from protein structure prediction algorithms. Comparing our results to the mTM-align, SSM, CATHEDRAL, and VAST protein structure searches, RUPEE has set a new bar for protein structure searches. RUPEE produces better quality results than the best available protein structure searches and does so with the fastest response times.
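As an illustration of the min-hashing and locality sensitive hashing step described in the abstract, the following is a minimal Python sketch and not RUPEE's implementation. It assumes each structure has already been reduced to a linear sequence of integer codes (toy values here, standing in for an encoding of torsion angles). The shingle length, the number of hash functions, the band count, the salted BLAKE2 hash family, and all function names are hypothetical choices made for the example.

```python
import hashlib
from collections import defaultdict


def shingles(codes, k=3):
    """Set of length-k contiguous windows of an encoded structure (needs len(codes) >= k)."""
    return {tuple(codes[i:i + k]) for i in range(len(codes) - k + 1)}


def _hash(seed, shingle):
    """One member of a hypothetical hash family: 64-bit BLAKE2b salted with the seed."""
    data = f"{seed}:{shingle}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(shingle_set, num_hashes=64):
    """Min-hash signature: the minimum hash value over the set, per hash function."""
    return [min(_hash(seed, s) for s in shingle_set) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def build_lsh_index(signatures, bands=16):
    """Split each signature into bands and bucket structure names by band content."""
    rows = len(next(iter(signatures.values()))) // bands
    index = defaultdict(set)
    for name, sig in signatures.items():
        for b in range(bands):
            index[(b, tuple(sig[b * rows:(b + 1) * rows]))].add(name)
    return index, rows


def candidates(index, rows, query_sig, bands=16):
    """Names sharing at least one identical band with the query signature."""
    found = set()
    for b in range(bands):
        found |= index.get((b, tuple(query_sig[b * rows:(b + 1) * rows])), set())
    return found


# Toy database of already-encoded structures (made-up codes, not real data).
db = {
    "a": [1, 1, 2, 3, 3, 3, 1, 2, 2, 1],
    "b": [4, 5, 4, 5, 4, 5, 4, 5, 4, 5],
}
sigs = {name: minhash_signature(shingles(codes)) for name, codes in db.items()}
index, rows = build_lsh_index(sigs)
query = minhash_signature(shingles(db["a"]))
hits = candidates(index, rows, query)
```

In this sketch only the names returned by `candidates` would need a closer comparison, which is the general way banded locality sensitive hashing avoids scoring the query against every signature in full.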
dc.description.tableofcontents: Introduction -- Encoding Torsion Angles -- Indexing Protein Structures -- Searching Protein Structures -- Results and Evaluation -- Using RUPEE -- Conclusion -- Appendix A. Benchmarks of Known Protein Structures -- Appendix B. Benchmarks of Protein Structure Predictions
dc.format.extent: xvi, 159 pages
dc.identifier.uri: https://hdl.handle.net/10355/84169
dc.subject.lcsh: Proteins -- Structure
dc.subject.lcsh: Big data
dc.subject.other: Dissertation -- University of Missouri--Kansas City -- Computer science
dc.subject.other: Dissertation -- University of Missouri--Kansas City -- Mathematics
dc.title: RUPEE: A Big Data Approach to Indexing and Searching Protein Structures
thesis.degree.discipline: Computer Science (UMKC)
thesis.degree.discipline: Mathematics (UMKC)
thesis.degree.grantor: University of Missouri--Kansas City
thesis.degree.level: Doctoral
thesis.degree.name: Ph.D. (Doctor of Philosophy)

