Protein-DNA interaction prediction and protein structure modeling by machine learning
Abstract
Proteins are large, complex molecules that perform most essential functions within organisms. In this work, we focus on two important aspects that determine their functional properties: the tertiary structure of proteins and their interaction patterns with the genome. Understanding these properties brings valuable insights into the fundamentals of biology and can lead to new applications in areas such as agriculture, precision medicine, and drug discovery. With recent developments in bioinformatics and structural biology, machine learning, and deep learning in particular, has proven to be extremely powerful for the inference and interpretation of experimental observations by taking advantage of the large amount of data publicly available today. We aim to propose novel machine learning frameworks that can both extract information from higher-level features and provide explainability, yielding meaningful insights beyond the predictions themselves. However, due to the variability of biological phenomena, the design of data processing and modeling needs to be extensive for the features derived from proteins. In addition, the different geometric measurements (1D, 2D, and 3D) of protein properties bring new challenges for selecting model architectures that can effectively leverage different forms of data structure. In this dissertation, four major contributions are described. First, DeepGRN is a method for transcription factor binding site prediction using a 1D transformer-based network. Second, GNET2 is a data-assisted method to infer the interactions between proteins and genes from gene expression data using decision trees and information theory. Third, ATTContact is a tool for protein contact prediction based on 2D residual neural networks with an attention mechanism. Finally, EnQA is a method based on 3D equivariant graph networks for protein model quality assessment and for selecting the most accurate model as the final protein structure prediction. 
All of the methods described have been released as open-source software and are freely available to the scientific community.
Degree
Ph. D.