Deep learning for protein function prediction
Format
Thesis
Abstract
Proteins are indispensable macromolecules that drive essential biological processes, including catalyzing chemical reactions, regulating gene expression, and transmitting molecular signals. Understanding protein function is key to unraveling the complex mechanisms that govern living systems. However, experimental approaches for determining protein function are often slow, costly, and low-throughput. At the same time, rapid advances in genome and transcriptome sequencing have produced a massive volume of protein sequence data, far outpacing the rate of experimental functional annotation. This growing disparity underscores the need for accurate and scalable computational methods to predict protein function. This dissertation presents a suite of deep learning-based computational approaches aimed at addressing the challenges of automated protein function prediction (AFP). These methods leverage a diverse range of protein information, including amino acid sequences, 3D structural features, domain annotations, textual descriptions, and ontology definitions. By combining these sources, I develop multimodal deep learning models that enable robust and generalizable function prediction. First, I introduce TransFun, a model that combines protein sequence and structure data using graph-based learning. Next, I extend this approach with TransFew, which leverages large pre-trained language models and semantic representations of Gene Ontology (GO) terms to enhance predictions, particularly for underrepresented functions. Finally, I present FunBind, a flexible multimodal framework that integrates multiple biological data types. The primary contributions of this work address two key challenges in AFP: effectively integrating multimodal protein data, and accurately predicting functions for proteins that are sparsely annotated, particularly in few-shot or zero-shot settings.
In addition to proposing novel models, this work discusses practical challenges in multimodal learning for biology, strategies for combining heterogeneous data sources, and evaluation protocols that better reflect real-world prediction scenarios. The methods developed in this dissertation demonstrate significant improvements in both the accuracy and coverage of protein function prediction. These contributions ultimately enhance our understanding of protein roles across diverse biological systems. All tools and datasets developed through this research are publicly available to support further scientific progress in the field.
Degree
Ph.D.
