Natural language models for protein predictions using anti-CRISPR as a case
Format
Thesis
Abstract
Anti-CRISPR proteins (Acrs) can inactivate a bacterial host's CRISPR-Cas defense system and thereby promote bacteriophage infection. Their discovery offers a promising way to precisely control the CRISPR-Cas machinery when developing CRISPR-Cas-based biotechnological tools, e.g., gene editing. Acrs are widespread among bacteriophages, and past studies have identified several of them, but identifying Acrs comprehensively, accurately, and efficiently from genome and metagenome sequence data remains a challenge. Here, we have developed two deep-learning-based predictors that identify Acrs in protein datasets derived from genome and metagenome sequencing projects. The first is an attention-based supervised learning model with an ensemble framework. The second is a self-supervised learning model whose loss function uses Variance, Invariance, and Covariance regularization terms to explicitly target effective feature extraction through discriminative learning. The input to both models is a fully collected and carefully designed set of protein sequences. Extensive cross-validation and independent tests show that both models are more accurate than homology-based baseline predictors and the existing toolkit.
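The abstract names Variance, Invariance and Covariance regularization terms but does not give their form. As an illustration only, the sketch below shows how such a three-term loss is commonly written in VICReg-style self-supervised learning, where two embedded views of the same batch are compared. The function names, the pure-Python list representation, the weights 25/25/1, and the target standard deviation of 1 are assumptions taken from the general technique, not details confirmed by the thesis. How the two views of a protein sequence are produced is likewise not specified here.

```python
import math

def _center(z):
    """Subtract the per-dimension batch mean from an n x d list of embeddings."""
    n, d = len(z), len(z[0])
    means = [sum(row[j] for row in z) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in z]

def _variance_term(centered, eps=1e-4):
    """Hinge penalty on dimensions whose batch standard deviation is below 1."""
    n, d = len(centered), len(centered[0])
    total = 0.0
    for j in range(d):
        var_j = sum(row[j] ** 2 for row in centered) / (n - 1)
        total += max(0.0, 1.0 - math.sqrt(var_j + eps))
    return total / d

def _covariance_term(centered):
    """Sum of squared off-diagonal covariances, divided by the dimension d."""
    n, d = len(centered), len(centered[0])
    total = 0.0
    for i in range(d):
        for j in range(d):
            if i != j:
                cov_ij = sum(row[i] * row[j] for row in centered) / (n - 1)
                total += cov_ij ** 2
    return total / d

def vicreg_loss(z_a, z_b, inv_w=25.0, var_w=25.0, cov_w=1.0):
    """Weighted sum of the three terms for two views z_a, z_b (each n x d, n >= 2).

    The default weights are assumed, commonly used values, not the thesis's settings.
    """
    n, d = len(z_a), len(z_a[0])
    # Invariance: mean squared distance between the two views of each sample.
    invariance = sum(
        (a - b) ** 2 for ra, rb in zip(z_a, z_b) for a, b in zip(ra, rb)
    ) / (n * d)
    c_a, c_b = _center(z_a), _center(z_b)
    # Variance: keeps each embedding dimension from collapsing to a constant.
    variance = 0.5 * (_variance_term(c_a) + _variance_term(c_b))
    # Covariance: discourages different dimensions from carrying the same signal.
    covariance = _covariance_term(c_a) + _covariance_term(c_b)
    return inv_w * invariance + var_w * variance + cov_w * covariance

# Toy batches (hypothetical values): a collapsed embedding and a well-spread one.
collapsed = [[0.5, 0.5] for _ in range(4)]
spread = [[2.0, 0.0], [-2.0, 0.0], [0.0, 2.0], [0.0, -2.0]]
print(vicreg_loss(collapsed, collapsed))
print(vicreg_loss(spread, spread))
```

In this toy comparison the collapsed batch is penalized by the variance term even though its two views agree perfectly, while the spread, decorrelated batch incurs no penalty. That is the behavior such regularization is meant to provide: agreement between views alone is not enough to minimize the loss.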
Degree
M.S.
