Natural language models for protein predictions using anti-CRISPR as a case

Format

Thesis

Abstract

Anti-CRISPR (Acr) proteins can inactivate a bacterial host's CRISPR-Cas defense system and thereby promote bacteriophage infection. Their discovery offers a promising way to precisely control the CRISPR-Cas machinery in CRISPR-Cas-based biotechnological tools such as gene editing. Acrs are widespread among bacteriophages, and past studies have identified several of them, but identifying Acrs comprehensively, accurately, and efficiently from genome and metagenome sequence data remains a challenge. Here, we have developed two deep-learning-based predictors that identify Acrs in protein datasets derived from genome and metagenome sequencing projects. One is an attention-based supervised learning model with an ensemble framework. The other is a self-supervised learning model whose loss function uses Variance, Invariance, and Covariance regularization terms to explicitly guide feature extraction through discriminative learning. Both models take comprehensively collected and carefully designed protein sequences as input. Extensive cross-validation and independent tests show that both models are more accurate than homology-based baseline predictors and the existing toolkit.
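
The Variance, Invariance and Covariance terms named in the abstract can be illustrated with a small, dependency-free sketch. It follows the general VICReg formulation, not necessarily the thesis's exact implementation: the function names, the weights (lam, mu, nu), and the gamma and eps defaults are assumptions chosen for illustration, and the inputs stand in for two batches of embeddings produced from two views of the same protein sequences.

```python
import math


def _centered_columns(z):
    """Transpose a batch (n rows of d features) into d mean-centered columns."""
    cols = [list(col) for col in zip(*z)]
    return [[v - sum(col) / len(col) for v in col] for col in cols]


def invariance_term(za, zb):
    """Mean squared Euclidean distance between paired embeddings of two views."""
    n = len(za)
    return sum(
        sum((a - b) ** 2 for a, b in zip(ra, rb)) for ra, rb in zip(za, zb)
    ) / n


def variance_term(z, gamma=1.0, eps=1e-4):
    """Hinge penalty that pushes the std of every feature above gamma."""
    n = len(z)
    cols = _centered_columns(z)
    hinge = [
        max(0.0, gamma - math.sqrt(sum(v * v for v in col) / (n - 1) + eps))
        for col in cols
    ]
    return sum(hinge) / len(cols)


def covariance_term(z):
    """Sum of squared off-diagonal covariance entries, divided by d."""
    n = len(z)
    cols = _centered_columns(z)
    d = len(cols)
    total = 0.0
    for i in range(d):
        for j in range(d):
            if i != j:
                cov = sum(a * b for a, b in zip(cols[i], cols[j])) / (n - 1)
                total += cov ** 2
    return total / d


def vicreg_loss(za, zb, lam=25.0, mu=25.0, nu=1.0):
    """Weighted sum of the three terms; the weights are assumed defaults."""
    inv = invariance_term(za, zb)
    var = variance_term(za) + variance_term(zb)
    cov = covariance_term(za) + covariance_term(zb)
    return lam * inv + mu * var + nu * cov
```

In this sketch the invariance term pulls the two views of a sequence together, the variance term guards against all embeddings collapsing to one point, and the covariance term discourages redundant features. A real model would compute the same quantities on tensors inside a training loop.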

Degree

M.S.
