Applying deep learning technologies to discovery and characterization of genetic variants in animal genomes
Abstract
[EMBARGOED UNTIL 12/1/2024] The volume of publicly available genomic data has expanded by an order of magnitude within the last decade. However, research with these aggregated public data requires minimizing the accumulated systematic bias introduced by software and sequencing platforms. Even a small reduction in the error rate has major implications for research that uses billions of genomic variants. An incomplete understanding of the sources of error had caused progress to plateau, but genomics-specific deep neural networks can now account for both known and unknown errors. These models, however, often embed assumptions based on the human genome. Here, we used bovine genomes to investigate the viability of two deep-learning variant callers for short-read sequencing data in animals. First, we developed best practices for re-training DeepVariant to improve short-variant calling accuracy across species, demonstrating the importance of curating high-quality training labels, which most animal species lack. Next, we explored the reliability of Cue in bovine genomes, finding that the resulting variants required less manual curation to select reliable consensus calls. This research lays the foundation for the animal genomics community to adopt new technologies that will rapidly accelerate our comparative knowledge of the genome.
Degree
Ph.D.