Scalable knowledge graph construction and inference on human genome variants

Format

Thesis

Abstract

Making full use of RNA-sequencing data requires efficient and scalable solutions. In this research, we explore the use of knowledge graphs to analyze variant-level information extracted from RNA sequences of COVID-19 patients. We propose a unified knowledge graph tailored to the real-world relationships among entities obtained from Variant Call Format (VCF) files, VCF files annotated with the SnpEff tool, and the associated CADD score files, while remaining general enough to permit future extension. Using an existing graph store such as Blazegraph, we build a large, scalable knowledge graph containing billions of nodes and edges that enables efficient storage, querying, and dataset creation and also serves as a foundation for several downstream tasks. To demonstrate its versatility, we present two case studies on two different subsets of the graph, one containing 73 million nodes and 112 million edges and the other containing 1 million nodes and 2 million edges, and apply graph machine learning techniques to node classification tasks. We address several challenges, including efficient scalability, capturing real-world knowledge at a large scale, and training a Graph Neural Network (GNN) that performs well on the evaluation metrics. We compare two GNN models, GraphSAGE and Graph Convolutional Network (GCN), on identifying the range of CADD raw scores to which a variant belongs, and we use the GraphSAGE model to identify the deleteriousness of a variant under different hyperparameter settings. Through these experiments, we demonstrate that neighborhood knowledge substantially aids the model in learning and that the GraphSAGE model outperforms GCN on the CADD category node classification task.
By integrating a variety of RNA-seq genomic information into a coherent knowledge graph framework, our approach not only addresses the immediate challenges of accessing and parsing variant-level information but also lays the groundwork for future advances using knowledge graphs in genomic research. Finally, we demonstrate our tool, VariantKG, and present three usage scenarios: graph enrichment, graph creation, and graph machine learning and inference.
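
To illustrate the idea of turning variant-level records into graph statements, the following is a minimal sketch and not the schema or code used in the thesis. It assumes the standard VCF column order (CHROM, POS, ID, REF, ALT, ...). The namespace `BASE`, the predicate names, and the function `vcf_record_to_triples` are hypothetical names chosen for this example only.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]

# Hypothetical namespace for illustration; not taken from the thesis.
BASE = "http://example.org/variantkg/"


def vcf_record_to_triples(line: str, sample_id: str) -> List[Triple]:
    """Convert one tab-separated VCF data line into (subject, predicate, object) triples.

    Only the first five fixed VCF columns are used. A record with several
    comma-separated ALT alleles yields one variant node per allele.
    """
    fields = line.rstrip("\n").split("\t")
    chrom, pos, _record_id, ref, alt = fields[:5]
    sample = f"{BASE}sample/{sample_id}"

    triples: List[Triple] = []
    for allele in alt.split(","):
        variant = f"{BASE}variant/{chrom}_{pos}_{ref}_{allele}"
        triples.extend([
            (sample, BASE + "hasVariant", variant),      # edge: sample -> variant
            (variant, BASE + "hasChromosome", chrom),    # literal attributes
            (variant, BASE + "hasPosition", pos),
            (variant, BASE + "hasReference", ref),
            (variant, BASE + "hasAlternate", allele),
        ])
    return triples


if __name__ == "__main__":
    record = "1\t12345\t.\tA\tG,T\t50\tPASS\tDP=10"
    for triple in vcf_record_to_triples(record, "sample1"):
        print(triple)
```

Triples of this shape could then be serialized to an RDF format and loaded into a graph store; annotation fields (for example from SnpEff or CADD) would add further predicates on the same variant nodes.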

Degree

Ph. D.
