Parallel SPARQL Query Execution using Apache Spark
Semantic Web technologies such as the Resource Description Framework (RDF) and SPARQL are increasingly being adopted by applications on the Web, as well as in domains such as healthcare, finance, and national security and intelligence. Although many techniques for RDF indexing and SPARQL query processing have been proposed, the rapid growth in the size of RDF knowledge bases demands scalable techniques that can leverage the power of cluster computing. Big data ecosystems like Apache Spark provide new opportunities for designing scalable RDF indexing and query processing techniques. In this thesis, we present new ideas for storing, indexing, and querying RDF datasets with billions of RDF statements. Our approach leverages Resilient Distributed Datasets (RDDs) and MapReduce in Spark, together with the graph processing capability of GraphX. The key idea is to partition the RDF dataset, build indexes on the partitions, and execute a query in parallel on the collection of indexes. A key theme of our design is to keep the indexes in memory for fast query processing.
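
The partition, index, and parallel-query idea described above can be illustrated with a minimal sketch. This is not the thesis implementation: it uses plain Python with a thread pool as a stand-in for Spark RDD partitions, and the toy triples, the subject-hash partitioning scheme, the per-partition predicate index, and all function names are assumptions made for illustration only.

```python
import zlib
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy data: (subject, predicate, object) triples.
TRIPLES = [
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("alice", "worksAt", "acme"),
    ("carol", "worksAt", "acme"),
    ("dave", "knows", "alice"),
]


def partition(triples, num_partitions):
    """Hash-partition triples by subject (an assumed scheme, not the thesis's)."""
    parts = [[] for _ in range(num_partitions)]
    for triple in triples:
        # crc32 gives a hash that is stable across interpreter runs.
        bucket = zlib.crc32(triple[0].encode("utf-8")) % num_partitions
        parts[bucket].append(triple)
    return parts


def build_index(part):
    """Build an in-memory index for one partition, keyed by predicate."""
    index = defaultdict(list)
    for s, p, o in part:
        index[p].append((s, o))
    return index


def match(index, subject, predicate, obj):
    """Match one triple pattern against one index; None acts as a variable.

    For simplicity this sketch requires the predicate to be bound.
    """
    return [
        (s, predicate, o)
        for s, o in index.get(predicate, [])
        if subject in (None, s) and obj in (None, o)
    ]


def query(indexes, subject, predicate, obj):
    """Evaluate the pattern on every partition index in parallel and merge."""
    with ThreadPoolExecutor() as pool:
        per_partition = pool.map(
            lambda idx: match(idx, subject, predicate, obj), indexes
        )
        return sorted(t for result in per_partition for t in result)


indexes = [build_index(p) for p in partition(TRIPLES, 3)]
results = query(indexes, None, "worksAt", "acme")
print(results)
```

In this sketch each partition's index is queried independently, so the work scales with the number of partitions; a real SPARQL query with several triple patterns would additionally need joins across the per-partition results, which this example does not attempt.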
Table of Contents
Introduction -- Background and motivation -- Proposed approach -- Implementation of the system -- Performance evaluation -- Conclusion and future work