A new filtering index for fast processing of SPARQL queries
Metadata[+] Show full item record
The Resource Description Framework (RDF) has become a popular data model for representing data on the Web. Using RDF, any assertion can be represented as a (subject, predicate, object) triple. Essentially, RDF datasets can be viewed as directed, labeled graphs. Queries on RDF data are written using the SPARQL query language and contain basic graph patterns (BGPs). We present a new filtering index and query processing technique for processing large BGPs in SPARQL queries. Our approach called RIS treats RDF graphs as "first-class citizens." Unlike previous scalable approaches that store RDF data as triples in an RDBMS and process SPARQL queries by executing appropriate SQL queries, RIS aims to speed up query processing by reducing the processing cost of join operations. In RIS, RDF graphs are mapped into signatures, which are multisets. These signatures are grouped based on a similarity metric and indexed using Counting Bloom Filters. During query processing, the Counting Bloom Filters are checked to filter out non-matches, and finally the candidates are verified using Apache Jena. The filtering step prunes away a large portion of the dataset and results in faster processing of queries. We have conducted an in-depth performance evaluation using the Lehigh University Benchmark (LUBM) dataset and SPARQL queries containing large BGPs. We compared RIS with RDF-3X, which is a state-of-the-art scalable RDF querying engine that uses an RDBMS. RIS can significantly outperform RDF-3X in terms of total execution time for the tested dataset and queries.
Table of Contents
Introduction -- Motivation and related work -- Background -- Bloom filters and Bloom counters -- System architecture -- Signature tree generation -- Querying the signature tree -- Evaluation -- Experiments -- Conclusion