Parallel gene upstream comparison via multi-level hash tables on GPU
Abstract
The region of DNA immediately in front of a gene body (also called upstream region) contains short (8-20 base) sequence motifs that help to control when that gene is turned on and off. Unfortunately, these motifs are generally unknown and commonly degenerate. In this work, a motif-finding framework is proposed that, given a set of gene upstream regions, performs all-to-all pairwise comparison and identifies all motifs of length k (k-mers) that are common to any pair of upstream regions or differ in at most d characters. This framework stores the k-mers found in each gene in a multi-level hash table. The hash table design optimizes hash table comparison (rather than hash table insertion or lookup), is highly parallelizable and easily maps onto GPU. Four GPU kernels are proposed for pairwise hash table comparison, each leveraging a distinct parallelization approach. A study of different factors that affect the performance of the implementations is included (the hash function, the number of buckets and the settings of additional implementation-specific parameters). Experimental results performed using an average-size yeast genome show that our fastest GPU kernel outperforms an 8-thread, cache-efficient CPU implementation by a factor of [about]52x.
Degree
M.S.
Thesis Department
Rights
OpenAccess.
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 License.