Distributed association rule mining using an in-memory cluster computer framework
Abstract
[ACCESS RESTRICTED TO THE UNIVERSITY OF MISSOURI AT REQUEST OF AUTHOR.] Association rule mining is a key data mining technique used in a wide range of application domains. Although it is a mature area, much remains to be done to take full advantage of the massive distributed computing environments available today. A variety of algorithms have been proposed over the past two decades to identify frequent patterns efficiently. In general, these methods fall into two broad categories: Apriori-based and growth-based. Because the number of potential frequent itemsets is exponential in the number of distinct items, the resources required for computation can quickly exceed the capacity of a single machine, so distributed approaches are needed to extend our reach. In this thesis, we discuss the algorithms used in an association rule mining pipeline designed to take advantage of in-memory cluster computing environments. We propose several mechanisms to tailor the Apriori algorithm to distributed computing ecosystems and evaluate the scalability of our approach. We demonstrate significant improvements over an existing frequent pattern mining method, achieving a nearly 1000-fold speedup on certain datasets. Our proposed association rule mining package is modular, which promotes extensibility and flexibility in rule extraction, filtering, and analysis.
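For readers unfamiliar with the Apriori family mentioned in the abstract, the following is a minimal single-machine sketch of the textbook level-wise algorithm. It is not the distributed implementation developed in the thesis, whose details are not given here. The function name `apriori`, the `min_support` parameter (an absolute count), and the sample transactions are illustrative assumptions. With n distinct items there are 2^n - 1 possible non-empty itemsets, which is why the pruning step matters and why large inputs can outgrow one machine.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support count} for every frequent itemset.

    Textbook level-wise Apriori on one machine (illustrative only).
    `min_support` is an absolute transaction count, an assumption of this sketch.
    """
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count individual items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)

    k = 2
    while frequent:
        prev = set(frequent)
        # Join step: combine frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: a candidate survives only if all of its
        # (k-1)-item subsets were themselves frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        # Count step: one pass over the data per level.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        result.update(frequent)
        k += 1
    return result


# Hypothetical market-basket data, for illustration only.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]
result = apriori(baskets, min_support=3)
```

In this sketch the count step rescans every transaction at each level. That scan is the part a distributed approach would typically spread across a cluster, although how the thesis does so is not described in the abstract.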
Degree
M.S.
Thesis Department
Rights
Access to files is limited to the University of Missouri--Columbia.