Semi-automatic exploratory data analytics for actionable discoveries through subgroup mining
Format
Thesis
Abstract
People are naturally curious about differences between groups. Such differences help explain the root causes of discrepancies, such as those observed across populations and diseases. However, without prior knowledge of the data, it is extremely challenging to identify which groups differ most, let alone to discover which associations contribute to the differences. The challenges stem mainly from the large search space and complex data structure, as well as from the lack of efficient quantitative measures that closely reflect the meaning of the differences. To tackle these issues, we developed a novel exploratory data mining method that identifies ranked, highly contrasted subgroups for further in-depth analysis. The underpinning components of this method include (1) a semi-greedy forward floating selection algorithm to reduce the search space, (2) a deep-exploring approach that uses in-memory computing techniques to aggregate a collection of sizable and credible candidate feature sets for subgroup identification, (3) a G-index contrast measurement to guide the exploratory process and to evaluate the patterns of subgroup pairs, and (4) a ranking method to present mined results from highly contrasted subgroups. Computational experiments were conducted on both synthesized and real data. The algorithm performed adequately in recognizing known subgroups and in discovering new and unexpected subgroups. This exploratory data analysis method offers a new paradigm for selecting data-driven hypotheses that can lead to actionable outcomes tailored to subpopulations of individuals, such as consumers in e-commerce and patients in clinical trials.
Degree
M.S.
Rights
OpenAccess.
License
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 License.
