Explainable cohort discoveries driven by exploratory data mining and efficient risk pattern detection
Format
Thesis
Abstract
Finding small homogeneous subgroup cohorts in a large heterogeneous population is a critical process for hypothesis development within a broad range of applications, such as fraud detection, ad targeting, and geospatial traffic intervention. Most recently, cohort discovery has begun to play an important role in medical research, as it has contributed to the targeting of high-need patients from smaller homogeneous subgroups for precision health with better outcomes. Specifically, there has been a rising demand to identify cohorts and their corresponding risk factors in precision medicine and preventive healthcare, both to better understand the etiology of diseases and to tailor treatments to targeted patients. There is a clear need to discover novel cohorts and risk factors in the abovementioned application areas. Unfortunately, current computational approaches still lack robust answers to the question: "which subgroups are likely to be novel and may benefit from interventions that are likely to be effective for the selected population?" Additionally, the majority of prevention research has focused on single or simple factor identification. Only a few studies have considered complex risk factors, and they are still at a preliminary stage. The development of machine learning and data mining algorithms sheds light on many areas. However, most high-performing approaches do not provide the interpretability required for eXplainable artificial intelligence (XAI). These black box approaches often provide a predictive analytic capability to determine which class samples belong to. Such supervised classification requires pre-set labels in the data and does not explore sub-clusters. There is a need to develop innovative, data-driven, explainable cohort discovery approaches.
To bridge the knowledge gap, we developed a novel subgroup discovery method which employs a deep exploratory mining process to slice and dice thousands of potential subpopulations and prioritize potential cohorts based on their explainable contrast patterns. Computational experiments were conducted on both synthesized data and a clinical autism dataset to assess performance quantitatively for coverage of pre-defined cohorts and qualitatively for novel knowledge discovery, respectively. Furthermore, scaling analysis was conducted using a distributed computing environment to suggest computational resource needs as the number of subpopulations increases. To address the limitations of current risk factor identification approaches, we further created a novel dynamic tree structure, Risk Hierarchical Pattern Tree (RHPTree), and a top-down search method, RHPSearch, which are both capable of efficiently analyzing a large volume of data. We also introduced two specialized search methods, the extended target search (RHPSearch-TS) and the parallel search approach (RHPSearch-SD), to further speed up the retrieval of certain items of interest. Experiments on both benchmark datasets and real-world data demonstrate that our method is not only faster but also more effective in identifying comprehensive long risk patterns than existing methods. To further address real-world applications of computational work in biomedicine, we developed a multi-layer, unbiased cohort discovery architecture to provide the broad biomedical research community with a computational tool that offers capabilities beyond what traditional unsupervised cohort discovery methods, such as latent class analysis, can achieve. Experiments were conducted on both synthetic datasets and a clinical type 1 diabetes (T1D) dataset to assess the efficiency and discovery capability of the method.
The high coverage, fast speed, and novel findings on the datasets demonstrate that our method is robust and feasible for cohort discovery research. The computational contributions in this dissertation work lay a foundation for eXplainable and actionable artificial intelligence (X2AI) with multiple successful applications in cancer drug repositioning, type 1 diabetes studies, environmental impacts on liver cancer, and the impact of the COVID-19 pandemic.
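The idea of prioritizing subpopulations by their contrast patterns, as described in the abstract, can be illustrated with a small brute-force sketch. This is not the dissertation's method, RHPTree, or RHPSearch; it is a toy enumeration under stated assumptions. The function names (`support`, `mine_contrast_patterns`), the `min_support` threshold, the toy records, and the choice of growth rate (support among cases divided by support among controls) as the contrast measure are all hypothetical and chosen only for illustration.

```python
from itertools import combinations


def support(pattern, rows):
    """Fraction of rows that match every (attribute, value) pair in pattern."""
    if not rows:
        return 0.0
    hits = sum(all(r.get(k) == v for k, v in pattern) for r in rows)
    return hits / len(rows)


def mine_contrast_patterns(records, labels, max_len=2, min_support=0.2):
    """Rank attribute=value conjunctions by growth rate (cases vs. controls).

    Hypothetical brute-force illustration; real methods prune the search
    space rather than enumerating every combination.
    """
    cases = [r for r, y in zip(records, labels) if y == 1]
    controls = [r for r, y in zip(records, labels) if y == 0]
    # All distinct (attribute, value) items, in a deterministic order.
    items = sorted({(k, v) for r in records for k, v in r.items()}, key=repr)
    results = []
    for size in range(1, max_len + 1):
        for pattern in combinations(items, size):
            # A subpopulation cannot take two values of the same attribute.
            if len({k for k, _ in pattern}) < size:
                continue
            supp_case = support(pattern, cases)
            if supp_case < min_support:
                continue
            supp_ctrl = support(pattern, controls)
            growth = supp_case / supp_ctrl if supp_ctrl > 0 else float("inf")
            results.append((pattern, supp_case, supp_ctrl, growth))
    # Highest growth rate first; ties broken by larger support among cases.
    results.sort(key=lambda t: (-t[3], -t[1]))
    return results


# Toy data (fabricated for illustration only): label 1 = case, 0 = control.
records = [
    {"sex": "M", "smoker": "yes"},
    {"sex": "M", "smoker": "yes"},
    {"sex": "F", "smoker": "yes"},
    {"sex": "M", "smoker": "no"},
    {"sex": "F", "smoker": "no"},
    {"sex": "F", "smoker": "no"},
]
labels = [1, 1, 1, 0, 0, 0]

ranked = mine_contrast_patterns(records, labels)
for pattern, supp_case, supp_ctrl, growth in ranked:
    description = " AND ".join(f"{k}={v}" for k, v in pattern)
    print(f"{description}: cases={supp_case:.2f} controls={supp_ctrl:.2f} growth={growth}")
```

Each ranked entry is itself a readable rule (a conjunction of attribute values), which is what makes this family of approaches explainable. The exhaustive enumeration grows combinatorially with the number of attributes, which is the kind of cost that motivates tree-based and parallel search strategies.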
Degree
Ph. D.
