Addressing the scarcity of high-quality biomedical data for machine learning through complexity minimization strategies
Format
Thesis
Abstract
The exponential growth of biomedical data masks a persistent challenge for biomedical informaticians pursuing machine learning research: the shortage of high-quality, well-integrated data required to train advanced machine learning models. This scarcity stems from several factors, including the high cost of data collection, the heterogeneity resulting from varied data collection and recording practices, and patient privacy regulations that restrict data aggregation. The consequence is a disconnected set of small, independent datasets that typically cannot be combined. Although larger collections exist, such as aggregated Electronic Health Records (EHRs), their utility is hampered by inconsistent, incomplete, and incorrect data that reflect their primary operational use rather than research standards. As a result, researchers are often forced to work with small but high-quality research datasets when training advanced machine learning models.
To address the challenges associated with the small size of many biomedical datasets, I developed Adaptive Complexity Deep Neural Networks (ACDNNs), an architecture designed to explicitly minimize model complexity during training. This design philosophy drives the network to discover the simplest predictive rules necessary for accurate classification, enabling ACDNNs to outperform conventional deep learning techniques on small yet complex biomedical datasets. The inherent complexity reduction in ACDNNs also enhances their ability to identify significant predictive features. I apply this property to uncover protective genetic variants associated with Autism Spectrum Disorder (ASD) within a complex, non-linear polygenic framework.
I also propose Random Order AutoRegressive (ROAR) models to tackle the integration of heterogeneous multi-modal biomedical data. Instead of reducing model complexity, ROAR reduces the complexity of its input data by encoding it into a hierarchical binary representation. Each added bit in this hierarchy attempts to encode the maximum amount of residual semantic information, naturally clustering semantically similar inputs. This strategy proves particularly effective in the medical domain, where it identifies homogeneous patient subgroups of potential clinical relevance. ROAR also demonstrates significant generative capabilities, allowing new samples to be generated using identified clusters as "prompts". I demonstrate this capability by using a pre-trained model to synthesize novel drug molecules with targeted properties without the need for task-specific retraining.
This work presents two novel solutions, ACDNNs and ROAR models, that not only address key limitations in current biomedical machine learning practices but also pave the way for more robust, interpretable, and efficient data-driven discoveries in medicine. These approaches promise to enhance researchers' ability to leverage sparse, heterogeneous biomedical data, accelerating the translation of computational findings into tangible clinical insights and therapeutic advancements.
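The idea of a hierarchical binary representation, in which inputs that share a longer bit prefix fall into the same finer-grained cluster, can be illustrated with a toy sketch. The abstract does not describe how ROAR learns its codes, so the code below is not ROAR: it substitutes a much simpler, non-learned technique (recursive mean splits on the highest-variance feature) purely to show what such a code looks like. The function names `hierarchical_codes` and `clusters_at` and the toy data are hypothetical.

```python
from statistics import fmean, pvariance


def hierarchical_codes(points, depth):
    """Assign each point a binary code of up to `depth` bits.

    Illustrative stand-in only (NOT the ROAR model): each bit is chosen by
    splitting the current group at the mean of its highest-variance feature.
    """
    codes = [""] * len(points)
    n_features = len(points[0])

    def split(indices, level):
        if level == depth or len(indices) < 2:
            return
        # Pick the feature that varies most within this group.
        axis = max(
            range(n_features),
            key=lambda d: pvariance([points[i][d] for i in indices]),
        )
        cut = fmean(points[i][axis] for i in indices)
        low = [i for i in indices if points[i][axis] <= cut]
        high = [i for i in indices if points[i][axis] > cut]
        if not low or not high:
            return  # group cannot be separated further; its code stays shorter
        for i in low:
            codes[i] += "0"
        for i in high:
            codes[i] += "1"
        split(low, level + 1)
        split(high, level + 1)

    split(list(range(len(points))), 0)
    return codes


def clusters_at(codes, n_bits):
    """Group point indices by the first `n_bits` bits of their code."""
    groups = {}
    for index, code in enumerate(codes):
        groups.setdefault(code[:n_bits], []).append(index)
    return groups


if __name__ == "__main__":
    # Hypothetical toy data: two coarse groups along x, each with two
    # subgroups along y.
    toy_points = [
        (0.0, 0.0), (0.2, 0.1), (0.0, 3.0), (0.1, 3.2),
        (10.0, 0.0), (10.1, 0.2), (10.0, 3.0), (10.2, 3.1),
    ]
    toy_codes = hierarchical_codes(toy_points, depth=2)
    print(clusters_at(toy_codes, 1))  # coarse clusters from the first bit
    print(clusters_at(toy_codes, 2))  # finer clusters from two bits
```

In this toy setting the first bit separates the two coarse groups and the second bit separates the subgroups within each, so truncating a code to a shorter prefix yields a coarser clustering of the same inputs.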
Degree
Ph. D.
