A study of effectiveness of simulated data in classifying Alzheimer's disease status using MRS parameters
Abstract
In vivo magnetic resonance spectroscopy (MRS) has the potential to identify meaningful differences in the neurochemical profiles of patients with Alzheimer's disease (AD) relative to healthy controls, especially at ultra-high field (7 Tesla). Classification algorithms applied to such data could, in theory, aid in disease diagnosis or serve as an indicator of treatment effectiveness. A common limitation when applying classification algorithms to such data is small sample size, which arises from the difficulty of recruiting individuals with AD. Classification algorithms applied to small datasets may benefit from additional training data simulated from Bayesian networks whose structures are learned via hill climbing. Ultra-high field single-voxel MRS and MRI data from Marjanska et al. (2019) were used to explore the effect that including simulated data in the training of classification algorithms has on the ability to correctly classify AD status [2]. Three hill climbing methods (hill climbing scored via the Bayesian information criterion (BIC), hill climbing scored via the Akaike information criterion (AIC), and max-min hill climbing scored via BIC) were tested on both the original data and data in which each AD observation was duplicated, in search of the network that could produce the highest posterior predictive correlation across ascorbate, percent gray matter, and signal-to-noise ratio. The effect of including data simulated in this manner was then characterized across three classification algorithms (Extreme Gradient Boosting, random forest, and support vector machine) and three datasets (the original AD and control data from Marjanska et al. (2019); the same data expressed as principal components; and the same data with additional controls from the Lifespan Human Connectome Project in Aging used in the training and simulation process) [3].
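As a rough illustration of two ingredients named above, the following Python sketch shows one way to duplicate every AD observation and the textbook forms of the BIC and AIC scores. It is not the thesis's implementation: the function names, the `status` and `ascorbate` field names, and the toy values are all hypothetical, and structure-learning software may report these scores on a rescaled, higher-is-better scale.

```python
import math


def duplicate_ad_rows(rows, label_key="status", ad_label="AD"):
    """Return a new list in which every AD row appears twice and
    every other row appears once (simple over-sampling by duplication)."""
    out = []
    for row in rows:
        out.append(dict(row))
        if row[label_key] == ad_label:
            out.append(dict(row))  # second copy of the AD observation
    return out


def bic(log_likelihood, n_params, n_obs):
    """Textbook BIC: k * ln(n) - 2 * ln(L). Lower is better in this form."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood


def aic(log_likelihood, n_params):
    """Textbook AIC: 2 * k - 2 * ln(L). Lower is better in this form."""
    return 2.0 * n_params - 2.0 * log_likelihood


# Toy rows with made-up values; field names are assumptions for illustration.
toy_rows = [
    {"status": "AD", "ascorbate": 1.1},
    {"status": "control", "ascorbate": 1.4},
    {"status": "control", "ascorbate": 1.3},
]
oversampled = duplicate_ad_rows(toy_rows)
```

Because the BIC penalty grows with the logarithm of the sample size, it exceeds the AIC penalty once there are more than about seven observations, so BIC-scored searches tend to favor sparser networks than AIC-scored ones.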
Variables used to predict AD status were mostly MRS-derived, consisting of 14 water-referenced neurochemical concentrations, signal-to-noise ratio, and linewidth, along with a few MRI-derived variables: percent gray matter, percent white matter, and percent cerebrospinal fluid relative to the volume of interest. The strongest findings regarding structure learning arose from hill climbing scored via BIC using data in which every AD observation was oversampled. The inclusion of data simulated from Bayesian networks whose structures were derived via this method did not broadly lead to higher average sensitivity, specificity, or overall accuracy.
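For readers unfamiliar with the three evaluation measures named above, a minimal Python sketch of how sensitivity, specificity, and overall accuracy can be computed from true and predicted labels follows. The function name, the label strings, and the toy labels are assumptions for illustration only and are not results from the study.

```python
def classification_metrics(y_true, y_pred, positive="AD"):
    """Compute sensitivity, specificity, and accuracy, treating
    `positive` as the disease class and everything else as control."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive and pred == positive:
            tp += 1  # AD correctly classified as AD
        elif truth == positive:
            fn += 1  # AD missed
        elif pred == positive:
            fp += 1  # control wrongly classified as AD
        else:
            tn += 1  # control correctly classified as control
    total = tp + fp + tn + fn
    nan = float("nan")
    return {
        "sensitivity": tp / (tp + fn) if (tp + fn) else nan,
        "specificity": tn / (tn + fp) if (tn + fp) else nan,
        "accuracy": (tp + tn) / total if total else nan,
    }


# Toy labels, made up for illustration.
y_true = ["AD", "AD", "control", "control", "control"]
y_pred = ["AD", "control", "control", "control", "AD"]
metrics = classification_metrics(y_true, y_pred)
```

With small AD samples, sensitivity rests on very few cases, which is one reason to report it alongside specificity and overall accuracy rather than accuracy alone.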
Degree
M.S.