Machine and Deep Learning Approach for Type 2 Diabetes Prediction Using the CDC’s BRFSS Dataset: A Retrospective Analysis
Type 2 diabetes mellitus (T2DM) is a complex metabolic disease characterized by persistent hyperglycemia caused by insulin resistance. It is the most prevalent type of diabetes mellitus (DM). T2DM has a heterogeneous etiology with social, environmental, behavioral, and genetic risk factors. It is associated with serious microvascular and macrovascular complications, which in turn are associated with increased morbidity, mortality, and health expenditure. However, early detection, lifestyle changes, and treatment may prevent or delay the onset of these long-term complications. This study used the 2020 Behavioral Risk Factor Surveillance System (BRFSS) dataset to train several machine learning (ML) models and a neural network (NN; multilayer perceptron classifier) model and to test their performance in predicting the risk of T2DM. A copy of the dataset was transformed to have balanced classes in the outcome variable, so that the performance of each predictive model could be compared when trained on either the original or the transformed dataset. A cross-sectional analysis using the chi-square test was employed to investigate the association of selected predictors or risk factors with T2DM. Metrics used to assess model performance included accuracy, area under the receiver operating characteristic curve (ROC-AUC), precision, recall, and F1-score. When models were trained on the original training dataset (data with significant class imbalance in the outcome variable), accuracy ranged from 71.6% to 81%, ROC-AUC from 0.57 to 0.75, precision from 0% to 55.7%, recall from 0% to 38.3%, and F1-score from 0% to 38%. ROC-AUC was 0.57 for the Decision Tree Classifier (DT), 0.65 for the K-Nearest Neighbors Classifier (KNN), and 0.68 for the Support Vector Classifier (SVC), values that indicate failed or poor predictive models, even though these models had satisfactory or good accuracy. Training on the original training dataset caused the models to overfit the majority class.
Thus, they had poor recall (sensitivity), precision, and F1-score values, which are crucial for evaluating true positive, false positive, and false negative predictions for T2DM. The time each model took to train on the training data and to score on the test data was also evaluated: SVC had the longest times for both training and scoring, while the NN model was slow to train but faster to score. When models were trained on the transformed data (data with balanced outcome variable classes), accuracy ranged from 66.7% to 82.5%, ROC-AUC from 0.73 to 0.91, precision from 66.9% to 79.7%, recall from 66.4% to 92.1%, and F1-score from 66.5% to 83.2%. This comparison showed the Random Forest Classifier (RF) to be the best-performing model, with a consistently good to excellent fit across all metrics (accuracy: 82.5%, ROC-AUC: 0.91, precision: 79.7%, recall: 87.0%, and F1-score: 83.2%). The Gaussian Naïve Bayes classifier (GNB) had the poorest fit across all metrics. Again, SVC was the worst model in terms of time. All models showed a substantial increase in recall, precision, and F1-score values, suggesting that significant outcome class imbalance has a negative effect on all models. RF, KNN, and DT had F1-score values of 83.2%, 80.9%, and 78.7%, recall values of 87.0%, 92.1%, and 83.0%, and precision values of 79.7%, 72.2%, and 74.7%, respectively. Of all models, RF, KNN, and DT showed high performance across all metrics. KNN had the fastest training time but the longest testing time, while RF and DT were slightly slower to train but fast to test. These models are good candidates for initial T2DM screening, but RF is the model of choice.
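The relationship between the metrics reported above can be illustrated with a short Python sketch. It is not the code used in the study; the function names are hypothetical, and the confusion-matrix counts in the second part are invented for illustration only (they are not taken from the BRFSS data). The first part recomputes F1-score from the reported precision and recall of RF, KNN, and DT; the second part shows how a model that always predicts the majority class can reach high accuracy while recall, precision, and F1-score are all zero.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall; defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def metrics_from_counts(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1_from_precision_recall(precision, recall),
    }


# (precision, recall, reported F1) for the balanced-data models in the abstract.
reported = {
    "RF": (0.797, 0.870, 0.832),
    "KNN": (0.722, 0.921, 0.809),
    "DT": (0.747, 0.830, 0.787),
}
for name, (precision, recall, reported_f1) in reported.items():
    recomputed = f1_from_precision_recall(precision, recall)
    print(f"{name}: recomputed F1 = {recomputed:.3f}, reported F1 = {reported_f1:.3f}")

# Hypothetical imbalanced test set (illustrative counts, NOT from the study):
# 850 negatives, 150 positives, and a model that predicts "negative" for everyone.
majority_only = metrics_from_counts(tp=0, fp=0, fn=150, tn=850)
print(majority_only)
```

The recomputed F1-scores agree with the reported ones to within rounding of the published precision and recall values. The hypothetical majority-class model mirrors the pattern described for the imbalanced training data: accuracy looks acceptable, yet no positive case is detected.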
Table of Contents
Introduction -- Review of literature -- Methodology -- Results -- Conclusions/Discussion -- Appendix
M.S. (Master of Science)