Evaluation of machine learning algorithms for prediction of 5-year survivability for seven types of cancers

Kotapati, Teja Venkat Pavan

View/Open

KotapatiTejaResearch.pdf (1.379Mb)

Date

2020

Format

Thesis

Abstract

Much research has applied machine learning models to the prediction of cancer survivability, yet it remains a challenging task. The main focus of this project is a comprehensive evaluation of machine learning models across multiple cancer cohorts to find the models with better predictive capability. Class balancing techniques, namely oversampling and undersampling, were applied to improve the performance of cancer survival prediction. The SEER cancer dataset (1973-2015) was used for this project. After preprocessing, we included a total of 21 independent variables and one dependent variable. Multiple machine learning models, such as Decision Trees, Logistic Regression, Naive Bayes, Support Vector Machine, Random Forests and Multi-Layer Perceptron, were implemented. Stratified 10-fold cross-validation was used to eliminate bias between training and testing data. In the experimental design, every machine learning model was applied to each of the seven cancer cohorts, using all eligible records in the cohort as well as each of the two sampling techniques for class balancing. Model performance was compared on Sensitivity, Accuracy, Specificity, Precision, F1 score and AUC. A total of 168 experimental models were designed and implemented. Comparison of the predictive models showed that Random Forests predicted cancer survivability best, followed by Support Vector Machine, Logistic Regression, Decision Trees and Multi-Layer Perceptron, with Naive Bayes performing worst. The results also clearly indicated that class balancing techniques improved the performance of the models significantly.
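The evaluation steps named in the abstract (stratified folds, class balancing by random oversampling or undersampling, and confusion-matrix metrics) can be illustrated with a minimal standard-library sketch. This is not the thesis code. The function names, the toy labels, the encoding of 1 as "survived five years", and the choice to rebalance only the training folds are all assumptions made for illustration. AUC is left out because it needs predicted scores rather than hard labels.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Split record indices into k folds that keep the class ratio (hypothetical helper)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(k)]
    for members in by_class.values():
        rng.shuffle(members)
        # Deal each class round-robin so every fold gets its share.
        for pos, idx in enumerate(members):
            folds[pos % k].append(idx)
    return folds


def rebalance(indices, labels, mode, seed=0):
    """Random oversampling (mode='over') or undersampling (mode='under')."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx in indices:
        by_class[labels[idx]].append(idx)
    sizes = [len(members) for members in by_class.values()]
    target = max(sizes) if mode == "over" else min(sizes)
    balanced = []
    for members in by_class.values():
        if len(members) >= target:
            # Keep a random subset (all records when already at target size).
            balanced.extend(rng.sample(members, target))
        else:
            # Duplicate randomly chosen minority records up to the target.
            balanced.extend(members + rng.choices(members, k=target - len(members)))
    return balanced


def binary_metrics(y_true, y_pred):
    """Confusion-matrix metrics; assumes label 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num, den):
        return num / den if den else 0.0

    sensitivity = ratio(tp, tp + fn)
    precision = ratio(tp, tp + fp)
    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "sensitivity": sensitivity,
        "specificity": ratio(tn, tn + fp),
        "precision": precision,
        "f1": ratio(2 * precision * sensitivity, precision + sensitivity),
    }


# Toy, made-up labels: 80 positive and 20 negative records (not SEER data).
toy_labels = [1] * 80 + [0] * 20
toy_folds = stratified_folds(toy_labels, k=10)
train_idx = [i for fold in toy_folds[1:] for i in fold]  # fold 0 held out for testing
oversampled = rebalance(train_idx, toy_labels, "over")
undersampled = rebalance(train_idx, toy_labels, "under")
```

In this sketch the held-out fold keeps its original class ratio, so the reported metrics reflect the imbalanced data a model would actually face. Whether the thesis balanced before or after splitting is not stated in the abstract.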

URI

https://hdl.handle.net/10355/86544
https://doi.org/10.32469/10355/86544

Degree

M.S.

Thesis Department

Health informatics program (MU)

Rights

OpenAccess.

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 License. Copyright held by author.