Classification of Clinical Tweets Using Apache Mahout

Wang, Li

Date

2015

Format

Thesis

Abstract

There is an increasing amount of healthcare-related data available on Twitter. Due to Twitter's popularity, a large number of clinical tweets are posted on this microblogging platform every day. One interesting problem we face today is the classification of clinical tweets so that the classified tweets can be readily consumed by new healthcare applications. While there are several tools available to classify small datasets, the size of Twitter data demands new tools and techniques for fast and accurate classification. Motivated by these reasons, we propose a new tool called Clinical Tweets Classifier (CTC) to enable scalable classification of clinical content on Twitter. CTC uses Apache Mahout and, in addition to keywords and hashtags in the tweets, leverages the SNOMED CT clinical terminology and a new tweet influence scoring scheme to construct high-accuracy models for classification. CTC uses the Naïve Bayes algorithm. We trained four models based on different feature sets, such as hashtags, keywords, and clinical terms from SNOMED CT. We selected the training and test datasets based on the influence score of the tweets. We validated the accuracy of these models using a large number of tweets. Our results show that using SNOMED CT terms and a training dataset with more influential tweets yields the most accurate model for classification. We also tested the scalability of CTC using 100 million tweets on a small cluster.
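As a rough illustration of the Naïve Bayes approach the abstract describes, the following is a minimal sketch in plain Python, not the thesis's Mahout-based implementation. It treats hashtags and keywords as features and applies multinomial Naïve Bayes with Laplace smoothing. The class labels ("clinical", "other"), the toy tweets, and the names `extract_features` and `NaiveBayes` are all assumptions made for this example; SNOMED CT term matching and the influence scoring scheme are not modeled here.

```python
import math
import re
from collections import Counter, defaultdict


def extract_features(tweet):
    """Lowercase the tweet and return its hashtags and word tokens."""
    return re.findall(r"#?\w+", tweet.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with Laplace (add-alpha) smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.class_counts = Counter()             # tweets seen per class
        self.token_counts = defaultdict(Counter)  # token counts per class
        self.vocab = set()

    def fit(self, tweets, labels):
        for tweet, label in zip(tweets, labels):
            self.class_counts[label] += 1
            for token in extract_features(tweet):
                self.token_counts[label][token] += 1
                self.vocab.add(token)
        return self

    def predict(self, tweet):
        total_tweets = sum(self.class_counts.values())
        tokens = [t for t in extract_features(tweet) if t in self.vocab]
        best_label, best_score = None, -math.inf
        for label, n_tweets in self.class_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_tweets / total_tweets)
            denom = (sum(self.token_counts[label].values())
                     + self.alpha * len(self.vocab))
            for token in tokens:
                count = self.token_counts[label][token]
                score += math.log((count + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data, invented purely for illustration.
train_tweets = [
    "New insulin therapy guidelines for #diabetes patients",
    "#asthma inhaler dosage study published",
    "Great game tonight #football",
    "Loving this sunny weather #weekend",
]
train_labels = ["clinical", "clinical", "other", "other"]

model = NaiveBayes().fit(train_tweets, train_labels)
print(model.predict("insulin dosage for #diabetes"))
print(model.predict("#football game this weekend"))
```

Tokens that never appeared in training are skipped at prediction time, which is one common design choice; a system at the scale described in the abstract would instead rely on a distributed implementation such as the one Apache Mahout provides.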

Introduction -- Background and related work -- Design and framework -- Evaluation -- Conclusion and future work

URI

https://hdl.handle.net/10355/46336

Degree

M.S.

Thesis Department

Computer Science (UMKC)