Classification of Clinical Tweets Using Apache Mahout
Abstract
There is an increasing amount of healthcare-related data available on
Twitter. Due to Twitter’s popularity, a large number of clinical tweets are
posted on this microblogging platform every day. One interesting problem we face
today is the classification of clinical tweets so that the classified tweets can be
readily consumed by new healthcare applications. While there are several tools
available to classify small datasets, the size of Twitter data demands new tools and
techniques for fast and accurate classification.
Motivated by these reasons, we propose a new tool called Clinical Tweets
Classifier (CTC) to enable scalable classification of clinical content on Twitter.
CTC uses Apache Mahout, and in addition to keywords and hashtags in the tweets,
it also leverages the SNOMED CT clinical terminology and a new tweet influence
scoring scheme to construct high-accuracy models for classification. CTC uses the
Naïve Bayes algorithm. We trained four models based on different feature sets,
including hashtags, keywords, and clinical terms from SNOMED CT. We
selected the training and test datasets based on the influence score of the tweets.
We validated the accuracy of these models using a large number of tweets.
Our results show that using SNOMED CT terms and a training dataset with
more influential tweets yields the most accurate model for classification. We also
tested the scalability of CTC using 100 million tweets on a small cluster.
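The approach summarized above can be illustrated with a minimal single-machine sketch in Python. It is not CTC's implementation and does not use Apache Mahout's API. The abstract does not define the feature extraction or the influence scoring scheme, so the tokenizer, the `select_influential` helper, the numeric scores, and the sample tweets below are all hypothetical placeholders. The sketch only shows the general idea: pick the most influential labeled tweets for training, then fit a multinomial Naïve Bayes model on hashtag and keyword tokens.

```python
import math
import re
from collections import Counter, defaultdict


def features(tweet):
    """Lowercased hashtags and word tokens (hypothetical feature set)."""
    return re.findall(r"#?\w+", tweet.lower())


def select_influential(scored_tweets, k):
    """Keep the k (text, label) pairs with the highest influence score.

    Each input item is (text, label, score); the score is an assumed
    placeholder, not the scoring scheme proposed in the thesis.
    """
    ranked = sorted(scored_tweets, key=lambda t: t[2], reverse=True)
    return [(text, label) for text, label, _ in ranked[:k]]


class NaiveBayes:
    """Multinomial Naive Bayes with additive (Laplace) smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.class_counts = Counter()
        self.token_counts = defaultdict(Counter)
        self.vocab = set()

    def fit(self, examples):
        for text, label in examples:
            self.class_counts[label] += 1
            for tok in features(text):
                self.token_counts[label][tok] += 1
                self.vocab.add(tok)
        return self

    def predict(self, text):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            # Log prior plus smoothed log likelihood of each known token.
            score = math.log(n / total)
            denom = (sum(self.token_counts[label].values())
                     + self.alpha * len(self.vocab))
            for tok in features(text):
                if tok in self.vocab:
                    count = self.token_counts[label][tok]
                    score += math.log((count + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up labeled tweets with made-up influence scores.
SAMPLE = [
    ("New #diabetes insulin study results", "clinical", 0.9),
    ("Managing #asthma with inhaler therapy", "clinical", 0.8),
    ("Great pizza for lunch today", "other", 0.7),
    ("Watching the football game tonight", "other", 0.6),
    ("asthma lol", "other", 0.1),
]

model = NaiveBayes().fit(select_influential(SAMPLE, k=4))
print(model.predict("insulin therapy for #diabetes"))
```

In this toy setup the low-scoring tweet is dropped before training, which mirrors the idea of selecting training data by influence. A Mahout-based pipeline would perform the equivalent steps as distributed jobs over a cluster rather than in memory.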
Table of Contents
Introduction -- Background and related work -- Design and framework -- Evaluation -- Conclusion and future work
Degree
M.S.