Classification of Clinical Tweets Using Apache Mahout
There is an increasing amount of healthcare-related data available on Twitter. Due to Twitter’s popularity, a large number of clinical tweets are posted on this microblogging platform every day. One interesting problem we face today is the classification of clinical tweets so that the classified tweets can be readily consumed by new healthcare applications. While several tools are available to classify small datasets, the size of Twitter data demands new tools and techniques for fast and accurate classification. Motivated by these reasons, we propose a new tool called Clinical Tweets Classifier (CTC) to enable scalable classification of clinical content on Twitter. CTC uses Apache Mahout and, in addition to the keywords and hashtags in the tweets, leverages the SNOMED CT clinical terminology and a new tweet influence scoring scheme to construct high-accuracy models for classification. CTC uses the Naïve Bayes algorithm. We trained four models based on different feature sets, such as hashtags, keywords, and clinical terms from SNOMED CT. We selected the training and test datasets based on the influence score of the tweets. We validated the accuracy of these models using a large number of tweets. Our results show that using SNOMED CT terms and a training dataset with more influential tweets yields the most accurate model for classification. We also tested the scalability of CTC using 100 million tweets on a small cluster.
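To illustrate the kind of classifier the abstract describes, the following is a minimal sketch of multinomial Naïve Bayes over keyword and hashtag tokens, written in plain Python. It is not CTC and does not use Apache Mahout, SNOMED CT, or the influence scoring scheme. The tokenizer, the class name `NaiveBayes`, the labels `clinical` and `other`, and the four training tweets are all hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(tweet):
    # Lowercase the tweet and keep hashtags and plain words as tokens.
    return re.findall(r"#?\w+", tweet.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.class_counts = Counter()             # tweets seen per label
        self.token_counts = defaultdict(Counter)  # token frequencies per label
        self.vocab = set()                        # all tokens seen in training

    def train(self, labeled_tweets):
        for text, label in labeled_tweets:
            self.class_counts[label] += 1
            for tok in tokenize(text):
                self.token_counts[label][tok] += 1
                self.vocab.add(tok)

    def predict(self, text):
        total = sum(self.class_counts.values())
        # Tokens never seen in training carry no evidence, so skip them.
        tokens = [t for t in tokenize(text) if t in self.vocab]
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            score = math.log(count / total)  # log prior
            denom = sum(self.token_counts[label].values()) + len(self.vocab)
            for tok in tokens:
                # Smoothed log likelihood of the token given the label.
                score += math.log((self.token_counts[label][tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training data, for illustration only.
training = [
    ("new #diabetes insulin guidance for patients", "clinical"),
    ("#asthma inhaler dosage study published", "clinical"),
    ("great coffee this morning #monday", "other"),
    ("watching the football game tonight", "other"),
]

model = NaiveBayes()
model.train(training)
first = model.predict("insulin dosage for #diabetes")
second = model.predict("coffee and football tonight")
```

In a sketch like this, changing the feature set amounts to changing what `tokenize` emits, for example hashtags only, or keywords plus terms matched against a clinical vocabulary.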
Table of Contents
Introduction -- Background and related work -- Design and framework -- Evaluation -- Conclusion and future work