SigSpace-Text: Parallel and Distributed Signature Learning in Text Analytics
Abstract
Big data analytics uncovers hidden patterns and useful information in big data, but it is a complex and time-consuming process. Recent advances in parallel and distributed approaches have driven the evolution of big data analytics. It has also been argued that bigger data is not always better data. Scalable solutions for big data analytics therefore call for a scalable and dynamic process that works with more representative and relevant sets of data. We envision that if a condensed, representative sample can be drawn from very large-scale datasets in a parallel and distributed manner, a process we define as signature learning, this approach can provide more accurate results efficiently. By applying signature learning to relevant datasets in a parallel and distributed manner, the complexity of big data problems can be reduced.
In this thesis, we propose the SigSpace-Text framework, an extension of our previous signature-based learning model (SigSpace), which demonstrated the effectiveness of signature-based classification with image and audio signatures. SigSpace was not feasible for text data because of problems inherent to the text domain, such as a high-dimensional feature space and sparse feature vectors. To handle these issues, we explore Natural Language Processing techniques for feature extraction and feature selection (TF-IDF, Word2Vec). Signature learning in SigSpace-Text is based on a class-level clustering approach, in which a generic pattern is identified for each category using state-of-the-art clustering algorithms, namely K-Means, Self-Organizing Maps (SOM), and Gaussian Mixture Models (GMM). These signatures are used, instead of the raw data, as the feature set for classification. Through this extension, the proposed SigSpace-Text approach brings vital, practical information to signature learning approaches on several text classification tasks. The SigSpace-Text model supports incremental, distributed, and parallel learning using big data analytics tools, including Apache Spark and its machine learning library, Spark MLlib. In experiments with the SigSpace-Text framework, the effectiveness of the proposed signature learning model was evaluated for various parameters (such as the signature size, the classification algorithm, and local versus global signatures) and was validated with a number of classification algorithms (Naïve Bayes, Decision Trees, and Random Forests) using the 20 Newsgroups dataset. Based on these observations, we find that SigSpace-Text outperforms state-of-the-art results on this dataset.
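The class-level clustering idea described above can be illustrated with a minimal sketch. It is not the thesis implementation: it uses plain Python rather than Apache Spark and Spark MLlib, a six-sentence toy corpus invented for illustration, and a nearest-signature cosine rule in place of the Naïve Bayes, Decision Tree, and Random Forest classifiers named in the abstract. All function names (`tfidf_vectors`, `learn_signatures`, `classify`) are hypothetical.

```python
import math
from collections import Counter, defaultdict


def tfidf_vectors(docs):
    """Turn tokenized documents into sparse TF-IDF dicts (term -> weight)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF; the exact weighting used in the thesis is not specified here.
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (f / len(doc)) * idf[t] for t, f in tf.items()})
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = defaultdict(float)
    for v in vectors:
        for t, w in v.items():
            total[t] += w
    return {t: w / len(vectors) for t, w in total.items()}


def kmeans(vectors, k, iterations=10):
    """Small K-Means variant using cosine similarity; deterministic init."""
    centroids = [dict(v) for v in vectors[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in vectors:
            best = max(range(len(centroids)),
                       key=lambda i: cosine(v, centroids[i]))
            clusters[best].append(v)
        # An empty cluster keeps its previous centroid.
        centroids = [mean_vector(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids


def learn_signatures(vectors, labels, k):
    """Cluster each class separately; the centroids act as its signatures."""
    by_class = defaultdict(list)
    for v, y in zip(vectors, labels):
        by_class[y].append(v)
    return {y: kmeans(vs, min(k, len(vs))) for y, vs in by_class.items()}


def classify(tokens, idf, signatures):
    """Assign the class whose closest signature is most similar (stand-in rule)."""
    tf = Counter(t for t in tokens if t in idf)
    total = sum(tf.values()) or 1
    vec = {t: (f / total) * idf[t] for t, f in tf.items()}
    return max(signatures,
               key=lambda y: max(cosine(vec, s) for s in signatures[y]))


# Toy corpus, invented for illustration only.
docs = [
    "the rocket launch reached orbit".split(),
    "nasa plans a new orbit mission".split(),
    "the satellite launch was delayed".split(),
    "the team won the hockey game".split(),
    "a late goal decided the game".split(),
    "the hockey season starts soon".split(),
]
labels = ["space", "space", "space", "sport", "sport", "sport"]

vectors, idf = tfidf_vectors(docs)
signatures = learn_signatures(vectors, labels, k=2)
print(classify("rocket orbit mission".split(), idf, signatures))
```

Because each class is clustered independently, the per-class step could in principle run in parallel, which is the property the abstract attributes to the Spark-based framework. Here the six documents are condensed into four signatures, two per class.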
Table of Contents
Introduction -- Background and related work -- Proposed solution -- Implementation -- Results and evaluation -- Conclusion and future work
Degree
M.S.