SigSpace – Class-Based Feature Representation for Scalable and Distributed Machine Learning
In the era of big data, it is essential to explore opportunities for discovering knowledge from large datasets. However, traditional machine learning approaches are not well suited to extracting the full value of big data. Specifically, current machine learning research and practice do not fully support some features that are important for big data analytics, such as incremental learning, distributed learning, and fuzzy matching. In this thesis, we propose a unique feature representation named SigSpace. It is designed for class-level incremental learning in support of distributed learning and fuzzy matching. In SigSpace, a class-based model is built by evaluating and extending existing machine learning models, namely K-means and Self-Organizing Maps (SOM). Machine learning with SigSpace is modeled in two ways: as a feature set used with standard machine learning algorithms such as Random Forests and Decision Trees, and as a class model using the L1 (Manhattan distance) and L2 (Euclidean distance) norms.
To provide supporting evidence for the effectiveness of SigSpace, we conducted comprehensive experiments. First, multiple experiments evaluated the SigSpace model in image classification using large-scale image datasets, including Caltech-101, Caltech-256, ImageNet, UEC FOOD 256, and MNIST, with image features such as pixels, SIFT, and Local Binary Patterns. Second, SigSpace was evaluated in the audio classification context with essential audio features extracted from real-time audio datasets. The SigSpace system was implemented using a big data analytics tool, Apache Spark (MLlib), with the capability of parallel and distributed learning and recognition. The multinomial classification experiments covered 6 to 1000 classes, space requirements from megabytes to terabytes, and learning times ranging from minutes to days.
Although overall accuracy decreases slightly (by approximately 5%), SigSpace is very efficient in terms of both space and runtime performance for learning and recognition. Thus, the current evaluation confirms that SigSpace is a significant approach for distributed and scalable machine learning with big data.
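The class-level idea in the abstract can be illustrated with a minimal Python sketch. It rests on a simplifying assumption: each class signature is just the running mean of that class's feature vectors. The abstract does not say how SigSpace computes its signatures (it mentions K-means and SOM), so this is not the thesis's method, and the names ClassSignature, add, merge, and classify are hypothetical. The sketch only shows why a per-class summary lends itself to incremental updates, to merging across distributed partitions, and to matching with the L1 or L2 norm.

```python
import math
from dataclasses import dataclass, field


@dataclass
class ClassSignature:
    """Hypothetical per-class summary: running sum and count of feature vectors."""
    total: list = field(default_factory=list)
    count: int = 0

    def add(self, features):
        # Incremental learning: fold one new sample into this class only.
        if not self.total:
            self.total = [0.0] * len(features)
        self.total = [t + x for t, x in zip(self.total, features)]
        self.count += 1

    def merge(self, other):
        # Distributed learning: combine summaries built on separate partitions.
        if other.count == 0:
            return ClassSignature(list(self.total), self.count)
        if self.count == 0:
            return ClassSignature(list(other.total), other.count)
        summed = [a + b for a, b in zip(self.total, other.total)]
        return ClassSignature(summed, self.count + other.count)

    def mean(self):
        # Assumed signature: the mean feature vector of the class.
        return [t / self.count for t in self.total]


def l1(a, b):
    """Manhattan distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def classify(signatures, features, distance=l2):
    # Return the label whose signature is nearest to the sample.
    # Classes with no samples yet are skipped.
    trained = {k: s for k, s in signatures.items() if s.count > 0}
    return min(trained, key=lambda k: distance(trained[k].mean(), features))
```

Under this assumption, adding a new class or new samples touches only that class's summary, and summaries computed on different workers can be combined with merge without revisiting the raw data.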
Table of Contents
Introduction -- Background and related work -- Proposed solution: SigSpace -- Implementation and evaluation -- Conclusion and future work