Feature-based Analysis for Open Source using Big Data Analytics
Open source code bases have grown enormously, making it extremely difficult to understand the functionality of individual projects. Existing feature discovery approaches that aim to identify functionality are typically semi-automatic and often require human intervention. In this thesis, an innovative framework is proposed that uses machine learning to automatically and dynamically discover the features, and the components that implement them, of any open source project. The overall goal of the approach is to create an automated and scalable model that produces accurate results. The initial step is to extract the metadata and perform pre-processing. The next step is to dynamically discover topics using Latent Dirichlet Allocation and to form components optimally using K-Means. The final step is to discover the features implemented in the components using the Term Frequency-Inverse Document Frequency (TF-IDF) algorithm. The framework is implemented in Spark, a fast, parallel processing engine for big data analytics. The ArchStudio tool is used to visualize the feature-to-class mapping. As a case study, Apache Solr and Apache Hadoop HDFS are used to illustrate the automatic discovery of components and features. We demonstrate the scalability and accuracy of the proposed model, using a manual evaluation by software architecture experts as a baseline. The accuracy is 85% when compared with the manual evaluation of Apache Solr. In addition, many new features were discovered in both case studies through the automated framework.
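The final step described above, ranking terms within each component by TF-IDF, can be illustrated with a minimal sketch. This is not the thesis implementation, which runs on Spark; it is a plain-Python approximation that assumes each component has already been reduced to a list of tokens. The function name `top_features`, the component names, and the sample tokens are all hypothetical.

```python
import math
from collections import Counter


def top_features(components, k=3):
    """Return the k highest-scoring TF-IDF terms for each component.

    `components` maps a component name to its list of tokens
    (hypothetical input; the thesis derives tokens from project metadata).
    """
    n = len(components)
    # Document frequency: number of components in which each term appears.
    df = Counter(term for tokens in components.values() for term in set(tokens))
    result = {}
    for name, tokens in components.items():
        tf = Counter(tokens)
        # Relative term frequency times log inverse document frequency.
        scores = {
            term: (count / len(tokens)) * math.log(n / df[term])
            for term, count in tf.items()
        }
        # Sort by descending score; break ties alphabetically for stable output.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        result[name] = ranked[:k]
    return result


# Hypothetical token lists standing in for two discovered components.
components = {
    "component_a": ["query", "parser", "query", "index", "search"],
    "component_b": ["block", "replica", "datanode", "block", "index"],
}

features = top_features(components, k=2)
```

Terms shared by every component, such as `index` in this sample, receive a score of zero, so the terms that surface as candidate features are those that distinguish one component from the others.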
Table of Contents
Introduction -- Background and related work -- Framework of feature-based analysis -- Component identification and feature discovery -- Implementation -- Results and evaluation -- Conclusion and future work