Second chance competitive autoencoders for understanding textual data
Date
2021
Abstract
Every day, an enormous amount of text data is produced. Sources of text data include news, social media, emails, text messages, medical reports, scientific publications, and fiction. To keep track of this data, categories, keywords, tags, or labels are assigned to each text. Dimensionality reduction and topic modeling have received considerable attention in text mining. Topic modeling is a statistical technique for revealing the underlying semantic structure in a large collection of documents. Applying conventional autoencoders to textual data often results in learning trivial and redundant representations due to the high dimensionality and sparsity of text and its power-law word distribution.
To address these challenges, we introduce three novel autoencoders: SCAT (Second Chance Autoencoder for Text), SSCAT (Similarity-based SCAT), and CSCAT (Coherence-based SCAT). Our autoencoders use competitive learning among the k winner neurons in the bottleneck layer, which become specialized in recognizing specific patterns, leading to more semantically meaningful representations of textual data. In addition, the SSCAT model introduces a novel competition based on a similarity measure to eliminate redundant features. Our experiments show that SCAT, SSCAT, and CSCAT achieve high performance on several tasks, including classification and topic modeling, compared to LDA, k-Sparse, KATE, NVCTM, ZeroShotTM, and ProdLDA. Moreover, the proposed models are simpler and faster than the established approaches.
This work contributes:
(1) The SCAT autoencoder uses k-competitive learning among the strongest and weakest positive and negative neurons in the bottleneck layer. The novelty stems from involving the weakest neurons in the competition: they might hold meaningful representations but receive low activation values due to random initialization or because they represent rare words or topics.
(2) The SSCAT autoencoder introduces a similarity-based criterion for selecting the neurons eligible to enter the learning competition of the SCAT approach. This process prevents neurons with a high similarity score to more than k/2 other neurons from entering the competition. We hypothesize that eliminating redundant features yields better topic representations.
(3) The CSCAT autoencoder applies a coherence score when selecting the eligible neurons. In this approach, we eliminate neurons whose highest-weighted features do not have a high coherence score.
(4) A thorough evaluation of our autoencoders compared to KATE, k-Sparse, LDA, and NVCTM. The evaluation includes topic modeling, topic coherence score, and document classification using the datasets: 20 Newsgroups, Wiki10+, and Reuters.
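As a rough illustration of the second-chance competition in contribution (1), the sketch below keeps both the strongest and the weakest activated neurons on each sign as winners and reallocates the losers' energy to them. The even split of winners, the amplification factor `alpha`, and the function name are assumptions for illustration, not the authors' exact formulation.

```python
import numpy as np

def second_chance_competition(z, k, alpha=6.0):
    """Sketch of a SCAT-style second-chance k-competition (assumed details).

    Winners are half the strongest and half the weakest neurons on each
    sign; losers are zeroed and their energy, amplified by the hypothetical
    factor `alpha`, is redistributed to the winners.
    """
    z = np.asarray(z, dtype=float)
    winners = set()
    for sign in (1, -1):
        idx = np.where(sign * z > 0)[0]      # neurons of this sign
        if idx.size == 0:
            continue
        order = idx[np.argsort(np.abs(z[idx]))]  # weakest -> strongest
        n = max(1, k // 4)
        winners.update(order[:n].tolist())       # second-chance (weakest)
        winners.update(order[-n:].tolist())      # conventional (strongest)
    winners = np.array(sorted(winners), dtype=int)
    losers = np.setdiff1d(np.arange(z.size), winners)
    # Split the losers' absolute energy evenly among the winners.
    spoils = alpha * np.abs(z[losers]).sum() / max(winners.size, 1)
    out = np.zeros_like(z)
    out[winners] = z[winners] + np.sign(z[winners]) * spoils
    return out
```

In this sketch the weakest active neurons survive the competition alongside the strongest, which is the "second chance" idea: a neuron with a low activation may still carry a rare but meaningful pattern.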
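The similarity-based eligibility test of contribution (2) can be approximated as below: a neuron is barred from the competition when it is too similar to more than k/2 other neurons. Using cosine similarity over the neurons' weight vectors and the `threshold` value are hypothetical choices, since the text does not specify the similarity measure.

```python
import numpy as np

def similarity_eligible(W, k, threshold=0.9):
    """Sketch of an SSCAT-style eligibility filter (assumed details).

    W holds one weight vector per bottleneck neuron (rows). A neuron is
    ineligible when its cosine similarity exceeds `threshold` with more
    than k/2 other neurons, i.e. it is judged redundant.
    """
    W = np.asarray(W, dtype=float)
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    unit = W / np.clip(norms, 1e-12, None)   # row-normalize
    sim = unit @ unit.T                      # pairwise cosine similarity
    np.fill_diagonal(sim, 0.0)               # ignore self-similarity
    redundant = (sim > threshold).sum(axis=1) > k / 2
    return ~redundant                        # True = may enter competition
```

For example, three neurons with identical weight vectors are all flagged as redundant and excluded, while dissimilar neurons remain eligible, matching the stated goal of eliminating redundant features before the competition.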
Table of Contents
Introduction -- Similarity-based second chance autoencoder for textual data -- Coherence-based second chance autoencoders for document understanding -- Personality trait identification: challenges and obstacles -- Public discourse about the opioid crisis on Twitter, 2010-2019 -- Early temporal characteristics of elderly patient cognitive impairment in electronic health records -- Mining news media for understanding public health records -- Conclusion -- Appendix
Degree
Ph.D. (Doctor of Philosophy)