    Second chance competitive autoencoders for understanding textual data

    Goudarzvand, Saria
    Date
    2021
    Abstract
    Every day, an enormous amount of text data is produced. Sources include news, social media, emails, text messages, medical reports, scientific publications, and fiction. To keep track of this data, categories, keywords, tags, or labels are assigned to each text. Dimensionality reduction and topic modeling have received a lot of attention in text mining. Topic modeling is a statistical technique for revealing the underlying semantic structure in a large collection of documents. Applying conventional autoencoders to textual data often yields trivial and redundant representations, because text is high-dimensional and sparse and its word frequencies follow a power-law distribution. To address these challenges, we introduce three novel autoencoders: SCAT (Second Chance Autoencoder for Text), SSCAT (Similarity-based SCAT), and CSCAT (Coherent-based SCAT). Our autoencoders use competitive learning among the k winner neurons in the bottleneck layer, which become specialized in recognizing specific patterns, leading to more semantically meaningful representations of textual data. In addition, the SSCAT model introduces a novel competition based on a similarity measurement to eliminate redundant features. Our experiments show that SCAT, SSCAT, and CSCAT achieve high performance on several tasks, including classification and topic modeling, compared to LDA, k-Sparse, KATE, NVCTM, ZeroShotTM, and ProdLDA. Additionally, the proposed models are simpler and faster than the established approaches. This work contributes:
    (1) The SCAT autoencoder, which applies k-competitive learning among the strongest and weakest neurons, both positive and negative, in the bottleneck layer. The novelty stems from involving the weakest neurons in the competition process; these neurons might hold meaningful representations but receive low activation values because of random initialization or because they represent rare words or topics.
    (2) The SSCAT autoencoder, which introduces a similarity-based criterion for selecting the neurons that are eligible to enter the learning competition provided by the SCAT approach. A neuron that has a high similarity score with more than k/2 other neurons is prevented from entering the competition. We hypothesize that eliminating redundant features will result in better topic representation.
    (3) The CSCAT autoencoder, which applies a coherence score to select the eligible neurons. In this approach, we eliminate neurons whose top features do not have a high coherence score.
    (4) A thorough evaluation of our autoencoders compared to KATE, k-Sparse, LDA, and NVCTM. The evaluation covers topic modeling, topic coherence score, and document classification on the 20 Newsgroups, Wiki10+, and Reuters datasets.
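    The similarity-based eligibility rule and the k-winner competition described in the abstract can be illustrated with a minimal Python sketch. It is not the dissertation's implementation. The function names, the use of cosine similarity between neuron weight vectors, the 0.9 threshold, and the even split of k between positive and negative winners are all assumptions made for illustration. The sketch also does not model the second-chance mechanism for the weakest neurons, which the abstract does not specify.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def eligible_neurons(weights: Sequence[Sequence[float]], k: int,
                     threshold: float = 0.9) -> List[int]:
    """Return indices of neurons that are highly similar to at most k/2 others.

    `weights[i]` is a hypothetical weight vector for bottleneck neuron i.
    `threshold` (assumed) defines what counts as a "high" similarity score.
    """
    keep = []
    for i, w_i in enumerate(weights):
        similar = sum(
            1 for j, w_j in enumerate(weights)
            if j != i and cosine(w_i, w_j) > threshold
        )
        if similar <= k / 2:
            keep.append(i)
    return keep


def k_winners(activations: Sequence[float], eligible: Sequence[int],
              k: int) -> List[float]:
    """Keep the k//2 largest positive and k//2 most negative eligible
    activations and set every other activation to zero."""
    half = k // 2
    pos = sorted((i for i in eligible if activations[i] > 0),
                 key=lambda i: activations[i], reverse=True)[:half]
    neg = sorted((i for i in eligible if activations[i] < 0),
                 key=lambda i: activations[i])[:half]
    winners = set(pos) | set(neg)
    return [a if i in winners else 0.0 for i, a in enumerate(activations)]


# Toy example: neurons 0-2 are near-duplicates, neurons 3 and 4 are distinct.
weights = [[1.0, 0.0], [1.0, 0.01], [0.99, 0.0], [0.0, 1.0], [-1.0, 0.5]]
activations = [0.9, 0.8, 0.7, 0.2, -0.4]
eligible = eligible_neurons(weights, k=2)
sparse_code = k_winners(activations, eligible, k=2)
```

    With k = 2, each of the three near-duplicate neurons is highly similar to two others, which exceeds k/2 = 1, so only neurons 3 and 4 compete. The redundant neurons are silenced even though their activations are the largest, which is the effect the SSCAT contribution describes.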
    Table of Contents
    Introduction -- Similarity-based second chance autoencoder for textual data -- Coherence-based second chance autoencoders for document understanding -- Personality trait identification: challenges and obstacles -- Public discourse about the opioid crisis on Twitter, 2010-2019 -- Early temporal characteristics of elderly patient cognitive impairment in electronic health records -- Mining news media for understanding public health records -- Conclusion -- Appendix
    URI
    https://hdl.handle.net/10355/89578
    Degree
    Ph.D. (Doctor of Philosophy)
    Thesis Department
    Computer Science (UMKC)
     
    Electrical Engineering (UMKC)
     
    Collections
    • Computer Science and Electrical Engineering Electronic Theses and Dissertations (UMKC)
    • 2021 UMKC Dissertations - Freely Available Online

    If you encounter harmful or offensive content or language on this site please email us at harmfulcontent@umkc.edu. To learn more read our Harmful Content in Library and Archives Collections Policy.

    hosted by University of Missouri Library Systems
     

     

