Performance evaluation of text augmentation methods with BERT on imbalanced datasets

Format: Thesis

Abstract

Recently, deep learning methods have achieved great success in understanding and analyzing text. In real-world applications, however, labeled text data are often small and class-imbalanced due to the high cost of human annotation, which limits the performance of deep learning classifiers. This study therefore examines the effectiveness of Word2Vec and WordNet augmentation methods with BERT fine-tuning on datasets of various sizes (e.g., 500, 1,000, and 5,000 training documents) and imbalance ratios (e.g., 4:1 and 9:1). It compares them with other methods for imbalanced data, including boosting, SMOTE, and simple oversampling, combined with widely used machine learning models: logistic regression, a fully connected neural network, and an LSTM. Experimental results show that Word2Vec augmentation improves the performance of BERT in detecting the minority class, and the improvement is most significant (a 9-30 percent recall increase over the base model and an 11-12 percent recall increase over the model with oversampling) when the dataset is small (e.g., 500 training documents) and highly imbalanced (e.g., 9:1). As the dataset grows or the imbalance ratio decreases, the improvement from Word2Vec augmentation becomes smaller or insignificant. Moreover, Word2Vec augmentation combined with BERT achieves the best performance among all models and methods compared, demonstrating a promising solution for small, highly imbalanced text classification tasks.
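The Word2Vec augmentation described above typically works by replacing some words in a minority-class document with their nearest neighbors in the embedding space, generating additional synthetic training examples. The following is a minimal illustrative sketch of that idea; the toy embedding table, the `nearest_neighbor` and `augment` helpers, and the replacement probability are all assumptions for demonstration, not the thesis's actual implementation (which would use pretrained Word2Vec vectors over a full vocabulary).

```python
import random
import numpy as np

# Toy embedding table standing in for pretrained Word2Vec vectors.
# A real pipeline would load embeddings trained on a large corpus.
EMBEDDINGS = {
    "good":  np.array([0.90, 0.10, 0.00]),
    "great": np.array([0.88, 0.12, 0.00]),
    "fine":  np.array([0.80, 0.20, 0.05]),
    "bad":   np.array([0.10, 0.90, 0.00]),
    "awful": np.array([0.08, 0.92, 0.05]),
    "movie": np.array([0.00, 0.00, 1.00]),
}

def nearest_neighbor(word):
    """Return the most cosine-similar other word in the vocabulary, or None."""
    if word not in EMBEDDINGS:
        return None
    v = EMBEDDINGS[word]
    best, best_sim = None, -1.0
    for other, u in EMBEDDINGS.items():
        if other == word:
            continue
        sim = float(np.dot(v, u) / (np.linalg.norm(v) * np.linalg.norm(u)))
        if sim > best_sim:
            best, best_sim = other, sim
    return best

def augment(sentence, replace_prob=0.5, seed=0):
    """Create a synthetic example by swapping each in-vocabulary word
    for its nearest embedding neighbor with probability `replace_prob`."""
    rng = random.Random(seed)
    out = []
    for tok in sentence.split():
        nn = nearest_neighbor(tok)
        out.append(nn if nn is not None and rng.random() < replace_prob else tok)
    return " ".join(out)
```

Augmented copies of minority-class documents produced this way would then be added to the training set before fine-tuning BERT, balancing the class distribution without collecting new labels.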

Degree: M.S.
