
Research On Text Mining Technology With Imbalanced Data Distribution Based On Weighted Representation

Posted on: 2022-04-23    Degree: Master    Type: Thesis
Country: China    Candidate: P Yue    Full Text: PDF
GTID: 2518306332974029    Subject: Master of Agriculture
Abstract/Summary:
Text mining is a technique for extracting important information from text by classifying text data and extracting keywords. Existing text mining methods based on neural network models are generally trained on balanced data and ignore the case of imbalanced data. This leads a model to pay more attention to samples of the majority classes while neglecting those of the minority classes during training. To address the problems caused by imbalanced data distribution, this thesis applies both random sampling and ensemble learning to improve the performance of existing deep neural models on text mining tasks with imbalanced distributions. A weighted ensemble method is first built for text classification. Then, a weighted label embedding method is proposed for named entity recognition. These methods reduce the negative effects of data imbalance on the model, enhance the feature embeddings, facilitate the training of the neural network model and improve its performance. The specific contents are as follows:

(1) Up-sampling, down-sampling, data synthesis, cross-validation, weighted cross entropy and model ensembling were used together with Google's pretrained language model to model online commentary texts, reducing the bias introduced by data imbalance during model training. The experimental results show that the performance of neural network models trained with up-sampling was significantly improved. Meanwhile, the use of the pretrained language model, weighted cross entropy and cross-validation further improves the model's effectiveness.

(2) For imbalanced data distributions in named entity recognition, this thesis proposes a gating method that weights the label embedding and integrates it into the word embeddings to enhance the semantic representation of the text. Further, a variety of methods for imbalanced data distributions were analyzed and compared, including weighted cross entropy, a weighted conditional random field and a weighted focal loss. The experimental results show that a weighted loss function lets the model dynamically adjust the weights of different classes during training, thus alleviating the negative effects of imbalanced data. Another clear observation is that the introduction of label embedding further improves the performance of the proposed model.
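The abstract does not give the exact gating formulation or class-weighting scheme; a minimal sketch of the two ideas, assuming a sigmoid gate computed from the concatenation of word and label embeddings and inverse-frequency weights for the cross-entropy loss (all names, dimensions and counts below are illustrative, not taken from the thesis), might look like the following:

```python
import torch
import torch.nn as nn

class GatedLabelEmbedding(nn.Module):
    """Weights label embeddings with a learned gate and fuses them into
    word embeddings (illustrative sketch, not the thesis implementation)."""

    def __init__(self, num_labels: int, embed_dim: int):
        super().__init__()
        self.label_embed = nn.Embedding(num_labels, embed_dim)
        # Gate computed from the [word ; label] concatenation
        self.gate = nn.Linear(2 * embed_dim, embed_dim)

    def forward(self, word_emb: torch.Tensor, label_ids: torch.Tensor) -> torch.Tensor:
        lab_emb = self.label_embed(label_ids)                    # (batch, seq, dim)
        g = torch.sigmoid(self.gate(torch.cat([word_emb, lab_emb], dim=-1)))
        return word_emb + g * lab_emb                            # gated fusion

# Weighted cross entropy: inverse-frequency class weights penalize
# mistakes on minority classes more heavily during training.
class_counts = torch.tensor([9000.0, 700.0, 300.0])              # hypothetical label counts
weights = class_counts.sum() / (len(class_counts) * class_counts)
criterion = nn.CrossEntropyLoss(weight=weights)
```

The gated fusion keeps the word embedding intact and only adds as much label information as the gate admits, while the weighted loss shifts the training signal toward minority classes, which matches the imbalance-mitigation goal described above.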
Keywords/Search Tags:Random sampling, Label embedding, Text mining, Weighted loss function