
Research On Text Mining Technology With Imbalanced Data Distribution Based On Weighted Representation

Posted on: 2022-04-23    Degree: Master    Type: Thesis
Country: China    Candidate: P Yue    Full Text: PDF
GTID: 2518306332974029    Subject: Master of Agriculture
Abstract/Summary:
Text mining is a technique for extracting important information from text by classifying text data and extracting keywords. Existing text mining methods based on neural network models are generally trained on balanced data and ignore the case of imbalanced data. This leads a model to pay more attention to samples of the majority classes while neglecting those of the minority classes during training. To address the problems caused by imbalanced data distribution, this thesis applies both random sampling and ensemble learning to improve the performance of existing deep neural models on text mining tasks with imbalanced distributions. A weighted ensemble method is first built for text classification. Then, a weighted label embedding method is proposed for named entity recognition. These methods reduce the negative effects of data imbalance on the model, enhance the feature embeddings, facilitate the training of the neural network model and improve its performance. The specific contents are as follows:

(1) Up-sampling, down-sampling, data synthesis, cross-validation, weighted cross entropy and model ensembling were used together with Google's pretrained language model to model online commentary texts, reducing the bias introduced by data imbalance during model training. The experimental results show that the performance of neural network models trained with up-sampling was significantly improved. Meanwhile, the use of the pretrained language model, weighted cross entropy and cross-validation further improves the model's effectiveness.

(2) For imbalanced data distributions in named entity recognition, this thesis proposes a gating method that weights the label embedding and integrates it into the word embeddings to enhance the semantic representation of the text. Further, a variety of methods for imbalanced data distributions were analyzed and compared, including weighted cross entropy, a weighted conditional random field and a weighted focal loss. The experimental results show that a weighted loss function lets the model dynamically adjust the weights of different classes during training, thus alleviating the negative effects of imbalanced data. Another clear observation is that the introduction of label embedding further improves the performance of the proposed model.
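The abstract does not give the exact gating formulation or class-weighting scheme; a minimal sketch of the two ideas, assuming a sigmoid gate computed from the concatenation of word and label embeddings and inverse-frequency weights for the cross-entropy loss (all names, dimensions and counts below are illustrative, not taken from the thesis), might look like the following:

```python
import torch
import torch.nn as nn

class GatedLabelEmbedding(nn.Module):
    """Weights label embeddings with a learned gate and fuses them into
    word embeddings (illustrative sketch, not the thesis implementation)."""

    def __init__(self, num_labels: int, embed_dim: int):
        super().__init__()
        self.label_embed = nn.Embedding(num_labels, embed_dim)
        # Gate computed from the [word ; label] concatenation
        self.gate = nn.Linear(2 * embed_dim, embed_dim)

    def forward(self, word_emb: torch.Tensor, label_ids: torch.Tensor) -> torch.Tensor:
        lab_emb = self.label_embed(label_ids)                    # (batch, seq, dim)
        g = torch.sigmoid(self.gate(torch.cat([word_emb, lab_emb], dim=-1)))
        return word_emb + g * lab_emb                            # gated fusion

# Weighted cross entropy: inverse-frequency class weights penalize
# mistakes on minority classes more heavily during training.
class_counts = torch.tensor([9000.0, 700.0, 300.0])              # hypothetical label counts
weights = class_counts.sum() / (len(class_counts) * class_counts)
criterion = nn.CrossEntropyLoss(weight=weights)
```

The gated fusion keeps the word embedding intact and only adds as much label information as the gate admits, while the weighted loss shifts the training signal toward minority classes, which matches the imbalance-mitigation goal described above.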
Keywords/Search Tags:Random sampling, Label embedding, Text mining, Weighted loss function