
Research On Multi-label Text Classification For Imbalanced Data

Posted on: 2021-08-18
Degree: Master
Type: Thesis
Country: China
Candidate: M R Wang
Full Text: PDF
GTID: 2518306497966489
Subject: Computer Science and Technology

Abstract/Summary:
Multi-label text classification, as an important and challenging task in natural language processing, has received widespread attention. Mining multi-label text information helps in understanding the complex semantics of multi-label text. At the same time, the class imbalance present in real-world multi-label text data severely degrades the performance of multi-label text classification. Exploring multi-label text classification methods for imbalanced data therefore has both theoretical value and practical significance.

At present, resampling, which includes undersampling and oversampling, is an important technique for handling imbalanced data. Undersampling tends to discard important information, while oversampling tends to break semantic consistency. When neural network models are used to solve multi-label text classification problems, most research on training optimization methods is tied to specific models or methods and is not general. Meanwhile, the sequence-to-sequence generation model is a novel and effective approach to multi-label text classification, but the Encoder of existing models has insufficient text representation capability, and cumulative errors affect the classification results.

To address these problems, a boundary mixed resampling method was used to balance the imbalanced samples; combining text label correlation mining, a neural network training optimization method was designed; and finally, a sequence-to-sequence generation model based on dynamic routing was proposed for multi-label text classification. The main work and results are as follows:

1) Research on the boundary mixed resampling method for imbalanced data. Based on the high-dimensional characteristics of text data, a matrix-model symmetry rate was designed to divide the samples into sparsely distributed boundary regions and densely distributed non-boundary regions. For minority-class samples in the boundary region, an oversampling approach based on multi-granularity text augmentation was designed to preserve textual semantic consistency. For majority-class samples in the non-boundary regions, text clustering based on frequent itemsets was used, and a proportional random undersampling method within each cluster was designed to avoid losing important information as much as possible and thereby preserve the model's generalization ability. Finally, a boundary mixed resampling strategy was built on this work to obtain a more balanced sample dataset, which is used as the input of the model in 3); a simplified sampling sketch is given below.

2) Research on a neural network training optimization method based on text label correlation mining. Combining text label correlation mining, a neural network training optimization method for multi-label text classification was proposed. It contains three parts: a label co-occurrence matrix was constructed, and a weight initialization method was designed for the connection from the final hidden layer to the output layer of the neural network, so as to raise the output probability of frequently co-occurring label combinations; using the label misclassification cost as a cost-sensitive factor, a label-weighted cost-sensitive loss function was established so that the objective function converges to a low-cost region; and, combined with the label co-occurrence frequency, an adaptive slanted triangular learning rate was designed for more accurate convergence. On this basis, a neural network training optimization method based on text label correlation mining was designed. This method is general and does not significantly increase computing resource consumption, and it serves as the model training optimization method in 3); illustrative sketches of the loss weighting and learning-rate schedule are also given below.
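The matrix-model symmetry rate and the multi-granularity text augmentation from 1) are not specified in the abstract, so the following is only a minimal sketch of the two sampling directions under simplified assumptions: the clusters are assumed to be given (e.g., from frequent-itemset clustering), and `augment` stands in for any semantics-preserving text augmentation. All function names are hypothetical.

```python
import random

def proportional_undersample(clusters, keep_rate, seed=0):
    """Within-cluster proportional random undersampling: keep roughly
    keep_rate of each cluster so that no cluster is discarded entirely."""
    rng = random.Random(seed)
    kept = []
    for cluster in clusters:                      # each cluster: list of texts
        k = max(1, int(round(len(cluster) * keep_rate)))
        kept.extend(rng.sample(cluster, k))
    return kept

def boundary_oversample(minority_samples, target_size, augment, seed=0):
    """Oversample boundary-region minority samples by applying a
    semantics-preserving augmentation until target_size is reached."""
    rng = random.Random(seed)
    out = list(minority_samples)
    while len(out) < target_size:
        out.append(augment(rng.choice(minority_samples)))
    return out

# Toy usage with an identity "augmentation" as a placeholder:
clusters = [["doc a", "doc b", "doc c", "doc d"], ["doc e", "doc f"]]
majority_kept = proportional_undersample(clusters, keep_rate=0.5)
minority_grown = boundary_oversample(["rare doc"], target_size=3, augment=lambda t: t)
```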
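The exact form of the label-weighted cost-sensitive loss and of the adaptive slanted triangular learning rate in 2) is not given in the abstract. The sketch below assumes a per-label misclassification-cost vector (e.g., derived from label frequency or co-occurrence) and shows one plausible weighting of binary cross-entropy in PyTorch, together with the standard slanted triangular schedule that the adaptive version presumably builds on; `label_costs`, `cut_frac`, and `ratio` are assumptions, not the thesis's parameters.

```python
import torch
import torch.nn.functional as F

def label_weighted_bce(logits, targets, label_costs):
    """Binary cross-entropy over labels where each label's loss term is scaled
    by its misclassification cost (higher cost for rare or important labels).
    logits, targets: (batch, num_labels); label_costs: (num_labels,)."""
    per_label = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    return (per_label * label_costs).mean()

def slanted_triangular_lr(step, total_steps, lr_max, cut_frac=0.1, ratio=32):
    """Slanted triangular schedule: short linear warm-up, long linear decay.
    The thesis adapts this schedule with label co-occurrence frequency;
    that adaptation is omitted here."""
    cut = max(1, int(total_steps * cut_frac))
    if step < cut:
        p = step / cut
    else:
        p = 1 - (step - cut) / max(1, total_steps - cut)
    return lr_max * (1 + p * (ratio - 1)) / ratio

# Toy usage:
logits = torch.randn(4, 5)
targets = torch.randint(0, 2, (4, 5)).float()
costs = torch.tensor([1.0, 2.0, 4.0, 1.0, 8.0])   # e.g. inverse label frequency
loss = label_weighted_bce(logits, targets, costs)
lr = slanted_triangular_lr(step=100, total_steps=1000, lr_max=1e-3)
```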
3) Research on a sequence-to-sequence generation model with dynamic routing for multi-label text classification. Treating multi-label text classification as label-sequence generation, a sequence-to-sequence generation model based on dynamic routing, named DR-SGM, was designed. To address the Encoder's insufficient text representation ability, a penalized dynamic routing was designed in the Encoder layer to optimize the multi-head attention mechanism, and an Encoder based on this multi-head attention mechanism was constructed. To address the problem of cumulative error, sparsemax and iterative weights w were used in the Decoder layer to optimize the dynamic routing process, a dynamic routing aggregation layer was added, and the impact of cumulative error was further weakened by globally sharing the routing parameters; on this basis, a Decoder based on dynamic routing was constructed. Finally, combined with the research in 1) and 2), a multi-label text classification method for imbalanced samples was proposed; a sketch of sparsemax-based routing is given below.

4) Experiments and analysis. The F1 value, Hamming Loss, and G-mean were selected as evaluation metrics. Experiments were designed for the boundary mixed resampling method, the neural network training optimization method, and the DR-SGM model, and the proposed methods were compared with other methods in the same field on standard datasets. The results show that the multi-label text classification method proposed in this thesis can effectively handle imbalanced data and achieves competitive results; a sketch of the metric computation is also given below.
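DR-SGM's penalized routing, its iterative weights w, and the globally shared routing parameters are not detailed in the abstract, so the sketch below only illustrates the general idea the description in 3) points to: routing-by-agreement in which the coupling coefficients come from sparsemax rather than softmax, so that routing weight can concentrate on a few outputs and irrelevant ones receive exactly zero. This is an illustrative reading, not the thesis's DR-SGM.

```python
import torch

def sparsemax(z, dim=-1):
    """Sparsemax (Martins & Astudillo, 2016): projects z onto the probability
    simplex and, unlike softmax, can assign exact zeros."""
    z_sorted, _ = torch.sort(z, dim=dim, descending=True)
    k = torch.arange(1, z.size(dim) + 1, device=z.device, dtype=z.dtype)
    shape = [1] * z.dim()
    shape[dim] = -1
    k = k.view(shape)
    z_cumsum = z_sorted.cumsum(dim)
    support = 1 + k * z_sorted > z_cumsum
    k_z = support.to(z.dtype).sum(dim=dim, keepdim=True)
    tau = (z_cumsum.gather(dim, k_z.long() - 1) - 1) / k_z
    return torch.clamp(z - tau, min=0)

def squash(s, dim=-1, eps=1e-8):
    """Squashing non-linearity used in capsule-style routing."""
    n2 = (s * s).sum(dim=dim, keepdim=True)
    return (n2 / (1 + n2)) * s / torch.sqrt(n2 + eps)

def sparsemax_routing(u_hat, num_iters=3):
    """Routing-by-agreement with sparsemax coupling coefficients.
    u_hat: (batch, num_in, num_out, dim) prediction vectors."""
    b = torch.zeros(u_hat.shape[:3], device=u_hat.device)
    for _ in range(num_iters):
        c = sparsemax(b, dim=2)                        # (batch, num_in, num_out)
        s = (c.unsqueeze(-1) * u_hat).sum(dim=1)       # (batch, num_out, dim)
        v = squash(s)
        b = b + (u_hat * v.unsqueeze(1)).sum(dim=-1)   # agreement update
    return v

# Toy usage:
v = sparsemax_routing(torch.randn(2, 6, 4, 8))         # -> (2, 4, 8)
```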
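How G-mean is extended to the multi-label setting is not stated in the abstract; the snippet below uses one common macro-averaged form (geometric mean of per-label sensitivity and specificity) alongside scikit-learn's micro-F1 and Hamming Loss, purely to illustrate the three metrics named in 4).

```python
import numpy as np
from sklearn.metrics import f1_score, hamming_loss

def multilabel_g_mean(y_true, y_pred, eps=1e-12):
    """Macro-averaged geometric mean of per-label sensitivity and specificity."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = ((y_true == 1) & (y_pred == 1)).sum(axis=0)
    tn = ((y_true == 0) & (y_pred == 0)).sum(axis=0)
    fp = ((y_true == 0) & (y_pred == 1)).sum(axis=0)
    fn = ((y_true == 1) & (y_pred == 0)).sum(axis=0)
    sensitivity = tp / (tp + fn + eps)
    specificity = tn / (tn + fp + eps)
    return float(np.sqrt(sensitivity * specificity).mean())

# Binary indicator matrices: rows = documents, columns = labels.
y_true = np.array([[1, 0, 1], [0, 1, 1], [1, 1, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 1], [1, 0, 0]])
print("micro-F1:", f1_score(y_true, y_pred, average="micro"))
print("Hamming Loss:", hamming_loss(y_true, y_pred))
print("G-mean:", multilabel_g_mean(y_true, y_pred))
```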
Keywords/Search Tags:data imbalance, multi-label text classification, label correlation mining, neural network training, sequence to sequence generation model