
Research On Class Semantics And Imbalanced Distribution Methods For Multi-Label Text Classification

Posted on: 2024-08-19
Degree: Master
Type: Thesis
Country: China
Candidate: Q Lu
Full Text: PDF
GTID: 2568307103974699
Subject: Computer technology
Abstract/Summary:
With the arrival of the big data era, people need more powerful information extraction capabilities than ever before. Text classification plays an important role in many information extraction applications, such as sentiment classification and knowledge graph construction, and has received widespread attention. Compared with single-label text classification, multi-label text classification extracts more information from text and is more challenging: the number of label combinations grows exponentially, so more information is required for classification, and the combination of labels makes data imbalance harder to handle. To address the underutilization of label information and the imbalanced distribution in multi-label text classification, and to improve model performance, we propose two multi-label text classification algorithms. The main contributions are as follows:

(1) Since traditional text classification models make little use of label information, we propose a Class Semantic Embedding algorithm. Unlike methods that fuse label information in the feature extraction stage, Class Semantic Embedding observes that the classifier in the model, unlike the feature extraction part, has no prior information. Exploiting the relationship between classifier parameters and class prototypes in an end-to-end model, our algorithm uses the word-embedding features of the label text as the classifier parameters, fusing label semantic information into the classification stage from the bottom up. Class Semantic Embedding not only improves performance on multi-label classification tasks but also improves training efficiency, since it adds no extra network modules and therefore no extra computation.

(2) To address the imbalanced class distribution in the datasets, we propose a multi-label text classification algorithm based on hard example mining. Unlike traditional up-sampling, which changes the training distribution once, our Hard Example Mining algorithm uses an adaptive threshold to rebuild the hard-sample set dynamically in each epoch according to how well the model has learned, and it selects sampling objects at the sample level rather than the class level, avoiding the problems caused by samples carrying multiple labels. In addition, to reduce the model-parameter jitter caused by adjusting the training distribution, an exponential moving average is introduced to update the model parameters. Finally, because the metrics on non-hard samples decline, an ensemble method with learnable weights combines the strengths of models trained on different distributions.

(3) To verify the effectiveness of the algorithms above, experiments are conducted on the multi-label text classification datasets AAPD and Reuters-21578 against mainstream methods. On AAPD, Class Semantic Embedding achieves the best results in both Micro F1 and Hamming Loss, improving Micro F1 by 2.3% and Hamming Loss by 0.0017 over the baseline. On Reuters-21578, Class Semantic Embedding achieves the best Hamming Loss, improving Micro F1 by 0.96% and Hamming Loss by 0.0003 over the baseline. Hard Example Mining achieves the best Micro F1 and Hamming Loss on both AAPD and Reuters-21578: compared with the baseline, it improves Micro F1 by 1.59% and Hamming Loss by 0.0011 on AAPD, and Micro F1 by 1.7% and Hamming Loss by 0.0005 on Reuters-21578.
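The core idea of contribution (1), using the word embeddings of each label's text as the classifier's weight vectors (so the classifier weights act as class prototypes), can be sketched as follows. This is a minimal NumPy illustration, not the thesis's actual model: the embedding values, dimensions, and random features are made up for demonstration.

```python
import numpy as np

# Hypothetical label-text embeddings (e.g., averaged word vectors for each
# label name); in the Class Semantic Embedding scheme these serve as the
# classifier's parameters instead of randomly initialized weights.
rng = np.random.default_rng(0)
num_labels, dim = 4, 8
label_embeddings = rng.normal(size=(num_labels, dim))  # one row per class

def classify(features: np.ndarray, label_emb: np.ndarray) -> np.ndarray:
    """Score each label by the dot product between the text feature vector
    and that label's embedding (weight = class prototype), then apply a
    sigmoid so each label gets an independent probability (multi-label)."""
    logits = features @ label_emb.T           # shape: (batch, num_labels)
    return 1.0 / (1.0 + np.exp(-logits))      # element-wise sigmoid

features = rng.normal(size=(2, dim))          # two toy "document" features
probs = classify(features, label_embeddings)
print(probs.shape)  # (2, 4): one probability per label per document
```

Because the label embeddings replace (rather than augment) the classifier weights, no extra module is added, which is why the approach costs no additional computation.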
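The two mechanisms in contribution (2), an adaptive threshold for selecting hard samples each epoch and an exponential moving average (EMA) of the parameters, can be sketched as below. The concrete threshold rule (mean plus one standard deviation of the per-sample losses) is an illustrative assumption, not necessarily the rule used in the thesis.

```python
import numpy as np

def mine_hard_examples(losses: np.ndarray, k: float = 1.0) -> np.ndarray:
    """Select 'hard' samples at the sample level: those whose loss exceeds
    an adaptive threshold recomputed each epoch (here, assumed to be
    mean + k * std of this epoch's per-sample losses)."""
    threshold = losses.mean() + k * losses.std()
    return np.where(losses > threshold)[0]

def ema_update(ema_params: list, params: list, decay: float = 0.99) -> list:
    """Exponential moving average of model parameters, used to damp the
    jitter caused by changing the training distribution between epochs."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]

# Toy per-sample losses from one epoch: samples 3 and 5 stand out as hard.
losses = np.array([0.1, 0.2, 0.15, 2.5, 0.12, 1.9])
hard_idx = mine_hard_examples(losses)
print(hard_idx)  # [3 5]
```

Rebuilding the hard set every epoch, instead of resampling once up front, lets the training distribution track what the model currently finds difficult.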
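The two metrics reported in contribution (3), Micro F1 and Hamming Loss, follow their standard multi-label definitions; a small sketch on toy data (not the thesis's evaluation code):

```python
import numpy as np

def micro_f1(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Micro-averaged F1: pool true/false positives and false negatives
    over all samples and labels before computing F1 (higher is better)."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def hamming_loss(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Fraction of all (sample, label) slots predicted incorrectly
    (lower is better)."""
    return float(np.mean(y_true != y_pred))

# Two toy documents, three labels each (binary indicator matrices).
y_true = np.array([[1, 0, 1], [0, 1, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 1]])
print(micro_f1(y_true, y_pred))      # 2*2 / (2*2 + 1 + 1) = 0.666...
print(hamming_loss(y_true, y_pred))  # 2 wrong of 6 slots   = 0.333...
```

This also shows why a Hamming Loss improvement of 0.0017 is meaningful: it is measured per label slot, so small absolute changes correspond to many corrected predictions across the dataset.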
Keywords/Search Tags: Multi-label text classification, class semantic information, class prototype, hard example mining, class imbalance