| In recent years, deep learning has become the mainstream approach to multi-label, hierarchical, fine-grained classification of large-scale electronic text. However, large-scale hierarchical multi-label text classification faces two problems. (1) Confusion of label relations. In hierarchical multi-label long-text data, each label correlates differently with each part of the text, and the labels themselves are linked by both correlation and hierarchy. How to extract and exploit these label-to-label relations is a major difficulty in hierarchical text classification. (2) Class imbalance. In large-scale text data sets, the class distribution of real data is generally long-tailed. Training for multiple rounds on such data overfits the majority (many-sample) classes and underfits the minority (few-sample) classes, which degrades classification performance. How to alleviate class imbalance at the algorithm level is therefore another difficulty of hierarchical text classification. To address these two problems, this paper proposes three improved deep-learning-based hierarchical text classification algorithms that alleviate label-relation confusion and class imbalance and improve the performance of hierarchical multi-label text classification. The main work is as follows. (1) This paper proposes an improved algorithm, based on adversarial training, that decouples text feature representation from the classifier. To address the underfitting of minority classes caused by class imbalance in large-scale text data sets, text feature representation learning and classifier learning are decoupled. To suppress the overfitting of majority classes, adversarial training and different sampling schemes are used, and finally a two-stage training method is 
formed. In the first stage, the text feature representation is learned with sample-balanced (instance-balanced) sampling and adversarial training; in the second stage, the classifier parameters are adjusted with class-balanced sampling. This alleviates class imbalance and improves the robustness of the model. The improved algorithm is tested on two publicly available data sets. Compared with the original model, it improves Micro-F1 and Macro-F1 by 5.89% and 11.02% on the RCV1 data set, and by 5.01% and 4.34% on the NYTimes data set. (2) This paper proposes an algorithm that models label confusion relations with a graph convolutional network. Most true-label representations use one-hot encoding, which ignores the sibling and hierarchical relations between labels. This paper therefore builds a graph from the label co-occurrence relations in the data, uses a graph convolutional network to extract label-relation features, and integrates them into the one-hot label encoding. Fusing label features with text features mitigates label confusion and improves model performance in large-scale hierarchical text classification. The algorithm is tested on two publicly available data sets. Compared with the original model, it improves Micro-F1 and Macro-F1 by 4.62% and 9.58% on the RCV1 data set, and by 5.39% and 6.43% on the NYTimes data set. (3) This paper proposes an adversarial domain-adaptive hierarchical text classification algorithm based on maximum mean discrepancy (MMD) and correlation alignment (CORAL). To address the poor model performance and over-fitting caused by the 
difficulty of collecting data for some classes, the data set is divided into head classes and tail classes according to the amount of data. The features of the two parts are mapped to a high-dimensional space, and the maximum mean discrepancy and correlation alignment methods are used to align the feature distributions of the two domains, transferring the rich features of the head-class domain to the tail classes. This alleviates class imbalance and improves the training of the model. The algorithm is tested on two publicly available data sets. Compared with the original model, it improves Micro-F1 and Macro-F1 by 3.31% and 8.79% on the RCV1 data set, and by 6.81% and 5.71% on the NYTimes data set. |
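
The two sampling schemes used in the two-stage training of contribution (1) can be illustrated with a minimal sketch. It assumes a single-label simplification and a hypothetical helper `sampling_probs` with an exponent `q`; neither is taken from the paper. With `q = 1` each class is drawn in proportion to its size (instance-balanced sampling, first stage), and with `q = 0` every class is drawn equally often (class-balanced sampling, second stage).

```python
from collections import Counter


def sampling_probs(labels, q):
    """Per-class sampling probability p_c proportional to n_c ** q.

    q = 1 -> instance-balanced: classes are drawn in proportion to their size.
    q = 0 -> class-balanced: every class is drawn with equal probability.
    Hypothetical helper for illustration; single-label data is assumed.
    """
    counts = Counter(labels)
    weights = {c: n ** q for c, n in counts.items()}
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}


# Toy long-tailed data set: 90 head-class samples, 10 tail-class samples.
labels = ["head"] * 90 + ["tail"] * 10

stage1 = sampling_probs(labels, q=1)  # {'head': 0.9, 'tail': 0.1}
stage2 = sampling_probs(labels, q=0)  # {'head': 0.5, 'tail': 0.5}
print(stage1, stage2)
```

In a real multi-label setting a document belongs to several classes at once, so the per-class weights would have to be combined per document; that step is left out here.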
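
The idea behind contribution (2), enriching one-hot labels with label co-occurrence information, can be sketched as a single graph propagation step. This is an illustration under stated assumptions, not the paper's network: the function names are hypothetical, the graph is a binary co-occurrence graph with self-loops, and the learned weight matrix and nonlinearity of a full graph convolutional layer are omitted.

```python
import math


def cooccurrence_adjacency(label_sets, num_labels):
    """Binary adjacency: labels i and j are linked if they share a document."""
    adj = [[0.0] * num_labels for _ in range(num_labels)]
    for labels in label_sets:
        for i in labels:
            for j in labels:
                if i != j:
                    adj[i][j] = 1.0
    for i in range(num_labels):
        adj[i][i] = 1.0  # self-loop so each label keeps part of its own signal
    return adj


def normalize(adj):
    """Symmetric normalization D^{-1/2} A D^{-1/2}."""
    deg = [sum(row) for row in adj]
    n = len(adj)
    return [[adj[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def propagate(adj_norm, features):
    """One propagation step: adj_norm @ features (no weights, no activation)."""
    n, d = len(adj_norm), len(features[0])
    return [[sum(adj_norm[i][k] * features[k][c] for k in range(n))
             for c in range(d)] for i in range(n)]


# Three labels; labels 0 and 1 co-occur, label 2 always appears alone.
docs = [{0, 1}, {0, 1}, {2}]
one_hot = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

adj_norm = normalize(cooccurrence_adjacency(docs, num_labels=3))
smoothed = propagate(adj_norm, one_hot)
print(smoothed[0])  # [0.5, 0.5, 0.0]
print(smoothed[2])  # [0.0, 0.0, 1.0]
```

After propagation, the representation of label 0 carries weight on label 1, with which it co-occurs, while the isolated label 2 stays one-hot.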
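
The two alignment criteria named in contribution (3) can be written down compactly. The sketch below assumes the simplest variants: maximum mean discrepancy with a linear kernel (the squared distance between feature means) and the correlation alignment loss as the squared Frobenius distance between feature covariances, scaled by 1/(4d^2). The paper's exact kernel, feature mapping, and loss weighting are not given in the abstract, and the function names are hypothetical.

```python
def column_means(x):
    """Mean of each feature column of a list-of-rows matrix."""
    n = len(x)
    return [sum(row[j] for row in x) / n for j in range(len(x[0]))]


def linear_mmd(xs, xt):
    """Squared MMD with a linear kernel: ||mean(xs) - mean(xt)||^2."""
    ms, mt = column_means(xs), column_means(xt)
    return sum((a - b) ** 2 for a, b in zip(ms, mt))


def covariance(x):
    """Unbiased sample covariance matrix of the feature columns."""
    n, d = len(x), len(x[0])
    m = column_means(x)
    return [[sum((r[i] - m[i]) * (r[j] - m[j]) for r in x) / (n - 1)
             for j in range(d)] for i in range(d)]


def coral_loss(xs, xt):
    """CORAL loss: ||C_s - C_t||_F^2 / (4 d^2)."""
    d = len(xs[0])
    cs, ct = covariance(xs), covariance(xt)
    sq = sum((cs[i][j] - ct[i][j]) ** 2 for i in range(d) for j in range(d))
    return sq / (4 * d * d)


# Toy features: the tail domain is the head domain shifted by (1, 1).
head = [[0.0, 0.0], [2.0, 2.0]]
tail = [[1.0, 1.0], [3.0, 3.0]]

print(linear_mmd(head, tail))  # 2.0 (means differ by 1 in each dimension)
print(coral_loss(head, tail))  # 0.0 (a pure shift leaves covariance unchanged)
```

The toy example shows why the two terms complement each other: MMD reacts to the shift in means, whereas CORAL only reacts to differences in second-order statistics.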