Text classification is a fundamental task in natural language processing. It aims to learn a mapping between texts and labels from existing labeled data and then to label unseen texts. A classifier assigns labels based on features extracted from the text. Text classification is widely used in areas such as emotion recognition and information recommendation. As application scenarios multiply, text content becomes richer and classification granularity becomes finer. For example, a single article may involve several fields, such as politics, finance, and the military. Traditional single-label text classification maps a text to only one label, whereas multi-label text classification can map a text to multiple labels. Because single-label classification struggles to meet the practical need for fine-grained classification, studying multi-label text classification is of considerable practical value.

In multi-label text classification tasks, the data often contain a large number of documents and labels. Because different labels may share the same subset of documents, there are often complex semantic relationships between labels. Moreover, in long documents, some complex semantic information is hidden among noisy or redundant content and is difficult to extract. In addition, in some multi-label text datasets, a small number of labels are associated with a large number of documents, while a large fraction of tail labels are associated with only a few documents; this is the long-tailed distribution problem.

This thesis studies two problems in multi-label text classification, capturing complex label dependencies and handling the long-tailed distribution, and proposes two methods: Adaptive Label Information Learning with Statistical Features (ALISF) and the Multi-Information Filter Encoding Network (MIFEN). ALISF mainly focuses on capturing
complex label dependencies, while MIFEN incorporates label features to further address the long-tailed distribution problem. The main contributions are as follows:

(1) A Labeled Latent Dirichlet Allocation model with adaptive topic priors (LDATP) is proposed. LDATP is a supervised model that establishes a correspondence between labels and topics. Through matrix operations, it assigns different Dirichlet prior values to the topics according to each document's label set, and it uses all topics to constrain the model. This allows it to capture more precise topic-word relationships and to generate topic-word probability distributions that cover global information.

(2) A Label Information Integration Network (LIIN) is proposed. The network maps the obtained topic-word probability distributions into a vector space and uses the label graph structure to capture label dependencies, yielding enhanced label vector representations. To capture the correlation features of labels, the network builds a two-level graph convolutional neural network that can propagate higher-order information. It passes information between the nodes of a label graph constructed from label co-occurrence features, obtains higher-order semantic features from neighboring label nodes, updates the label-space vectors, and enhances the label embedding representations.

(3) A Multi-information Filter Encoder is proposed. It comprises two kinds of filter encoders, the learnable text information filter encoder (LTIFE) and the learnable label information filter encoder (LLIFE), which attenuate noise in the text space and the label space in the frequency domain and then extract complex semantic information and label correlations from the filtered features in the time domain.

(4) A document representation method guided by filtered features is proposed. This method uses the correlation
information of head labels and tail labels to guide the interaction of the filtered features, so as to capture more semantic information related to tail labels from the text features. It generates text-specific label representations that enrich the features of tail labels, and it produces the document representation by fusing the extracted text-specific label features with the text semantic features through concatenation and pooling operations.

To evaluate ALISF and MIFEN, this thesis compares them with existing methods and analyzes the experimental results across multiple evaluation metrics. The results show that the proposed methods learn the correlations among labels well, capture more semantic information related to tail labels, and effectively address the long-tailed distribution problem in multi-label text classification.
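
As an illustration of the label graph idea in contribution (2), the following is a minimal NumPy sketch of building a label co-occurrence graph and propagating label embeddings through two graph convolution steps. It is not the LIIN implementation. The toy label matrix `Y`, the random embeddings and weights, the binary edge rule, the symmetric normalization, and the reading of "two-level" as two stacked layers are all assumptions made for illustration, and every function name is hypothetical.

```python
import numpy as np

def cooccurrence_adjacency(Y):
    """Link labels i and j if they appear together in at least one document.

    Y is a (documents x labels) binary matrix. The binary edge rule is an
    assumption; weighted or thresholded co-occurrence would also be possible.
    """
    counts = Y.T @ Y                      # (labels x labels) co-occurrence counts
    A = (counts > 0).astype(float)
    np.fill_diagonal(A, 0.0)              # self-loops are added in normalize()
    return A

def normalize(A):
    """Symmetric normalization D^{-1/2} (A + I) D^{-1/2} (assumed choice)."""
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def two_layer_gcn(A_hat, X, W0, W1):
    """Two propagation steps, so each label also sees its 2-hop neighbors."""
    H1 = np.maximum(A_hat @ X @ W0, 0.0)  # layer 1 with ReLU
    return A_hat @ H1 @ W1                # layer 2, linear output

# Toy data: 4 documents, 4 labels, forming the chain 0 - 1 - 2 - 3.
Y = np.array([
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
])

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))    # initial label embeddings (random placeholders)
W0 = rng.normal(size=(8, 16))  # untrained weights, for shape illustration only
W1 = rng.normal(size=(16, 8))

A = cooccurrence_adjacency(Y)
A_hat = normalize(A)
enhanced = two_layer_gcn(A_hat, X, W0, W1)
print(enhanced.shape)
```

In this toy graph, labels 0 and 2 never co-occur directly, yet after two propagation steps the embedding of label 0 depends on that of label 2 through their shared neighbor, label 1. This is the sense in which stacked graph convolutions carry higher-order label information.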