Research On Text Multi-label Classification Algorithm Based On Label Correlation

Posted on: 2020-02-07
Degree: Master
Type: Thesis
Country: China
Candidate: T Yang
Full Text: PDF
GTID: 2428330623956472
Subject: Computer technology
Abstract/Summary:
With the continuous development and wide application of information technology in the mobile Internet era, the volume of text information is growing explosively. How to extract the information most useful to users from massive document collections has become a focus of research. Text classification, a key technique in text data mining, helps people quickly understand, organize and manage text information. According to the number of labels assigned to each document, text classification can be divided into single-label and multi-label text classification. Multi-label text classification better matches the characteristics of the real world, and the single-label problem is only a special case of it; the multi-label problem is the more general formulation. Multi-label text classification therefore has broader application scenarios, and it has important research significance and commercial value in natural language processing.

Although multi-label text classification is widely used, the complexity of representing multi-label data and the exponential size of the label output space make the problem harder to solve. Research results show that the correlation between labels can guide multi-label classification. Machine learning is developing rapidly and many solutions for text multi-label classification exist, but most of them do not consider label correlation. This paper therefore studies multi-label classification of text from the perspective of label correlation. The main work is as follows.

This paper first elaborates on the key technologies in multi-label text classification, including text preprocessing, text representation, text feature extraction and classification algorithms. On this basis, and taking the characteristics of the multi-label problem into account, it analyses the limitations of current methods for multi-label text classification. This part of the work lays the theoretical foundation for the text feature extraction and multi-label classification algorithms designed later.

To address the shortcomings of keyword extraction in current text feature extraction, this paper optimizes the TextRank algorithm and proposes a TextRank keyword extraction algorithm based on PMI weighting. The pointwise mutual information (PMI) between words measures their initial relationship and is then used to construct the transition probability matrix that describes how words influence one another. The weights of the word nodes are computed iteratively until they converge, and the final keywords are obtained by ranking the words by weight. Experiments show that the precision and recall of the proposed method are significantly higher than those of the original method, which verifies the advantage of the improved algorithm in text feature extraction.

The proposed keyword extraction algorithm is then applied to text multi-label classification. The extracted keywords are represented with word2vec, and their vectors are weighted and summed to form the vector representation of the text, which is the input of the keyword-based multi-label classification model. The labels of each training sample are likewise represented by word embeddings, and after feature fusion the resulting multi-label vector is used as the model target. The neural network is trained with cosine loss as the cost function. To predict the labels of an unlabeled document, the network output is matched against the word vectors of all labels by nearest-neighbor search, and the K labels closest to the output vector in cosine distance are taken as the predicted labels. Comparison experiments verify the stability of the method in text multi-label classification and the feasibility of semantic label extension.

This paper also considers the limited ability of keywords to represent the information in a text. It therefore uses a convolutional neural network for text feature extraction, which removes the errors introduced by the keyword extraction step. To predict the labels of an unlabeled document, the output vector of the trained model is matched against the word embedding space of all labels to obtain the multi-label classification result. Experiments verify the reliability and stability of the text multi-label classification model based on the convolutional neural network.
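
As a rough illustration of two steps described in the abstract, the following Python sketch ranks words with a PMI-weighted TextRank and then retrieves the K nearest labels by cosine similarity. It is not the thesis implementation. The function names `pmi_textrank` and `nearest_labels`, the sentence-level co-occurrence window, the decision to keep only positive PMI values as edge weights, the damping factor of 0.85 and the fixed iteration count are all assumptions made for the example.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_textrank(sentences, damping=0.85, iterations=50, top_k=3):
    """Rank words with TextRank on a graph whose edges carry positive PMI weights."""
    # Count in how many sentences each word and each word pair occurs.
    word_count, pair_count = Counter(), Counter()
    for sent in sentences:
        words = sorted(set(sent))
        word_count.update(words)
        pair_count.update(combinations(words, 2))
    n = len(sentences)

    # Edge weight = PMI of the two words; non-positive values are dropped (assumption).
    weights = {w: {} for w in word_count}
    for (a, b), c in pair_count.items():
        pmi = math.log(c * n / (word_count[a] * word_count[b]))
        if pmi > 0:
            weights[a][b] = weights[b][a] = pmi

    # Normalizing a node's outgoing weights gives its row of the transition matrix.
    out_sum = {w: sum(nb.values()) for w, nb in weights.items()}
    score = {w: 1.0 for w in weights}
    for _ in range(iterations):
        score = {
            w: (1 - damping) + damping * sum(
                weights[v][w] / out_sum[v] * score[v] for v in weights[w]
            )
            for w in weights
        }
    return sorted(score, key=score.get, reverse=True)[:top_k]


def nearest_labels(output_vec, label_vecs, k=2):
    """Return the k labels whose embeddings are most cosine-similar to output_vec."""
    def cosine(u, v):
        dot = sum(x * y for x, y in zip(u, v))
        norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
        return dot / norm if norm else 0.0

    ranked = sorted(
        label_vecs, key=lambda lab: cosine(output_vec, label_vecs[lab]), reverse=True
    )
    return ranked[:k]


# Toy data: "the" occurs in every sentence, so its PMI with any word is not positive.
toy_sentences = [
    ["the", "text", "label"],
    ["the", "text", "label"],
    ["the", "neural", "network"],
    ["the", "neural", "network"],
]
keywords = pmi_textrank(toy_sentences, top_k=4)

# Toy 2-d label embeddings standing in for word2vec vectors.
toy_labels = {"sports": [1.0, 0.0], "politics": [0.0, 1.0], "finance": [0.7, 0.7]}
predicted = nearest_labels([0.9, 0.1], toy_labels, k=2)
```

In the toy data a word that appears in every sentence ends up with no edges and therefore the lowest score, which shows how PMI weighting can push uninformative words down the ranking. In a real pipeline the label vectors would come from a trained word embedding model and the query vector from the network output.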
Keywords/Search Tags: text classification, multi-label classification, text feature extraction, TextRank