
On Multi-label Text Classification Algorithms Based On Deep Learning

Posted on: 2018-11-24
Degree: Master
Type: Thesis
Country: China
Candidate: W L Yu
GTID: 2428330566998326
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of the Internet industry, information is being generated and disseminated at unprecedented speed, and the amount of data has exploded. The Internet is flooded with text, audio, video and other types of data, and text is undoubtedly the fastest-growing and largest of these sources. To manage this mass of text effectively, so that users can search it and find what they need quickly, text categorization is the most basic and crucial technology. In traditional classification tasks, a sample usually corresponds to only one class, and off-the-shelf classification algorithms handle these single-label problems well. In real life, however, text data is complex and changeable: one sample is often associated with more than one category and belongs to more than one topic. Traditional classification algorithms have difficulty dealing with such multi-label text classification problems. Designing efficient and accurate multi-label text categorization algorithms is therefore of great practical significance and has attracted more and more attention.

Generally, multi-label text classification faces two difficulties. First, text features are high-dimensional, sparse and redundant, with few effective features. Second, the labels of one sample depend on each other and exhibit high-order correlations. The main research content of this thesis is to address these bottlenecks of traditional multi-label text categorization algorithms: to extract effective features of the text corpus using an autoencoder model, to model the interdependencies of labels effectively, and then to design and implement the ML-LSTM multi-label classification algorithm.

To address the sparsity and redundancy of text features, we use AE_P, a model combining an autoencoder with max pooling, to extract the semantic features of texts effectively. Text data is generally represented in a vector space model, so the original data dimension is generally the total number of terms in the corpus. A single sample contains only a small fraction of these terms, so its effective feature dimension is small and its representation is very sparse. The autoencoder is a non-linear feature extraction model that can be trained without supervised information. It produces an effective expression of the original sparse features in a low-dimensional space, which significantly reduces feature sparsity. Moreover, the max pooling operation effectively reduces feature redundancy. Experiments show that the features extracted by the AE_P algorithm improve the accuracy of the final classification results.

To address label correlations, this thesis proposes the ML-LSTM model, which combines the data features and labels into a data-label embedding and then employs four serialization methods, namely sample clustering, association rules, the frequency method and random initialization, to determine the ordering of the embeddings. At each time step, a long short-term memory (LSTM) network combined with a classical classification method models the embedding. The dependencies among labels are thus captured during classification, and we demonstrate the effectiveness of ML-LSTM.
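The abstract does not spell out how the frequency method orders a sample's labels before they are fed to the LSTM. The following is a minimal sketch of one plausible reading, assuming labels are ranked by how often they occur in the training set (most frequent first) and ties are broken alphabetically. The function names and the toy label sets are hypothetical and do not come from the thesis.

```python
from collections import Counter


def label_frequencies(label_sets):
    """Count how many samples carry each label."""
    counts = Counter()
    for labels in label_sets:
        counts.update(set(labels))
    return counts


def frequency_order(label_sets):
    """Global label order: most frequent first, ties broken alphabetically.

    The tie-breaking rule is an assumption made for determinism.
    """
    counts = label_frequencies(label_sets)
    return sorted(counts, key=lambda label: (-counts[label], label))


def serialize_labels(labels, order):
    """Arrange one sample's labels so they follow the global order."""
    rank = {label: position for position, label in enumerate(order)}
    return sorted(set(labels), key=lambda label: rank[label])


# Hypothetical toy corpus: each entry is the label set of one document.
train_labels = [
    {"sports", "politics"},
    {"sports"},
    {"politics", "economy", "sports"},
    {"economy"},
]

order = frequency_order(train_labels)
sequences = [serialize_labels(labels, order) for labels in train_labels]
```

Under this reading, each list in `sequences` would supply the per-time-step label order for one sample. The other three serialization methods (sample clustering, association rules, random initialization) would only change how `order` is produced.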
Keywords/Search Tags:multi-label, text classification, autoencoder, label correlations