
Classification For Chinese Short Text Based On Multi LDA Models

Posted on: 2015-12-27; Degree: Master; Type: Thesis
Country: China; Candidate: J F Guo; Full Text: PDF
GTID: 2298330422490932; Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of microblogs, WeChat ("Micro message") and other new media, the volume of Chinese short text has grown explosively, and organizing and managing this text efficiently has become a pressing problem. Theme-based text classification can reduce clutter, shorten query time, improve search quality, and make text information quick and effective to retrieve. The task is to assign one or more categories to a text according to a theme classification system. Traditional text classification algorithms based on machine learning rely on manually predefined categories and a manually labeled category corpus. For massive document collections, manual labeling is costly, and classification quality depends on the labeling. This thesis focuses on automatically constructing a text theme classification system for a large-scale document collection and on efficiently identifying the categories.

The LDA topic model is a good method for mining large text collections because it discovers text topics automatically. Some of the resulting topics are noisy: their high-frequency words are mostly random or common words and cannot represent a real text subject. Noisy topics can be filtered automatically using text information entropy, characterization-word probability coverage, characterization-word variance, and a topic independence detection algorithm. Because of the unbalanced corpus and the predefined number of categories, topics from different models complement each other.
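
The entropy criterion for noisy topics can be illustrated with a minimal sketch. The abstract does not give the exact formula or threshold, so the sketch assumes that a topic is treated as noisy when the normalized Shannon entropy of its word distribution is close to 1 (a nearly uniform distribution). The names `normalized_entropy` and `filter_noisy_topics`, the threshold `max_entropy`, and the sample topics are hypothetical, not taken from the thesis.

```python
import math

def normalized_entropy(word_probs):
    """Shannon entropy of a topic's word distribution, scaled to [0, 1]."""
    total = sum(word_probs)
    probs = [p / total for p in word_probs if p > 0]
    if len(probs) <= 1:
        return 0.0
    entropy = -sum(p * math.log(p) for p in probs)
    # Divide by the maximum possible entropy for this many words.
    return entropy / math.log(len(probs))

def filter_noisy_topics(topics, max_entropy=0.9):
    """Keep topics whose normalized entropy is at most max_entropy (assumed threshold)."""
    return {name: probs for name, probs in topics.items()
            if normalized_entropy(probs) <= max_entropy}

# Hypothetical top-word probabilities for two topics.
topics = {
    "sports": [0.70, 0.15, 0.10, 0.05],  # peaked: a few words dominate
    "noise": [0.25, 0.25, 0.25, 0.25],   # flat: no word stands out
}
kept = filter_noisy_topics(topics)
print(list(kept))
```

In this sketch the flat topic is dropped and the peaked one is kept. The other criteria named above (probability coverage, variance, topic independence) would be further filters of the same kind.
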
Using the AP clustering algorithm together with the IGP and BWP evaluation indicators, a complete text theme classification system can be built. To improve classification performance, we use a concurrent voting mechanism across multiple models, which effectively extends the text themes and improves classification accuracy and stability.

The experimental results show that the text theme classification system based on multiple LDA models improves classification accuracy and stability. We tested classification performance on WeChat ("Micro message") public account data: the optimal F-value is 0.89, higher than the 0.77 of a single LDA model and the optimal F-value of 0.72 for SVM.
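
The voting step across models can be sketched as follows. The abstract does not specify the voting rule, so this minimal example assumes simple majority voting over the labels predicted by each model, with ties going to the label seen first. The function name `vote` and the sample labels are hypothetical.

```python
from collections import Counter

def vote(predictions):
    """Return the label predicted by the most models; ties go to the label seen first."""
    if not predictions:
        raise ValueError("need at least one prediction")
    counts = Counter(predictions)
    label, _ = counts.most_common(1)[0]
    return label

# Hypothetical labels produced by three LDA-based classifiers for one short text.
model_predictions = ["sports", "finance", "sports"]
winner = vote(model_predictions)
print(winner)
```

A weighted variant, where each model's vote counts in proportion to its confidence, would be a natural extension, but the abstract does not say whether the thesis uses one.
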
Keywords/Search Tags: short text, theme, classification, LDA, multi model, high quality topic, clustering