
Research On Short Text Classification Technology Based On LDA Feature Extension

Posted on: 2020-12-05
Degree: Master
Type: Thesis
Country: China
Candidate: T Zheng
GTID: 2428330590451260
Subject: Mechanical engineering
Abstract/Summary:
With the rapid development of the Internet, the volume of information transmitted over the mobile Internet, with mobile phones and tablets as terminals, has exploded. This makes information processing extremely difficult and makes it hard for people to quickly find the data they need. If text data can be classified effectively, people can find data much more easily, analyze the classified data, and make corresponding assessments and forecasts. As a part of data mining, short text classification has been widely used in fields such as microblog hot-topic tracking and product after-sales analysis, and it has received increasing attention.

This thesis introduces the significance of short text classification and reviews the state of research in China and abroad. It briefly describes the main stages of the short text classification process (preprocessing, Chinese word segmentation, feature selection, and performance evaluation) and systematically presents the overall text classification workflow.

The feature extraction methods of traditional text classification are then analyzed. To address sparse features and severe semantic loss in short texts, a feature selection method based on LDA feature extension is proposed. An LDA model is trained on a large document set to obtain the "document-topic" and "topic-word" distributions; the words under the topic with the highest probability are selected and appended to the short text. When the number of topics is chosen by perplexity alone, the LDA model ends up with too many topics and the topics are poorly distinguishable. The thesis therefore determines the optimal number of topics from the perspective of both topic similarity and perplexity, using an indicator called Perplexity-Var.

To address the low accuracy of the support vector machine algorithm, an ensemble classification algorithm based on pairwise-constraint sampling is proposed. Pairwise-constraint sampling increases the diversity among the training sets, training sets are selected according to their degree of class dispersion, and the ensemble classification model is finally trained with the Bagging algorithm. Three sets of experiments were set up for comparison. The results show that the ensemble algorithm achieves better accuracy and generalization performance.
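
The feature-extension step described in the abstract (take the topic with the highest probability for a short text and add that topic's words to the text) might look roughly like the Python sketch below. It is an illustration under stated assumptions, not the thesis's implementation: the function name `expand_short_text`, the `top_k` cutoff, the rule of skipping words already present, and the toy probability values are all hypothetical. The distributions are hard-coded here, whereas in practice they would come from a trained LDA model.

```python
from typing import Dict, List, Sequence


def expand_short_text(
    tokens: Sequence[str],
    doc_topic: Sequence[float],
    topic_word: Sequence[Dict[str, float]],
    top_k: int = 3,
) -> List[str]:
    """Append up to top_k words from the document's most probable topic.

    Hypothetical helper: top_k and the "skip words already present" rule
    are assumptions made for this sketch, not details from the thesis.
    """
    # Index of the topic with the highest probability for this document.
    best_topic = max(range(len(doc_topic)), key=lambda t: doc_topic[t])
    # Words of that topic, ordered from most to least probable.
    ranked = sorted(
        topic_word[best_topic].items(), key=lambda kv: kv[1], reverse=True
    )
    present = set(tokens)
    extension = [word for word, _ in ranked if word not in present][:top_k]
    return list(tokens) + extension


# Toy "topic-word" distributions standing in for a trained LDA model
# (made-up numbers, for illustration only).
topic_word = [
    {"phone": 0.30, "battery": 0.25, "screen": 0.20, "charger": 0.15, "price": 0.10},
    {"match": 0.35, "team": 0.30, "goal": 0.20, "coach": 0.15},
]
# Toy "document-topic" distribution inferred for one short text.
doc_topic = [0.8, 0.2]

expanded = expand_short_text(
    ["battery", "drains", "fast"], doc_topic, topic_word, top_k=2
)
print(expanded)  # ['battery', 'drains', 'fast', 'phone', 'screen']
```

The expanded token list, not the original short text, would then be passed to feature selection and the classifier, which is how the extension is meant to reduce feature sparsity.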
Keywords/Search Tags: short text classification, LDA, feature extension, pairwise constraint, ensemble learning