Research On Short Text Classification Technology Based On LDA Feature Extension

Posted on:2020-12-05

Degree:Master

Type:Thesis

Country:China

Candidate:T Zheng

GTID:2428330590451260

Subject:Mechanical engineering

Abstract/Summary:

With the rapid development of the Internet, the volume of information transmitted over the mobile Internet, with mobile phones and tablets as terminals, has exploded. This makes information processing extremely difficult and makes it hard for people to quickly find the data they need. If text data can be classified effectively, the data becomes much easier to find, and people can analyze the classified data to make corresponding assessments and forecasts. As a part of data mining, short text classification has been widely used in microblog hotspot tracking, after-sales product analysis, and other fields, and it has received increasing attention.

This paper introduces the meaning of short text classification and analyzes its research status in China and abroad. It briefly introduces the main stages of the short text classification process (preprocessing, Chinese word segmentation, feature selection, and performance evaluation) and systematically describes the overall text classification workflow. It then analyzes the feature extraction methods of traditional text classification technology. To address the problems of sparse features and serious semantic loss, a feature selection method based on LDA feature extension is proposed: an LDA model is trained on a large document set to obtain the "document-topic" and "topic-word" distributions, and the words under the maximum-probability topic are selected and added to the short text. When the number of topics is chosen by perplexity alone, the LDA model ends up with too many topics and the topics are poorly distinguished. The optimal number of topics is therefore determined from the perspective of both topic similarity and perplexity, using the Perplexity-Var indicator.

To address the low accuracy of the support vector machine algorithm, an ensemble classification algorithm based on paired constrained sampling is proposed. The paired constrained sampling method increases the difference between training sets, training sets are selected according to the degree of class dispersion, and the ensemble classification model is finally trained with the Bagging algorithm. Three sets of experiments were set up for comparison. The results show that the ensemble algorithm has better accuracy and generalization performance.
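The feature extension step described in the abstract can be illustrated with a minimal Python sketch. It assumes that an LDA model has already been trained and that its "document-topic" and "topic-word" distributions are available; the function name `expand_short_text`, the `top_k` parameter, and the toy probabilities are hypothetical and are not taken from the thesis. The sketch only shows the selection of words under the maximum-probability topic, not LDA training or the Perplexity-Var indicator.

```python
from typing import Dict, List


def expand_short_text(
    tokens: List[str],
    doc_topic: List[float],
    topic_word: List[Dict[str, float]],
    top_k: int = 3,
) -> List[str]:
    """Append the top_k most probable words of the dominant topic.

    tokens     -- segmented words of the short text
    doc_topic  -- topic probabilities for this text (assumed given)
    topic_word -- one word->probability mapping per topic (assumed given)
    """
    # The dominant topic is the one with the highest document-topic probability.
    best_topic = max(range(len(doc_topic)), key=lambda t: doc_topic[t])

    # Rank the words of that topic from most to least probable.
    ranked = sorted(
        topic_word[best_topic].items(), key=lambda kv: kv[1], reverse=True
    )

    # Keep only words the text does not already contain, then take top_k.
    extension = [word for word, _ in ranked if word not in tokens][:top_k]
    return tokens + extension


# Toy distributions (illustrative values only, not experimental data).
topic_word = [
    {"phone": 0.30, "battery": 0.25, "screen": 0.20, "charger": 0.15, "match": 0.10},
    {"match": 0.35, "team": 0.30, "goal": 0.20, "coach": 0.10, "phone": 0.05},
]
doc_topic = [0.8, 0.2]  # the short text belongs mostly to topic 0

expanded = expand_short_text(["phone", "battery"], doc_topic, topic_word, top_k=2)
print(expanded)  # ['phone', 'battery', 'screen', 'charger']
```

In this sketch the expanded token list would then be passed to feature selection and the classifier. Skipping words that are already present is a design choice of the example, made so that the extension adds new vocabulary instead of repeating existing terms.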

Keywords/Search Tags:

Short text classification, LDA, feature extension, Paired constraint, Integrated learning

Related items

1	Research On Short Text Classification Method Based On Feature Extension
2	Short Text Classification Algorithm Of Deep-learning Based On Feature Extension
3	Extreme Short Text Classification Based On Knowledge Graph Features Extension
4	Short Text Classification Based On Integration Of Ontology And BTM Feature Extension
5	Research On Short Text Classification Of Chinese News Based On Machine Learning
6	Short Text Classification Based On Feature Extension
7	Research On Short Text Data Stream Classification Based On Feature Extension And Selection
8	Feature Extension Method for Short-text Classification Based On LDA
9	Research On Short Text Classification Based On Semantic Extension
10	Research On Short Text Classification Based On Ensemble Learning