Research On Short Text Classification Based On Topic Model

Posted on: 2022-09-27 | Degree: Master | Type: Thesis
Country: China | Candidate: H J Wang | Full Text: PDF
GTID: 2518306527470134 | Subject: Information and Communication Engineering
Abstract/Summary:
With the rapid development of network technology and continual changes in how information is disseminated, netizens can freely express their views on Weibo, WeChat, and e-commerce platforms, which has led to a rapid increase in the number of short texts. How to mine the effective themes of short texts promptly and accurately, and apply them to personalized product recommendation, public opinion monitoring, and sentiment analysis, has become a problem that information managers must solve, and short text classification has become an important research direction. Short texts are brief, contain few effective features, and express semantics insufficiently, all of which reduce classification accuracy. Topic mining, as a basic text analysis task, can uncover latent topic information in large-scale text, so it plays an important role in short text classification. This thesis uses topic models to improve short text classification in the following two respects:

A. To address the limited vocabulary of short texts, the thesis proposes using the LDA topic model to construct feature word sets between topic categories for short text feature expansion. The method relies on the text's own semantics: a short text is extended according to the semantic similarity between its original features and the constructed topic feature word set. To a certain extent, this overcomes the noise that is easily introduced when external resources are used for feature extension or when topic words are used for extension directly. Experiments on the Sogou news corpus show that expansion with feature word sets constructed between topic categories is more effective than expanding directly with document topic words, and more effective than classifying short texts represented directly with the VSM model.

B. To address the insufficient semantic expression of short texts, the thesis proposes a text representation method based on part-of-speech topic vector embedding. The method first uses the DBOW and DM models to construct the document vector, then combines the Word2vec vector, which contains context information, with the part-of-speech LDA topic vector, which carries global latent semantics, and uses part-of-speech weights to construct the topic vector. Finally, it calculates the Euclidean distance between the topic vector and the document vector to represent the short text. Experiments on two data sets, the Sogou news corpus and e-commerce reviews collected by web crawling, show that the proposed text representation performs better than other benchmark text classification methods.
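
The feature expansion idea in part A can be illustrated with a minimal sketch. It is not the thesis's implementation: the topic word sets, the toy word vectors, the cosine similarity measure, the threshold, and the function names `cosine` and `expand_short_text` are all assumptions made for illustration. In practice the topic word sets would come from a trained LDA model and the vectors from a trained embedding model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def expand_short_text(tokens, topic_word_sets, vectors, threshold=0.8, max_added=3):
    """Append topic feature words that are semantically close to the text's own words.

    A candidate word is kept only if its best similarity to any original token
    reaches `threshold` (a hypothetical cut-off, not a value from the thesis).
    """
    present = set(tokens)
    scored = []
    for topic_words in topic_word_sets.values():
        for cand in topic_words:
            if cand in present or cand not in vectors:
                continue
            best = max(
                (cosine(vectors[cand], vectors[t]) for t in tokens if t in vectors),
                default=0.0,
            )
            if best >= threshold:
                scored.append((best, cand))
    # Most similar candidates first; ties broken alphabetically for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    added = []
    for _, cand in scored:
        if cand not in added:
            added.append(cand)
        if len(added) == max_added:
            break
    return list(tokens) + added

# Toy data standing in for LDA topic word sets and trained word vectors.
vectors = {
    "phone": [0.9, 0.1, 0.0],
    "battery": [0.8, 0.2, 0.1],
    "screen": [0.85, 0.15, 0.05],
    "match": [0.0, 0.9, 0.2],
    "goal": [0.1, 0.95, 0.1],
    "team": [0.05, 0.9, 0.3],
}
topic_word_sets = {
    "tech": ["phone", "battery", "screen"],
    "sports": ["match", "goal", "team"],
}
expanded = expand_short_text(["phone", "battery"], topic_word_sets, vectors)
print(expanded)
```

With these toy vectors, only the word from the technology topic is similar enough to be appended, while the sports words are filtered out. This mirrors the stated goal of limiting the noise that direct topic-word expansion can introduce.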
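
The final step of part B, representing a short text by Euclidean distances between its document vector and the topic vectors, can be sketched as follows. The abstract does not give the exact formula for combining word vectors, topic probabilities, and part-of-speech weights, so the weighted average below, the values in `POS_WEIGHTS`, and the function names are assumptions. The document vector, which the thesis builds with the DBOW and DM models, is supplied here as a plain list.

```python
import math

# Hypothetical part-of-speech weights; the thesis does not state its values here.
POS_WEIGHTS = {"noun": 1.0, "verb": 0.6, "adj": 0.4}

def topic_vector(topic_words, word_vectors, pos_weights=POS_WEIGHTS):
    """Weighted average of word vectors for one topic.

    `topic_words` is a list of (word, pos_tag, topic_probability) tuples.
    Each word is weighted by topic_probability * pos_weight (an assumed scheme).
    """
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for word, pos, prob in topic_words:
        if word not in word_vectors:
            continue
        weight = prob * pos_weights.get(pos, 0.0)
        total = [acc + weight * x for acc, x in zip(total, word_vectors[word])]
        weight_sum += weight
    if weight_sum == 0.0:
        return total
    return [x / weight_sum for x in total]

def distance_features(doc_vec, topic_vecs):
    """Represent a document as its Euclidean distance to each topic vector."""
    return [math.dist(doc_vec, t) for t in topic_vecs]

# Toy two-dimensional word vectors standing in for trained Word2vec vectors.
word_vectors = {"phone": [1.0, 0.0], "buy": [0.0, 1.0], "goal": [3.0, 4.0]}
topics = [
    [("phone", "noun", 0.5), ("buy", "verb", 0.5)],
    [("goal", "noun", 1.0)],
]
topic_vecs = [topic_vector(t, word_vectors) for t in topics]
doc_vec = [0.0, 0.0]  # stand-in for a DBOW/DM document vector
features = distance_features(doc_vec, topic_vecs)
print(features)
```

The resulting list has one entry per topic and could be passed to any standard classifier. A smaller distance indicates that the document lies closer to that topic in the embedding space.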
Keywords/Search Tags: LDA topic model, Feature extension, Text representation, Euclidean distance, Short text classification