
Research On Short Text Classification Based On SentenceLDA Topic Model

Posted on: 2020-06-07
Degree: Master
Type: Thesis
Country: China
Candidate: H Zhang
GTID: 2428330572967214
Subject: Communication and Information System
Abstract/Summary:
Short text has become an important way for individuals to express opinions and share information on personal online platforms. The daily volume of such information, including Weibo posts, news feeds, web search queries, and forum posts, far exceeds what humans can process and understand. Short text is concise and highly condensed in meaning, and it is a rich information resource. Understanding, processing, and categorizing large amounts of short text can uncover the information that users are interested in. Short text classification is an important means of text data mining and a basic natural language processing task in information filtering, information retrieval, and user recommendation. Achieving fast, accurate, large-scale automatic short text classification is one of the active and difficult problems in natural language processing.

Short text has several defining characteristics: the text is extremely short, its content is sparse, and context co-occurrence information is insufficient; context dependency is strong; and the text is highly time-sensitive and produced at very large scale. Traditional automatic text classification techniques designed for long text are mature and widely used, but because of these characteristics they perform poorly when applied directly to short text.

To address the short length and sparse features of short text, this thesis starts by extending the original short text features: topics are inferred for each short text with a trained topic model, and the results are used to expand its features. To address the shortcomings of traditional discrete text representations, the Word2Vec tool is used to train word vectors and obtain a distributed representation of short text. The weighted word-vector representation is then combined with the extended short text features to obtain an improved short text feature representation. Finally, a Support Vector Machine (SVM) is used to perform the classification.

The main work of this thesis includes:

(1) To address feature sparseness, the original short text features are extended using the Sentence Latent Dirichlet Allocation (Sentence LDA, S-LDA) topic model, which is suited to topic mining at the short text level. The model yields the topic distribution of each short text and the topic-feature word distribution, and the original short text is expanded with topic words as feature words.

(2) An external text corpus from the same domain as the short text data set is used to train a word vector model, which provides the word-vector representation of each short text. Because word vectors cannot resolve polysemy (one word with several meanings), a weighted word-vector representation is used, in which each word vector is assigned a weight. This representation is concatenated with the short text features extended by the topic model, so that the word vector and the topic vector together form a spliced short text representation, on which the short text classification is finally performed.

The experimental results show that classifying short texts extended with topic features improves on representing short text with the Vector Space Model (VSM) alone. Introducing distributed word vectors trained with Word2Vec into the short text representation, and combining the word vector with the topic vector for feature representation and expansion, mines semantic information and sentence structure at both the "word" and "sentence" levels and further improves the accuracy of short text classification.
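The two steps described above (topic-word feature extension, then splicing a weighted word vector with a topic vector) can be sketched roughly as follows. This is an illustrative sketch only, not the thesis implementation: the toy topic words, the two-dimensional word vectors, the TF-IDF weighting scheme, and all function names are assumptions introduced here, since the abstract does not specify how the weights are computed. In practice the topic distributions would come from a trained Sentence LDA model and the word vectors from Word2Vec, and the resulting features would be passed to an SVM classifier.

```python
import math
from collections import Counter

# Hypothetical toy inputs. In the thesis these would come from a trained
# Sentence LDA model and a Word2Vec model, neither of which is reproduced here.
TOPIC_WORDS = {0: ["match", "team", "goal"], 1: ["stock", "market", "price"]}
WORD_VECTORS = {
    "team": [0.9, 0.1], "goal": [0.8, 0.2], "match": [0.85, 0.15],
    "stock": [0.1, 0.9], "market": [0.2, 0.8], "price": [0.15, 0.85],
    "wins": [0.7, 0.3], "falls": [0.3, 0.7],
}

def extend_with_topic_words(tokens, topic_dist, top_n=2):
    """Append up to top_n words of the most probable topic to the tokens."""
    best_topic = max(range(len(topic_dist)), key=lambda k: topic_dist[k])
    extra = [w for w in TOPIC_WORDS[best_topic] if w not in tokens][:top_n]
    return tokens + extra

def idf_weights(corpus):
    """Smoothed inverse document frequency for each word in the corpus."""
    n_docs = len(corpus)
    doc_freq = Counter(w for doc in corpus for w in set(doc))
    return {w: math.log((1 + n_docs) / (1 + df)) + 1.0
            for w, df in doc_freq.items()}

def weighted_text_vector(tokens, idf):
    """TF-IDF-weighted average of word vectors (assumed weighting scheme).
    Words without a vector are skipped; words without an IDF get weight 1."""
    dim = len(next(iter(WORD_VECTORS.values())))
    total, weight_sum = [0.0] * dim, 0.0
    for word, tf in Counter(tokens).items():
        if word not in WORD_VECTORS:
            continue
        weight = tf * idf.get(word, 1.0)
        total = [t + weight * v for t, v in zip(total, WORD_VECTORS[word])]
        weight_sum += weight
    return [t / weight_sum for t in total] if weight_sum else total

def spliced_representation(tokens, topic_dist, idf):
    """Concatenate the weighted word vector with the topic distribution."""
    extended = extend_with_topic_words(tokens, topic_dist)
    return weighted_text_vector(extended, idf) + list(topic_dist)

corpus = [["team", "wins"], ["stock", "falls"]]
topic_dists = [[0.8, 0.2], [0.1, 0.9]]  # assumed per-text topic distributions
idf = idf_weights(corpus)
features = [spliced_representation(doc, dist, idf)
            for doc, dist in zip(corpus, topic_dists)]
```

Each entry of `features` holds the weighted word-vector components followed by the topic distribution, so its length is the word-vector dimension plus the number of topics. Simple concatenation keeps the two sources of information separate and leaves it to the downstream classifier to weigh them.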
Keywords/Search Tags:Short text classification, SentenceLDA, Topic model, Feature extension, Word embedding