Research On Short Text Classification Method Based On Semi-Supervised BTM Model

Posted on: 2020-10-01
Degree: Master
Type: Thesis
Country: China
Candidate: X Zhao
Full Text: PDF
GTID: 2428330599960553
Subject: Engineering
Abstract/Summary:
With the development of e-commerce and the growth of social networking platforms under Web 3.0, a large amount of information now appears online in short-text form, such as microblog posts, product reviews, and news headlines. In research on short text classification, probabilistic topic models are often used to mine the topics of short texts. Because a short text contains few words and little information, probabilistic topic models suffer from feature sparsity when processing short text data. Classifying such short texts effectively is therefore a current research hotspot. First, because the title of a scientific paper has only a dozen or so words, common word segmentation tools do not segment it well, so the final classification result is not good enough. To address this, commonly used domain-specific phrases are collected into an artificial intelligence domain dictionary, and the dictionary is added during segmentation to improve classification. Second, when the BTM model is used for text classification, its assignment of documents to classes is not clear. Based on the BTM model and the idea of semi-supervised learning, a semi-supervised BTM (SSBTM) model is proposed. Because label text must be added, two to four seed words are set, the seed words are expanded using Word2Vec and TF-IDF to obtain label words, and the label words are added to the model input as label text to construct the semi-supervised BTM model. Third, the word pair extraction method in the BTM model combines a large number of invalid word pairs, which leads to long iteration times, so an improved word pair extraction algorithm is proposed. The algorithm is a frequent itemset mining algorithm based on semantic analysis. The WordNet dictionary is used to build a synonym dictionary. Then, because only frequent 2-itemsets (word pairs) need to be extracted, a frequent itemset mining algorithm is constructed to extract the word pairs. Finally, comparative experiments verify the effectiveness of adding the domain dictionary, of the SSBTM model, and of the SSBTM model with improved word pair extraction. The first two experiments show that adding the domain dictionary and using the SSBTM model both improve classification accuracy. The third experiment shows that the SSBTM model with improved word pair extraction runs faster.
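The idea of restricting word pair extraction to frequent 2-itemsets can be illustrated with a minimal Python sketch. This is not the thesis's algorithm: the function names `extract_biterms` and `frequent_biterms`, the `min_support` threshold, and the toy documents are all assumptions made for illustration, and the WordNet-based synonym handling described in the abstract is omitted. The sketch only shows how counting the documents in which each unordered word pair occurs, and discarding rare pairs, reduces the number of word pairs passed to the topic model.

```python
from collections import Counter
from itertools import combinations


def extract_biterms(doc_tokens):
    """Return every unordered word pair (biterm) from one short document.

    Tokens are de-duplicated and sorted so that each pair has a single
    canonical form, e.g. ("model", "topic") rather than ("topic", "model").
    """
    unique = sorted(set(doc_tokens))
    return list(combinations(unique, 2))


def frequent_biterms(docs, min_support=2):
    """Keep only biterms that occur in at least `min_support` documents.

    `min_support` is a hypothetical threshold; a pair seen in fewer
    documents is treated as an invalid (infrequent) word pair and dropped.
    """
    counts = Counter()
    for tokens in docs:
        counts.update(extract_biterms(tokens))
    return {pair: n for pair, n in counts.items() if n >= min_support}


# Toy corpus of already-segmented short texts (illustrative only).
docs = [
    ["semi", "supervised", "topic", "model"],
    ["topic", "model", "short", "text"],
    ["short", "text", "classification"],
]

frequent = frequent_biterms(docs, min_support=2)
print(frequent)
```

On this toy corpus, 15 candidate pairs are generated but only the pairs that occur in two documents survive the filter, which is the kind of reduction that would shorten each sampling iteration.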
Keywords/Search Tags:semi-supervised, short text classification, LDA model, BTM model