
Research On Short Text Topic Model Based On Semantic Information And Word Triangle

Posted on: 2020-09-23
Degree: Master
Type: Thesis
Country: China
Candidate: W Jing
Full Text: PDF
GTID: 2428330575458241
Subject: Computer technology
Abstract/Summary:
With the accelerating pace of social development and the "short and fast" user experience brought by smart mobile terminals, people's communication on the network is becoming more and more fragmented. As a result, short text data plays an increasingly important role in online information exchange. Social network statuses, microblog posts, traditional news headlines, short video titles and question-and-answer websites are all forms of short text. At the same time, with the rise of very large platforms such as microblogging services, Zhihu, Facebook and Twitter, short text data is generated and accumulated at great speed. Mining topic information from massive short text data is therefore of great value, with applications that include public opinion analysis, information retrieval, personalized recommendation and user interest clustering. However, mining topic information from short texts with traditional text mining methods is very difficult, mainly because word co-occurrence information in short texts is very sparse. To obtain more feature information from short texts, scholars have proposed various improved models, but most of them ignore the semantic relationships between words. To address this problem, this thesis first proposes a short text topic model based on prior knowledge of semantic information and word frequency information. On this basis, the structure of the topic unit is studied, and a semantic word triangle topic model is proposed. The main work of this thesis is as follows:

1) To address the problem that traditional topic models treat word pairs of different importance equally, this thesis assumes that the more closely two words are related semantically, the more likely they are to belong to the same topic. On this basis, it measures the semantic similarity of words using word embeddings trained on an external corpus, and incorporates this similarity into the prior knowledge of the topic-word distribution, so that the model pays more attention to word pairs with greater semantic similarity.

2) To address the problem that high-frequency words degrade topic quality in traditional topic models, this thesis assumes that words appearing in most documents have a weak ability to represent a topic. Based on this hypothesis, it introduces IDF and semantic similarity as prior knowledge of the word distribution, alleviating the impact of high-frequency words on topic quality. Building on the BTM model, an improved model, WEI-BTM, is proposed, which improves on the performance of the traditional topic model.

3) Because conventional word co-occurrence networks neglect word pairs that are closely related semantically but rarely co-occur, this thesis proposes a new method for constructing a semantic word network, which lets the network capture the topical links between words more comprehensively. On top of this network, a more tightly connected basic unit, the semantic word triangle, is proposed. On this basis, the SWTTM short text topic model is proposed.

4) This thesis also carries out two comparative experiments on two real-world Chinese short text datasets against three traditional baseline models. The experimental results show the superiority of the SWTTM model in short text topic mining.
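The ideas in points 2) and 3) above can be illustrated with a small sketch. The abstract does not give formulas, so everything specific here is an assumption made for illustration: the function names, the `sim_threshold` value, the rule "link two words if they co-occur in a document or if their embedding cosine similarity reaches the threshold", and the toy two-dimensional vectors. It is not the WEI-BTM or SWTTM implementation; it only shows how an embedding-based edge can complete a word triangle that co-occurrence alone would miss, and how IDF down-weights a word that appears in most documents.

```python
import math
from collections import defaultdict
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def idf(docs):
    """Inverse document frequency, log(N / df), for every word in docs."""
    df = defaultdict(int)
    for doc in docs:
        for word in set(doc):
            df[word] += 1
    return {word: math.log(len(docs) / count) for word, count in df.items()}


def semantic_word_network(docs, vectors, sim_threshold=0.8):
    """Build undirected edges between words.

    Assumed rule (not taken from the thesis): two words are linked if they
    co-occur in at least one document, or if the cosine similarity of their
    embeddings is at least sim_threshold.
    """
    edges = set()
    for doc in docs:
        for a, b in combinations(sorted(set(doc)), 2):
            edges.add((a, b))
    for a, b in combinations(sorted(vectors), 2):
        if cosine(vectors[a], vectors[b]) >= sim_threshold:
            edges.add((a, b))
    return edges


def word_triangles(edges):
    """Return every set of three mutually linked words, as sorted tuples."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    triangles = set()
    for a, b in edges:
        for c in neighbours[a] & neighbours[b]:
            triangles.add(tuple(sorted((a, b, c))))
    return sorted(triangles)


# Toy data (hypothetical): "apple" and "cherry" never co-occur,
# but their made-up embeddings are close.
docs = [["apple", "banana"], ["banana", "cherry"]]
vectors = {"apple": [1.0, 0.0], "banana": [0.0, 1.0], "cherry": [0.9, 0.1]}

print(word_triangles(semantic_word_network(docs, vectors)))
print(idf(docs))
```

In this toy corpus, co-occurrence alone yields only the edges apple-banana and banana-cherry, so no triangle exists. Adding the embedding-based edge apple-cherry closes the triangle. The word "banana" appears in every document, so its IDF is zero, which matches the assumption that words appearing in most documents represent topics weakly.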
Keywords/Search Tags:Short Text, Topic Model, Word Network, Word Triangle, Word Embedding