
Short Text Topic Modeling Research Based On The Semantic Extension Of Knowledge Graph

Posted on: 2021-02-04
Degree: Master
Type: Thesis
Country: China
Candidate: D W Zhao
GTID: 2428330629952684
Subject: Computer software and theory
Abstract/Summary:
The Internet is an important means of obtaining information, and devices such as mobile phones and computers have become an inseparable part of people's lives. Online text is now one of the main ways people obtain and disseminate information, and the volume of text data is growing explosively. How to mine the regularities and hidden topic structure of text data has become a hot issue in machine learning. Topic models are widely used on text and can effectively uncover the latent topic structure in data. However, on extremely short texts (such as social media posts), traditional topic models suffer from severe sparsity because each text contains few words and little contextual information, so their modeling quality on short texts is usually poor. A growing number of researchers are studying how to compensate for this data sparsity. Although most existing models expand the text content by various means, they ignore the semantic relations between words and treat each word in a text as independent. In practice, the knowledge people already hold is also very important for understanding the meaning of a text. The semantic relations between words can therefore be used to find which words are more likely to belong to the same topic, and word co-occurrence information can be added to extend the short text representation.

Knowledge graphs such as WordNet are currently popular and contain rich semantic relations between words, including high-quality synonymy and subordinate (hypernym-hyponym) relations. They have been applied with good results in long text topic modeling, but not yet in short text topic modeling. This thesis uses the additional semantic knowledge in WordNet as auxiliary information in the sampling process and proposes WRDMM (WordNet base Dirichlet Multinomial Mixture), a topic modeling method based on the semantic extension of a knowledge graph, which incorporates semantic information into the topic model to increase word co-occurrence at the document level.

The specific work of this thesis is as follows. To obtain the semantic features among words and enrich the word co-occurrence information of each document, the WRDMM model calculates the degree of correlation between words from the semantic relations in WordNet, finds the sets of words that are more likely to belong to the same topic, and combines these additional semantic features with the Dirichlet Multinomial Mixture (DMM) model. First, WRDMM mines semantics from two perspectives, word neighborhood structure correlation and word similarity, and computes a semantic weight matrix over words using either neighbor similarity or Leacock-Chodorow (Lch) similarity. Then, during training, the probability of similar words appearing under a given topic is adjusted according to the semantic weight matrix and the strength of association between words and topics, and the co-occurrence frequencies of the current word and its semantically similar words are updated, so that semantics are merged into the parameter estimation process.

Four well-known large short text corpora from the NLP field are selected, and the WRDMMNS and WRDMMLCH models, obtained from the two semantic similarities respectively, are compared with three representative baseline models. Experimental results show that the proposed models outperform the baselines in classification and clustering and can extract high-quality topic information from short texts, which also demonstrates the feasibility of combining a knowledge graph with a short text topic model. Of the two, the model extended by neighborhood structure is more effective, while the model extended by Lch similarity is better suited to data sets with more concentrated topics.
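As a rough illustration of the two steps described above (a semantic weight matrix from Leacock-Chodorow similarity, then crediting semantically similar words when topic counts are updated), here is a minimal Python sketch. It is not the thesis's implementation. The toy taxonomy `PARENT` stands in for WordNet, and the function names, the `threshold` cutoff, the `scale` factor, and the exact count-update rule are all assumptions made for illustration. The similarity follows the standard Leacock-Chodorow form, -log(p / (2D)), where p is the number of nodes on the shortest path between two words and D is the taxonomy depth.

```python
import math
from collections import defaultdict

# Hypothetical toy hypernym taxonomy (child -> parent); a stand-in for WordNet.
PARENT = {
    "entity": None,
    "animal": "entity",
    "vehicle": "entity",
    "dog": "animal",
    "cat": "animal",
    "car": "vehicle",
}


def path_to_root(word):
    """Return the nodes from `word` up to the root, inclusive."""
    path = []
    while word is not None:
        path.append(word)
        word = PARENT[word]
    return path


# Taxonomy depth D, counted in nodes along the longest leaf-to-root chain.
MAX_DEPTH = max(len(path_to_root(w)) for w in PARENT)


def lch_similarity(a, b):
    """Leacock-Chodorow similarity: -log(p / (2 * D)), p = nodes on the shortest path."""
    steps_from_a = {node: i for i, node in enumerate(path_to_root(a))}
    for steps_from_b, node in enumerate(path_to_root(b)):
        if node in steps_from_a:  # first shared ancestor is the lowest one
            edges = steps_from_a[node] + steps_from_b
            return -math.log((edges + 1) / (2.0 * MAX_DEPTH))
    raise ValueError("words share no common ancestor")


def semantic_weights(vocab, threshold):
    """Sparse weight matrix keeping pairs whose similarity reaches `threshold` (assumed cutoff)."""
    weights = defaultdict(dict)
    for a in vocab:
        for b in vocab:
            if a != b:
                sim = lch_similarity(a, b)
                if sim >= threshold:
                    weights[a][b] = sim
    return weights


def add_document(topic_word_counts, doc, weights, scale=0.3):
    """Add a document's words to one topic and partially credit similar words.

    The `scale * sim` pseudo-count is an assumed rule, not the thesis's formula.
    """
    for w in doc:
        topic_word_counts[w] += 1.0
        for v, sim in weights.get(w, {}).items():
            topic_word_counts[v] += scale * sim


vocab = ["dog", "cat", "car"]
weights = semantic_weights(vocab, threshold=0.5)
counts = defaultdict(float)
add_document(counts, ["dog"], weights)
print(dict(weights))
print(dict(counts))
```

In this toy setup "dog" and "cat" share the parent "animal", so their similarity passes the cutoff and "cat" receives a fractional pseudo-count whenever "dog" is assigned to a topic. "car" is too distant and receives none. In a full DMM sampler these adjusted counts would feed into the topic assignment probabilities for each short text.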
Keywords/Search Tags:Topic Model, WordNet, Short Text Classification, DMM