
The Research On Topic Modeling In Short-Text Scenarios With Auxiliary Semantic Information

Posted on: 2020-04-03
Degree: Doctor
Type: Dissertation
Country: China
Candidate: H Y Lu
Full Text: PDF
GTID: 1368330578463139
Subject: Computer Science and Technology

Abstract/Summary:
Topic modeling, which belongs to the research area of text mining, can effectively discover the topics of text data. It has been widely applied in research domains such as information retrieval and natural language processing. In earlier years, topic modeling produced many successful theoretical studies and practical applications in long-text scenarios. However, with the emergence of online social networks, short-text data have become increasingly common, bringing new challenges for topic modeling in short-text scenarios. First, in short-text scenarios each document consists of only a few words and lacks sufficient context. This causes a sparsity problem for conventional topic models: applying them directly may fail to produce high-quality topics. Short-text topic modeling has therefore become a new research branch in recent years. Existing studies mostly focus on improvements to the data or to the models; these approaches generalize poorly and make little use of semantic information, so further research is needed to overcome these drawbacks. Moreover, conventional topic models typically represent a topic as a set of unigrams, which can lead to ambiguity and poor readability, so the display of topics is also an important research problem. Existing studies often display topics with bigrams, but most suffer from high complexity and poor universality, so research to improve topic readability is also necessary.

To solve the sparsity problem brought by short-text data, this dissertation studies topic modeling in short-text scenarios from two perspectives: modeling topics based on single words and modeling topics based on word pairs. All of the proposed models exploit auxiliary semantic information learned from word embeddings or language models. To improve the readability of topics, this dissertation also introduces a bigram generation algorithm for displaying topics. In total, this dissertation presents three pieces of work and four models based on the strategies above. The main contributions are:

1. Constructing pseudo documents with semantic information

In short-text scenarios, it is difficult to obtain high-quality topics by directly applying conventional topic models because there are not enough context words for training. One strategy is to model topics based on each word in the vocabulary: with the topical information of all words, the topic distribution over each original document can be inferred, so learning the topic distribution of each word becomes a central research goal. This dissertation designs a topic model based on pseudo documents, named SEMIPS. SEMIPS constructs a pseudo document for each word that describes the word's semantic information, so the topic distribution over the pseudo document can be treated as the topic distribution over the word itself. To construct pseudo documents, SEMIPS draws on the semantic information provided by word embeddings and exploits the similarity between words. Experiments covering model performance, ablation studies, and the utility of word2vec show that SEMIPS solves the sparsity problem effectively.
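The abstract does not give the exact construction procedure, but the core idea of building a pseudo document from a word's embedding neighbors can be sketched as follows. This is a minimal illustration, assuming embeddings are available as a plain word-to-vector mapping; the function names and the top_k cutoff are hypothetical choices, not the dissertation's actual design.

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def build_pseudo_documents(embeddings, top_k=10):
    """For each word, gather its top_k most similar words (by embedding
    cosine similarity) into a pseudo document that supplies the context
    the short texts themselves lack."""
    vocab = list(embeddings)
    pseudo_docs = {}
    for w in vocab:
        sims = sorted(((v, cosine(embeddings[w], embeddings[v]))
                       for v in vocab if v != w),
                      key=lambda x: x[1], reverse=True)
        pseudo_docs[w] = [v for v, _ in sims[:top_k]]
    return pseudo_docs

# Toy usage: random vectors stand in for trained word2vec embeddings.
rng = np.random.default_rng(0)
emb = {w: rng.normal(size=50) for w in ["phone", "mobile", "call", "app", "game"]}
print(build_pseudo_documents(emb, top_k=2))
```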
2. Involving global biterms for modeling topics

Modeling topics based on each word may introduce noisy data, whereas modeling topics based on word pairs avoids this. The strategy rests on the following idea: word pairs are plentiful in a corpus, so the topic distributions over the original corpus can be inferred from the topic distributions over word pairs, which also alleviates the sparsity problem. The strategy assumes that two semantically similar words share the same probability of generating the same topic. Existing work extracts words that co-occur in a local context as word pairs, but local co-occurrence alone is insufficient: two synonyms, for example, rarely co-occur, so it is better to discover additional semantically similar word pairs for training. In this work, the semantic similarity between words is exploited to find all semantically similar word pairs across the global corpus, which improves the quality of topics. This model is called GloSS. Experiments covering model performance, the utility of different word embeddings, and more show the necessity and effectiveness of involving global biterms with semantic information.

3. Involving quantifiable relationships between words for modeling topics

Modeling topics on word pairs effectively solves the sparsity problem, but it relies only on the co-occurrence of word pairs and ignores the relationships between words. This dissertation therefore introduces a model called RIBS, which brings quantifiable relationships between words into the topic-learning procedure. RIBS is designed so that two more closely related words have a higher probability of generating the same topic. The relationships are incorporated as prior knowledge, including the generation relationship between words and the significance of words: the output of a language model trained with recurrent neural networks represents the generation relationship, and the inverse document frequency represents the significance. Furthermore, a RIBS-Bigrams topic model is proposed by introducing a topical bigram generation algorithm, so that topics can be displayed with phrases. This work solves the sparsity problem of short texts and improves the readability of topics at the same time. Experiments covering topic evaluation and document-characterization performance show the effectiveness of the proposed models. The positive effect of language models on topic discovery provides a new strategy for optimizing topic models with the help of NLP algorithms.
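To make the biterm-based strategies concrete, the sketch below shows (a) mining semantically similar word pairs across the whole vocabulary rather than only a local co-occurrence window, in the spirit of GloSS, and (b) computing inverse document frequency as the word-significance prior that RIBS uses. The similarity threshold and function names are illustrative assumptions; the recurrent-neural-network language model that supplies RIBS's generation relationship is omitted here.

```python
import math
import numpy as np

def global_biterms(embeddings, threshold=0.7):
    """Keep every vocabulary pair whose embedding cosine similarity
    exceeds the threshold, so that synonyms which rarely co-occur in a
    local window can still serve as training biterms."""
    vocab = list(embeddings)
    pairs = []
    for i, w1 in enumerate(vocab):
        for w2 in vocab[i + 1:]:
            u, v = embeddings[w1], embeddings[w2]
            sim = float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
            if sim >= threshold:
                pairs.append((w1, w2, sim))
    return pairs

def idf_significance(corpus):
    """Inverse document frequency as a word-significance score: words
    that appear in fewer documents receive larger weights."""
    n_docs = len(corpus)
    df = {}
    for doc in corpus:
        for w in set(doc):
            df[w] = df.get(w, 0) + 1
    return {w: math.log(n_docs / d) for w, d in df.items()}

# Toy usage on a tiny tokenized corpus.
docs = [["cheap", "phone", "deal"], ["new", "phone", "app"], ["game", "app"]]
print(idf_significance(docs))
```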
Keywords/Search Tags: Topic Modeling, Short Text, Word Embeddings, Recurrent Neural Networks