Topic Extraction Algorithm Based On NP-Chunking And Phrase Weight Calculation

Posted on: 2015-02-14
Degree: Master
Type: Thesis
Country: China
Candidate: M M Sun
Full Text: PDF
GTID: 2298330467979321
Subject: Information and Communication Engineering
Abstract/Summary:
As the Internet has expanded, it has become the main way people obtain information. However, the information explosion also means that people spend more time finding the information they want. Information fatigue, information anxiety, and information overload have become new problems of the information age. We need new computational tools to help organize, search, and understand these vast amounts of information. Topic models are an effective means of integrating information, and have therefore become a research hotspot in Natural Language Processing.

In this thesis we concentrate on probabilistic topic models, and introduce noun phrase extraction and phrase weight calculation into the topic model in order to improve its performance.

First, we introduce noun phrase extraction into the topic model. Through POS tagging and sentence structure analysis, we obtain noun phrases from text files. We treat each noun phrase as a single unit, so that every word in the phrase is generated from the same topic.

Second, we propose an algorithm that calculates phrase weights based on a semantic network, in order to reduce noise in text files. We build a semantic network for every text file, and then use the idea of a Markov random walk to calculate the transition probabilities between nodes. After removing a single node, we recalculate the transition probabilities between nodes. The difference between the two sets of transition probabilities is taken as the weight of the removed node, and is therefore also a kind of weight for the corresponding phrase in the text file.

Finally, we combine noun phrase extraction and phrase weight calculation. We first extract noun phrases from the text files, and then calculate the phrase weights within every text file. After this preprocessing, the files can be fed to topic models, and meaningful topics can be extracted from the corpus.
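The node-removal idea behind the phrase weight can be illustrated with a minimal sketch. The abstract does not specify how the semantic network is built or how the "difference" between transition probabilities is measured, so everything concrete here is an assumption: the network is a sentence-level co-occurrence graph, transition probabilities are one-step and row-normalised, and the difference is the sum of absolute changes over all remaining rows. The function names (`build_network`, `transition_probs`, `phrase_weights`) are hypothetical and not taken from the thesis.

```python
from collections import defaultdict
from itertools import combinations


def build_network(sentences):
    """Assumed stand-in for the semantic network: an undirected graph whose
    edge weight counts the sentences in which two phrases co-occur."""
    graph = defaultdict(lambda: defaultdict(float))
    for phrases in sentences:
        for a, b in combinations(sorted(set(phrases)), 2):
            graph[a][b] += 1.0
            graph[b][a] += 1.0
    return graph


def transition_probs(graph, removed=None):
    """One-step random-walk probabilities (row-normalised edge weights),
    optionally computed with one node deleted from the network."""
    probs = {}
    for src, nbrs in graph.items():
        if src == removed:
            continue
        kept = {dst: w for dst, w in nbrs.items() if dst != removed}
        total = sum(kept.values())
        probs[src] = {dst: w / total for dst, w in kept.items()} if total else {}
    return probs


def phrase_weights(graph):
    """Weight of a node = total absolute change in the remaining nodes'
    transition probabilities when that node is removed (assumed measure)."""
    base = transition_probs(graph)
    weights = {}
    for node in graph:
        reduced = transition_probs(graph, removed=node)
        diff = 0.0
        for src, row in reduced.items():
            # Compare every destination seen before or after the removal;
            # mass that used to flow to the removed node counts as change.
            for dst in set(row) | set(base[src]):
                diff += abs(row.get(dst, 0.0) - base[src].get(dst, 0.0))
        weights[node] = diff
    return weights


# Toy document: each inner list holds the noun phrases found in one sentence.
sentences = [
    ["topic model", "noun phrase"],
    ["topic model", "semantic network"],
    ["topic model", "phrase weight"],
]
weights = phrase_weights(build_network(sentences))
```

In this toy document, "topic model" co-occurs with every other phrase, so removing it changes the random walk the most and it receives the largest weight. Low-weight phrases would then be candidates for down-weighting or filtering before the text is passed to a topic model.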
Keywords/Search Tags: Topic model, Noun phrase extraction, Semantic network, Phrase weight, Topic intensity, Generalization ability