Research On Keyword Extraction Based On Latent Topic Model And New Word Discovery

Posted on: 2015-07-13  Degree: Master  Type: Thesis
Country: China  Candidate: M Yuan  Full Text: PDF
GTID: 2298330467463288  Subject: Computer technology
Abstract/Summary:
With the development of network technology, the volume of information is growing explosively, so finding the desired information quickly in massive data is particularly important. Keywords help readers grasp the topics of an article and give users an important basis for filtering information. Keyword extraction is widely used in information retrieval, text classification, content recommendation, and related fields.

Traditional keyword extraction methods consider only the external statistical information of words and neglect the effect that topic information and the internal structure of an article have on extraction performance. As a result, the extracted keywords may cover too narrow a range of topics or be unrelated to the topic of the article. To deal with these problems, this paper presents a keyword extraction method that uses latent topic models and new word discovery. The specific research work is as follows.

First, a keyword extraction method based on a latent topic model is proposed, in which both the topic information and the structure information of an article are used. The topic information is obtained with topic models, through which all the words of an article are mapped into the topic space. To capture the internal structure of an article, a document network is constructed with a co-occurrence window. The method then extracts keywords by combining the PageRank model with the small-world network model. Experimental results show that the method makes effective use of the topic and internal structure information of articles, and that the extracted keywords have better relevance to, and coverage of, the topic of the article than those produced by the TF-IDF-based method.

Second, a keyword extraction method based on new word discovery is proposed, which improves the readability of the keywords. Chinese word segmentation is the first stage of keyword extraction, so its performance directly affects the extraction results, and the most important issue in word segmentation is the identification of new words. In this method, new words are found in the corpus by statistical methods, which prevents them from being split incorrectly during word segmentation. New word discovery also allows words to be combined into phrases, which are more expressive and make the keywords more readable. Experimental results show that this method effectively improves system performance.
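
The graph-based ranking step mentioned in the abstract (a document network built with a co-occurrence window, scored with PageRank) can be illustrated with a minimal sketch. This is not the thesis's implementation: it uses plain, unweighted PageRank and leaves out the topic-model weighting and the small-world network component. The function names, the window size, the damping factor, and the iteration count are all assumptions chosen for illustration.

```python
from collections import defaultdict


def build_cooccurrence_graph(tokens, window=3):
    """Build an undirected graph linking distinct words that fall in the same window.

    A window of size `window` covers a token and the `window - 1` tokens after it.
    (The window size is an illustrative assumption, not a value from the thesis.)
    """
    graph = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + window]:
            if other != word:
                graph[word].add(other)
                graph[other].add(word)
    return graph


def pagerank(graph, damping=0.85, iterations=50):
    """Run plain PageRank by power iteration on an undirected graph.

    Every node in `graph` has at least one neighbour, so there are no
    dangling nodes to handle and the scores keep summing to 1.
    """
    nodes = list(graph)
    n = len(nodes)
    scores = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_scores = {}
        for node in nodes:
            # Each neighbour shares its score equally among its own neighbours.
            incoming = sum(scores[nb] / len(graph[nb]) for nb in graph[node])
            new_scores[node] = (1 - damping) / n + damping * incoming
        scores = new_scores
    return scores


def extract_keywords(tokens, top_k=3, window=3):
    """Return the `top_k` highest-scoring words as candidate keywords."""
    graph = build_cooccurrence_graph(tokens, window)
    scores = pagerank(graph)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

For Chinese text, `tokens` would be the output of a word segmenter, ideally after newly discovered words have been added to its vocabulary so that they are not split apart before the graph is built.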
Keywords/Search Tags: keyword extraction, topic models, new word discovery, PageRank, small-world network