Keyphrase Extraction Using LDA Topic Models

Posted on: 2017-06-26
Degree: Master
Type: Thesis
Country: China
Candidate: X J Liu
Full Text: PDF
GTID: 2348330485462238
Subject: Computer Science and Technology
Abstract/Summary:
In the big data era, a large amount of information is available on the Internet, and capturing the key information accurately and rapidly has become important. Massive amounts of textual data are created every day, including web news, research papers, and micro-blogs. It is hardly possible to process this volume of data manually. Keyphrases, by contrast, can summarize the topics of articles efficiently. They help people understand the content of articles and grasp key information quickly. Keyphrases are the minimum units that represent the core content of a document. They play an important role in many areas, such as automatic document summarization, web information extraction, document classification and clustering, and information retrieval. However, traditional manual labeling of keyphrases costs time and labor. Therefore, it is necessary to design algorithms that can extract keyphrases automatically.

In this dissertation, we focus on the study of keyphrase extraction algorithms. Topic models are introduced to extract keyphrases. Our main contributions are as follows.

Firstly, we summarize the related work on keyphrase extraction from various aspects, including corpus tagging, features of extracted phrases, and length of the text.

Secondly, we introduce topic models, considering that traditional keyphrase extraction methods ignore the relationships between keyphrases and their document. Keyphrases are extracted by combining n-grams and LDA topic models. As keyphrases should cover the topics of a given document as much as possible and represent its core content, our proposed approach extracts good keyphrases from a given document by combining the topic distribution with statistical features.

Thirdly, we propose a graph-based keyphrase extraction approach using the LDA topic model. Based on the TextRank graph model, the phrases of a given document are used as the vertices of an undirected graph. Then the LDA topic model is used to calculate the weights of the edges in the graph. Finally, the top-K scored phrases are selected as the keyphrases of the document. Experimental results show that the proposed algorithm outperforms the baselines.
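The graph-based approach can be illustrated with a minimal Python sketch. It is not the dissertation's implementation: the abstract does not say how LDA determines the edge weights, so the sketch assumes cosine similarity between per-phrase topic distributions. The function names, the co-occurrence pairs, and the example topic vectors are all hypothetical; in practice the distributions would come from a trained LDA model.

```python
import math


def cosine(p, q):
    """Cosine similarity between two topic distributions."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0


def rank_phrases(phrase_topics, cooccurring, top_k=3, damping=0.85, iters=50):
    """Weighted TextRank over an undirected phrase graph.

    phrase_topics: dict mapping phrase -> topic distribution (assumed to be
                   produced by an LDA model; hard-coded in the example below).
    cooccurring:   pairs of phrases that co-occur in the document.
    Edge weight is the cosine similarity of the two topic distributions
    (an assumption of this sketch, not stated in the abstract).
    """
    weights = {p: {} for p in phrase_topics}
    for a, b in cooccurring:
        w = cosine(phrase_topics[a], phrase_topics[b])
        if w > 0:
            weights[a][b] = w  # undirected graph: store both directions
            weights[b][a] = w

    scores = {p: 1.0 for p in phrase_topics}
    for _ in range(iters):
        updated = {}
        for p in phrase_topics:
            # Each neighbour q passes on a share of its score that is
            # proportional to the weight of edge (q, p).
            incoming = sum(
                weights[q][p] / sum(weights[q].values()) * scores[q]
                for q in weights[p]
            )
            updated[p] = (1 - damping) + damping * incoming
        scores = updated

    # Select the top-K scored phrases as keyphrases.
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Hypothetical two-topic distributions for four candidate phrases.
example_topics = {
    "topic model": [0.9, 0.1],
    "keyphrase extraction": [0.8, 0.2],
    "graph model": [0.7, 0.3],
    "web news": [0.1, 0.9],
}
example_edges = [
    ("topic model", "keyphrase extraction"),
    ("topic model", "graph model"),
    ("topic model", "web news"),
]
print(rank_phrases(example_topics, example_edges, top_k=3))
```

In this toy graph, the phrase connected to all the others ranks first, and the phrase whose topic distribution differs most from its neighbour receives the lowest score.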
Keywords/Search Tags: keyphrases, LDA, topic model, semantic features