Keyphrase Extraction Using LDA Topic Models

Posted on:2017-06-26

Degree:Master

Type:Thesis

Country:China

Candidate:X J Liu

Full Text:PDF

GTID:2348330485462238

Subject:Computer Science and Technology

Abstract/Summary:

In the big data era, a vast amount of information is available on the Internet, and capturing key information accurately and rapidly has become important. Massive amounts of textual data are created every day, including web news, research papers, and micro-blogs, and it is hardly possible to process such a volume of data manually. Keyphrases can summarize the topics of articles efficiently and help people understand the content of articles and grasp key information quickly. Keyphrases are the minimum units that represent the core content of a document. They play an important role in many areas, such as automatic document summarization, web information extraction, document classification and clustering, and information retrieval. However, labeling keyphrases manually costs time and labor. Therefore, it is necessary to design algorithms that can extract keyphrases automatically.

In this dissertation, we study keyphrase extraction algorithms and introduce topic models to extract keyphrases. Our main contributions are as follows.

Firstly, we summarize related work on keyphrase extraction from several aspects, including corpus tagging, features of extracted phrases, and length of the text.

Secondly, we introduce topic models because traditional keyphrase extraction methods ignore the relationships between keyphrases and their document. Keyphrases are extracted by combining n-grams with LDA topic models. Since keyphrases should cover the topics of a given document as much as possible and represent its core content, the proposed approach combines the topic distribution with statistical features to extract good keyphrases.

Thirdly, we propose a graph-based keyphrase extraction approach using the LDA topic model. Based on the TextRank graph model, the phrases are used as the vertices of an undirected graph for a given document. Then the LDA topic model is used to calculate the weights of the edges in the graph. Finally, the top K scored phrases are selected as the keyphrases of the document. Experimental results show that the proposed algorithm outperforms the baselines.
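
The graph-based approach in the abstract (phrases as vertices, LDA-derived edge weights, top K selection) can be illustrated with a minimal sketch. The abstract does not state how the edge weights are computed, so this sketch assumes the cosine similarity between the topic distributions of two phrases, and it links every pair of candidate phrases rather than using a co-occurrence window. The function names `cosine` and `rank_phrases`, the parameter values, and the toy topic vectors are all hypothetical; in practice the topic distributions would come from a trained LDA model.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length topic vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_phrases(phrase_topics, top_k=3, damping=0.85, iterations=50):
    """Rank phrases with weighted TextRank on an undirected graph.

    phrase_topics maps each candidate phrase to its topic distribution,
    assumed here to be produced by an LDA model trained elsewhere.
    """
    phrases = list(phrase_topics)

    # Assumption: edge weight = cosine similarity of the two topic vectors.
    weights = {p: {} for p in phrases}
    for p, q in combinations(phrases, 2):
        w = cosine(phrase_topics[p], phrase_topics[q])
        if w > 0:
            weights[p][q] = w
            weights[q][p] = w

    # Weighted TextRank update, repeated for a fixed number of iterations.
    scores = {p: 1.0 for p in phrases}
    for _ in range(iterations):
        new_scores = {}
        for p in phrases:
            rank = 0.0
            for q, w in weights[p].items():
                # Share of q's score passed to p, proportional to edge weight.
                rank += w / sum(weights[q].values()) * scores[q]
            new_scores[p] = (1 - damping) + damping * rank
        scores = new_scores

    return sorted(phrases, key=lambda p: scores[p], reverse=True)[:top_k]


# Toy topic distributions over three topics (made up for illustration).
toy = {
    "topic model": [0.8, 0.1, 0.1],
    "keyphrase extraction": [0.7, 0.2, 0.1],
    "graph ranking": [0.5, 0.4, 0.1],
    "web news": [0.1, 0.1, 0.8],
}
print(rank_phrases(toy, top_k=3))
```

In this toy input, "web news" shares little topical mass with the other candidates, so its edges are weak and it receives a low score. This matches the intuition that selected keyphrases should cover the dominant topics of the document.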

Keywords/Search Tags:

keyphrases, LDA, topic model, semantic features

Related items

1	Research Of Annotation Based On Topic Models And Random Walks
2	Research On Semantic Reinforcement Based On Topic And Word Features For RNN Language Model
3	Topic Discovery From Social Network Texts With Heterogeneous Semantic Features
4	Extract Topical Keyphrases From Chinese Text Corpora
5	Research On Extract Summary Based On Document Multi-dimensional Feature Integration
6	Research On Semantic Representation Of Text Based On Topic Model
7	Research On Evolution Model Of Microblog Topic Based On Time Sequence
8	Research On Microblog Topic Recognition Based On Neuro-semantic Topic
9	Research On Topic Modeling Method Based On Semantic Distribution Similarity
10	Semantic SLAM Based On Topic Model