TF-IDF And Rules Based Automatic Extraction Of Chinese Keywords

Posted on: 2016-01-21
Degree: Master
Type: Thesis
Country: China
Candidate: P Niu
Full Text: PDF
GTID: 2308330461976536
Subject: Computer application technology
Abstract/Summary:
Keyword extraction is a basic problem in natural language processing and supports information retrieval, text clustering, automatic summarization, and similar tasks. It lets users grasp the main content of an article quickly and easily, so they can decide what to do next.
The keyword extraction work in this thesis is divided into two stages: candidate word identification and keyword extraction. Our experiments on Chinese keyword extraction show that segmentation quality and the selection of candidate words affect the performance of the later extraction stage. For this reason, we treat keyword extraction and key phrase extraction as a single task to improve extraction performance, and we pay particular attention to candidate word identification.
For candidate word identification, we propose a method for identifying unknown words that consist of consecutive single-word fragments, and a method for recognizing multi-word expressions. For consecutive single-word fragments, we apply a few simple rules to segment the fragments, which better identifies unknown words of this kind even when they occur only once. For multi-word expressions, we combine POS templates with the LocalMaxs approach. Together these methods identify low-frequency unknown words more reliably, without depending on the size or domain of the corpus.
Previous studies show that TF-IDF has good applicability and scalability and is widely used in keyword extraction. We therefore keep TF-IDF as the main feature and add new features to improve it. Because a word may have different parts of speech within one article, we compute the TF feature in a different way. In addition, with generality in mind, we add only headline weight and word length information to the TF-IDF formula to improve keyword extraction performance.
We present several experiments to verify the performance of the candidate word identification and keyword extraction methods. The keyword extraction experiments also confirm that candidate words play an important role in the later extraction stage. Compared with traditional TF-IDF, the improved TF-IDF method raises precision (P), recall (R), and F-measure (F) by about 5%. Both methods used our candidate word selection.
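The abstract does not give the modified formula, so the following Python sketch is only one plausible way to fold headline weight and word length into a TF-IDF score, together with the P, R and F measures used for evaluation. The multiplicative form, the smoothed IDF, the parameters `title_boost` and `length_boost`, and all function names are assumptions made for illustration, not the thesis's actual method. The POS-aware TF computation is not covered here.

```python
import math
from collections import Counter


def weighted_tfidf(doc_tokens, title_tokens, corpus,
                   title_boost=2.0, length_boost=0.1):
    """Score each candidate word in one document.

    corpus is a list of token sets, one per document.
    title_boost and length_boost are hypothetical parameters.
    """
    n_docs = len(corpus)
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    title = set(title_tokens)
    scores = {}
    for word, count in counts.items():
        # Document frequency: number of documents containing the word.
        df = sum(1 for doc in corpus if word in doc)
        # Smoothed IDF (an assumed variant) so the value is never zero.
        idf = math.log((n_docs + 1) / (df + 1)) + 1.0
        score = (count / total) * idf
        # Assumed headline weight: boost words that appear in the title.
        if word in title:
            score *= title_boost
        # Assumed length weight: longer words get a mild linear bonus.
        score *= 1.0 + length_boost * (len(word) - 1)
        scores[word] = score
    return scores


def top_keywords(scores, k):
    """Return the k highest-scoring words (ties broken alphabetically)."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:k]]


def precision_recall_f(predicted, gold):
    """Standard precision, recall and balanced F-measure."""
    hits = len(set(predicted) & set(gold))
    p = hits / len(predicted) if predicted else 0.0
    r = hits / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


# Toy data, invented for illustration only.
doc = ["关键词", "提取", "关键词", "方法", "的"]
title = ["关键词", "提取"]
corpus = [set(doc), {"方法", "的"}, {"的", "分词"}]

scores = weighted_tfidf(doc, title, corpus)
keywords = top_keywords(scores, 2)
```

In this toy setup the title words outrank the others, and the function word "的", which appears in every document and has length one, receives the lowest score. The candidate list passed in as `doc_tokens` would come from the candidate word identification stage, which is why its quality bounds what the scoring step can achieve.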
Keywords/Search Tags: Keyword extraction, Unknown word recognition, Candidate word selection, TF-IDF