
The Design and Implementation of a Text Topic Key Word Processing System Based on Chinese Word Segmentation

Posted on: 2015-08-30
Degree: Master
Type: Thesis
Country: China
Candidate: Y F Xu
Full Text: PDF
GTID: 2298330467957542
Subject: Computer technology
Abstract/Summary:
With the advance of information technology, the number of online documents is growing rapidly, and information explosion has become a defining feature of the era. Key words are a brief summary of an article. They help the reader grasp its content and save time. They also play an important role in information retrieval, automatic summarization, text clustering, and text classification. As a result, key word extraction has become a key technology for these problems and an important research topic in text mining.

This thesis is based on the educational cloud project of the Chinese Academy of Sciences. The program was developed to meet the need for building subject trees automatically in the education field. Many methods exist for modeling text, such as TF*IDF, the Unigram Model, and PLSA, but we found LDA to be the best model for this task. This thesis first attempted to extract key words from articles by implementing LDA, but the results were poor because of the low accuracy of Chinese word segmentation: current segmentation algorithms do not recognize unregistered (out-of-vocabulary) words well. The program addresses this problem and substantially improves the accuracy of key word extraction by improving the performance of word segmentation. The concrete work is as follows:

1. The implementation of Chinese word segmentation: Building on the Back-Maximum-Matching algorithm, the segmentation component improves accuracy on unregistered words by handling dates, numbers, English text, and names.
2. Text preprocessing: This part lays the foundation for LDA. It includes removing stop words, denoising, and building an inverted term index.
3. The implementation of LDA: This part implements LDA, using Gibbs sampling to estimate the parameters.
4. The integration of the program: This part integrates text input, Chinese word segmentation, text preprocessing, the LDA implementation, and the display of results.

The results show that the program conforms to the design described in this thesis and extracts key words well.
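
The backward maximum matching step named in item 1 can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function names, the `max_len` parameter, the toy dictionaries, and the regular expression that keeps runs of ASCII letters and digits whole are all assumptions made for illustration. Date and name handling are omitted.

```python
import re

# Assumption: runs of ASCII letters/digits (with an optional decimal part)
# are kept as single tokens. This stands in for the number and English
# handling mentioned in the abstract; the thesis's actual rules are not given.
_ASCII_RUN = re.compile(r"[A-Za-z0-9]+(?:\.[0-9]+)?")


def backward_max_match(text, dictionary, max_len=4):
    """Segment text by scanning from its end, taking the longest dictionary word."""
    tokens = []
    end = len(text)
    while end > 0:
        # Try the longest candidate ending at `end` first, then shrink it.
        for size in range(min(max_len, end), 0, -1):
            candidate = text[end - size:end]
            # A single character is always accepted as a fallback token.
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                end -= size
                break
    tokens.reverse()  # tokens were collected right to left
    return tokens


def segment(text, dictionary, max_len=4):
    """Split out ASCII runs first, then segment the remaining spans."""
    tokens = []
    pos = 0
    for match in _ASCII_RUN.finditer(text):
        if match.start() > pos:
            tokens.extend(
                backward_max_match(text[pos:match.start()], dictionary, max_len)
            )
        tokens.append(match.group())
        pos = match.end()
    if pos < len(text):
        tokens.extend(backward_max_match(text[pos:], dictionary, max_len))
    return tokens


if __name__ == "__main__":
    toy_dictionary = {"研究", "研究生", "生命", "的", "起源"}
    print(segment("研究生命的起源", toy_dictionary))
```

Scanning from the end of the string is what distinguishes this from forward maximum matching: on the toy input above, a forward scan would greedily take "研究生" first, while the backward scan reaches "生命" before it considers "研究".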
Keywords/Search Tags:Topic Model, Key Word Extraction, Chinese Word Segmentation, Text Mining