Research Of Keywords Extraction Algorithm For Chinese Text Based On Gene Expression Programming

Posted on:2010-02-14

Degree:Master

Type:Thesis

Country:China

Candidate:R X Guan

Full Text:PDF

GTID:2178330338475926

Subject:Computer application technology

Abstract/Summary:

With the rapid development of Internet technology, information technology is profoundly affecting people's lives. Blogs, electronic documents, and other data content make up an ocean of data, and users urgently need highly effective text information processing services. Text information processing consists of text categorization, text clustering, text mining, and approximate query processing, and keyword extraction is widely used in all of these areas. It is not only indispensable for information retrieval but also an important step in building a library. The aim of keyword extraction is to automatically select the subject words that accurately reflect the content of a text. Although considerable research effort has been made overseas, research on Chinese keyword extraction is still in its infancy.

Firstly, the basic concepts of natural language processing, text preprocessing, and feature items are introduced. Well-known systems and commonly used algorithms for keyword extraction are compared and analyzed, including the GenEx system for English text, the Naive Bayes algorithm, the maximum entropy model, and the PAT tree for Chinese text. We also classify the existing work into three categories.

Secondly, we present our keyword extraction algorithm for Chinese text, the Term Frequency, Location & Distance (TFLD) algorithm. The algorithm is based on three characteristic properties of candidate words: term frequency, location, and distance.

The weight computation model for keyword candidates is critical to TFLD. We use Gene Expression Programming (GEP) to obtain the weight computation model; GEP constructs an expression tree to optimize the impact of words on the computation model. With the training set known a priori, the sum of the variance between the training set and the automatically extracted keywords is used as the fitness function. The evolutionary optimization technique ensures that the weight computation model of TFLD meets the threshold constraints.

In addition, based on the Least Mean Square (LMS) rule, we use learning techniques to obtain the scale parameters of the weight computation model of TFLD.

Finally, evaluation experiments are conducted to compare TFLD with its counterparts. The results show that a considerable improvement can be obtained in keyword extraction for Chinese text.
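
As a rough illustration of the LMS step mentioned above, the following Python sketch fits scale parameters for a weight model over three candidate-word features. It is not the model from the thesis, which the abstract does not specify: the linear form of `weight`, the feature scaling to [0, 1], the learning rate, and the toy training data are all assumptions made for this example. The GEP search for the expression tree is not shown; `squared_error` is only a simplified stand-in for a fitness measure that compares labeled keywords with predicted weights.

```python
from typing import List, Sequence, Tuple

# Hypothetical feature vector for one candidate word:
# (term frequency, location score, distance score), each assumed scaled to [0, 1].
Features = Tuple[float, float, float]
Sample = Tuple[Features, float]


def weight(params: Sequence[float], feats: Features) -> float:
    """Assumed linear weight model: a weighted sum of the three features."""
    return sum(p * f for p, f in zip(params, feats))


def lms_fit(samples: List[Sample], rate: float = 0.1, epochs: int = 200) -> List[float]:
    """Fit the scale parameters with the LMS (Widrow-Hoff) update rule."""
    params = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for feats, target in samples:
            error = target - weight(params, feats)
            # Step each parameter against the gradient of the squared error.
            params = [p + rate * error * f for p, f in zip(params, feats)]
    return params


def squared_error(params: Sequence[float], samples: List[Sample]) -> float:
    """Sum of squared differences between labels and predicted weights."""
    return sum((t - weight(params, f)) ** 2 for f, t in samples)


# Toy training set (made-up numbers): label 1.0 marks a keyword, 0.0 a non-keyword.
training: List[Sample] = [
    ((0.9, 1.0, 0.8), 1.0),
    ((0.8, 0.5, 0.9), 1.0),
    ((0.2, 0.0, 0.1), 0.0),
    ((0.1, 0.5, 0.2), 0.0),
]

params = lms_fit(training)
```

After fitting, candidates could be ranked by `weight(params, feats)` and the top-scoring words kept as keywords; how the thesis applies its threshold constraints is not described in the abstract.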

Keywords/Search Tags:

keywords extraction, Gene Expression Programming, feature items, Chinese text

Related items

1	Automatic Extraction Of Keywords And Text Summarization In Text Mining
2	Research And Implementation Of Chinese Text Categorization Methods Based On Tree-like Keywords Set
3	Gene Name Recognition Feature Selection Methods In Biomedical Research Text
4	Research On Chinese Text Localization Methods In Natural Scene Images
5	Research And Implementation Of Text Mining Technology Based On Public Security Information
6	Issues In TCM Text Mining
7	Study Of Chinese Text Similarity Based On Number Difference Gene
8	Research On Automatic Question Answering System In Restricted Domain Based On Chinese Weighted Keywords Tree
9	A Study On Key Issues Of Automated Text Categorization For Chinese Documents
10	Research On Automatic Chinese Text Summarization Of Web-oriented Text Mining