
Research Of Keywords Extraction Algorithm For Chinese Text Based On Gene Expression Programming

Posted on: 2010-02-14
Degree: Master
Type: Thesis
Country: China
Candidate: R X Guan
Full Text: PDF
GTID: 2178330338475926
Subject: Computer application technology
Abstract/Summary:
With the rapid development of Internet technology, information technology is profoundly affecting people's lives. Blogs, electronic documents and other data content make up an ocean of data, and users urgently need highly effective text information processing services. Text information processing consists of text categorization, text clustering, text mining and approximate query processing, and keyword extraction is widely used in all of these areas. It is not only indispensable for information retrieval but also an important step in building the library. The aim of keyword extraction is to automatically select the subject words that accurately reflect the content of a text. Although there has been considerable research effort overseas, research on Chinese keyword extraction is still in its infancy.

Firstly, the basic concepts of natural language processing, text preprocessing and feature items are introduced. The well-known systems and commonly used algorithms for keyword extraction are compared and analyzed, including the GenEx system for English text, the Naive Bayes algorithm, the maximum entropy model and the PAT tree for Chinese text. We also classify the existing work into three categories.

Secondly, we present our keyword extraction algorithm for Chinese text, the Term Frequency, Location & Distance algorithm (TFLD). The algorithm is based on the three characteristic properties. The weight computation model for keyword candidates is critical for TFLD. We use Gene Expression Programming (GEP) techniques to obtain the weight computation model: GEP constructs an expression tree to optimize the impact of words on the computation model. Based on a training set known a priori, the sum of the variance between the training set and the automatically extracted keywords is used as the fitness function. The evolutionary optimization technique ensures that the weight computation model of TFLD meets the threshold constraints. In addition, based on the Least Mean Square rule, we use learning techniques to obtain the scale parameters of the weight computation model of TFLD.

Finally, evaluation experiments are conducted to compare TFLD with its counterparts. The results show that a considerable improvement can be obtained in keyword extraction for Chinese text.
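
To make the roles of the fitness function and the Least Mean Square step concrete, here is a minimal Python sketch. It is not the thesis's implementation. It assumes a fixed linear weight model over three hypothetical features (term frequency, location, distance, each scaled to [0, 1]) in place of the GEP-evolved expression tree. The function names, feature values and learning rate are illustrative assumptions.

```python
# Assumption: the weight model is a fixed linear form, score = a*tf + b*loc + c*dist.
# The thesis evolves the model form with GEP; this sketch only illustrates
# the LMS update of scale parameters and a sum-of-squared-error fitness.

def score(params, features):
    """Weight of one keyword candidate under the assumed linear model."""
    return sum(p * x for p, x in zip(params, features))

def fitness(params, samples, targets):
    """Sum of squared differences between model weights and training labels.
    Lower is better."""
    return sum((t - score(params, x)) ** 2 for x, t in zip(samples, targets))

def lms_fit(samples, targets, lr=0.05, epochs=200):
    """Learn scale parameters with the LMS (Widrow-Hoff) rule."""
    params = [0.0] * len(samples[0])
    for _ in range(epochs):
        for x, t in zip(samples, targets):
            err = t - score(params, x)
            # Move each parameter along its feature, proportional to the error.
            params = [p + lr * err * xi for p, xi in zip(params, x)]
    return params

# Hypothetical training data: (term frequency, location, distance) per candidate,
# labeled 1.0 for a known keyword and 0.0 otherwise.
samples = [(0.9, 1.0, 0.8), (0.8, 0.0, 0.7), (0.2, 0.0, 0.1), (0.1, 1.0, 0.2)]
targets = [1.0, 1.0, 0.0, 0.0]

initial = [0.0, 0.0, 0.0]
trained = lms_fit(samples, targets)
```

In a GEP setting, a fitness of this kind would be evaluated for each candidate expression tree, while an LMS-style update could tune the numeric scale parameters of the selected model.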
Keywords/Search Tags: keyword extraction, Gene Expression Programming, feature items, Chinese text