
Studies On And Implementation Of Selected Topics In Chinese Information Processing

Posted on: 2009-12-22
Degree: Master
Type: Thesis
Country: China
Candidate: L J Luo
Full Text: PDF
GTID: 2178360242474992
Subject: Computer application technology
Abstract/Summary:
With the rapid development of the Internet, resources of all kinds are continuously increasing. Because information must be found quickly and efficiently, information processing (IP) has become one of the most important research fields. In this thesis, we discuss some key issues in Chinese information processing. The major contents and contributions are as follows:

Implementation of a dictionary-based syntactic parsing method. By mapping grammar rules to feature words, parsing is converted into generating a syntactic decision tree from the features of terms, so that the rule-based method and the probabilistic method are effectively combined. In our closed test, the precision and recall are 87.13% and 89.40%, respectively.

An improvement of the K-means clustering method based on sample distance. This effectively avoids errors caused by the choice of the initial points, as well as the impact of noise and isolated points.

The thesis also introduces a variety of dictionary structures for storing corpus vocabulary, and how these structures are used. Based on the characteristics of words, multi-level hash storage is combined with forward maximum matching to obtain a fast segmentation algorithm; with 1 GB of RAM, the word segmentation speed is 2 megabytes per second. By combining Hidden Markov Model POS tagging with a smoothing algorithm, we obtain a tagging precision of 86% and a disambiguation rate of 82%. In the KNN-based classification algorithm, the CHI statistic is used to select feature words, and by attaching the documents of the relevant category to these words, we have solved the problem of redundant information. Using sentence features and term weighting, we have implemented statistical (mechanical) automatic text summarization.
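The segmentation step mentioned above (forward maximum matching over a hash-based dictionary) can be illustrated with a minimal sketch. The abstract does not give the thesis's actual data structure, so the two-level layout used here (first character, then a set of words starting with it), the function names `build_dictionary` and `forward_max_match`, the `max_len` parameter, and the toy vocabulary are all assumptions made for illustration only.

```python
def build_dictionary(words):
    """Hypothetical two-level hash: first character -> set of words starting with it."""
    index = {}
    for word in words:
        index.setdefault(word[0], set()).add(word)
    return index


def forward_max_match(text, index, max_len=4):
    """Greedy left-to-right segmentation: at each position take the longest
    dictionary word of at most max_len characters; fall back to one character."""
    tokens = []
    i = 0
    while i < len(text):
        candidates = index.get(text[i], set())
        # Try the longest window first, shrinking until a match is found.
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if length == 1 or piece in candidates:
                tokens.append(piece)
                i += length
                break
    return tokens


# Toy vocabulary (assumed, not taken from the thesis).
vocab = ["中国", "中国人", "人民", "信息", "处理", "信息处理"]
index = build_dictionary(vocab)
print(forward_max_match("中国人民信息处理", index))
# -> ['中国人', '民', '信息处理']
```

The output shows a known limitation of greedy forward matching: "中国人" is taken first, which leaves "民" stranded instead of producing "中国" and "人民". This is the kind of ambiguity that a later disambiguation step has to resolve.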
Using the vector space model, combined with synonym expansion and an inverted file storage structure, we have implemented a simple information retrieval system.
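The retrieval pipeline described above can be sketched as follows, assuming documents are already segmented into tokens. The abstract does not specify the weighting scheme or the synonym resource, so the TF-IDF weights, cosine scoring, the function names `build_index` and `search`, and the tiny synonym table are illustrative assumptions and not the thesis's implementation.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """docs: dict of doc_id -> list of tokens.
    Returns (inverted index, idf table, document vector norms)."""
    postings = defaultdict(dict)  # term -> {doc_id: term frequency}
    for doc_id, tokens in docs.items():
        for term, tf in Counter(tokens).items():
            postings[term][doc_id] = tf
    n_docs = len(docs)
    # Smoothed idf (an assumed choice) keeps every weight strictly positive.
    idf = {t: math.log(n_docs / len(p)) + 1.0 for t, p in postings.items()}
    sq_norms = defaultdict(float)
    for term, plist in postings.items():
        for doc_id, tf in plist.items():
            sq_norms[doc_id] += (tf * idf[term]) ** 2
    norms = {d: math.sqrt(v) for d, v in sq_norms.items()}
    return postings, idf, norms


def search(query_tokens, postings, idf, norms, synonyms=None):
    """Expand the query with synonyms, then rank documents by cosine similarity."""
    synonyms = synonyms or {}
    expanded = list(query_tokens)
    for term in query_tokens:
        expanded.extend(synonyms.get(term, []))
    scores = defaultdict(float)
    q_sq_norm = 0.0
    for term, q_tf in Counter(expanded).items():
        if term not in postings:
            continue  # terms unknown to the index contribute nothing
        w_q = q_tf * idf[term]
        q_sq_norm += w_q * w_q
        # Only documents in this term's posting list are touched.
        for doc_id, tf in postings[term].items():
            scores[doc_id] += w_q * tf * idf[term]
    q_norm = math.sqrt(q_sq_norm)
    ranked = [(d, s / (norms[d] * q_norm)) for d, s in scores.items()]
    return sorted(ranked, key=lambda pair: -pair[1])


# Toy collection and synonym table (assumed for illustration).
docs = {
    "d1": ["computer", "science"],
    "d2": ["cooking", "recipes"],
    "d3": ["machine", "learning", "computer"],
}
postings, idf, norms = build_index(docs)
print(search(["PC"], postings, idf, norms, synonyms={"PC": ["computer"]}))
```

Without the synonym table the query "PC" matches nothing; with it, the expanded query reaches the documents containing "computer", and the shorter document ranks first because its vector is more concentrated on the matched term.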
Keywords/Search Tags: Chinese information processing, Corpus, syntax decision tree