
Research and Implementation of Key Technologies for Automatic Chinese Word Segmentation

Posted on: 2009-04-11
Degree: Master
Type: Thesis
Country: China
Candidate: W Shi
Full Text: PDF
GTID: 2208360245961642
Subject: Computer software and theory
Abstract/Summary:
In recent years, with the spread of computers and the growth of information technology, Chinese information processing has come into wide use. Automatic Chinese word segmentation is the foundation of Chinese information processing. It has been applied in text proofreading, machine translation, text classification, automatic summarization, information retrieval, human-computer interfaces, and other areas, and has become a leading research topic in the field.

This thesis improves on traditional Chinese automatic segmentation systems and current segmentation techniques in several respects, and implements a practical and efficient prototype segmentation system. It focuses on methods for automatic Chinese word segmentation, the handling of segmentation ambiguity, and the segmentation of special words. The following work was carried out:

1. In the preliminary segmentation module, a new segmentation method based on words with special characteristics is proposed, and corresponding rules for those words are constructed. Using the features of these words to make the initial cuts both increases segmentation speed and resolves some ambiguities.
2. In the precise segmentation module, the traditional forward maximum matching algorithm is improved. The improved algorithm dynamically determines the length of the candidate word to match, which reduces the average number of matching operations and fully embodies the "long word first" principle, so segmentation is faster.
3. Ambiguity is eliminated in multiple stages. First, the word characteristics used by the preliminary segmentation module remove part of the ambiguity. The ambiguous fields that remain are resolved in the ambiguity elimination module, which applies the "long word first" principle to further correct the segmentation results.
4. In the dictionary design, entries are divided into two categories: true words and word prefixes. This organization makes lookup during segmentation easier and further increases segmentation speed.
5. For named entities that appear in Chinese text, such as Chinese and foreign personal names, place names, time expressions, and numbers, the thesis analyzes their characteristics and designs corresponding segmentation methods.

Experiments show that the prototype segments text at high speed, averaging 445,348 Chinese characters per second, with a segmentation accuracy of 98.08%, which indicates that the system performs well.
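
The abstract does not give the details of the improved forward maximum matching algorithm, so the following is only a minimal Python sketch of one plausible reading of "dynamically determining the length of the matching word": the match window is capped by the longest dictionary word that starts with the current character, instead of by one global maximum. The function names, the per-character length index, and the toy dictionary are illustrative assumptions, not the thesis's implementation.

```python
def build_length_index(words):
    """Map each first character to the length of the longest word starting with it.

    This index is an assumed design for illustration; the thesis abstract
    does not describe its actual dictionary layout.
    """
    max_len = {}
    for word in words:
        if word:
            first = word[0]
            max_len[first] = max(max_len.get(first, 0), len(word))
    return max_len


def forward_maximum_match(text, words):
    """Greedy left-to-right segmentation that always tries the longest candidate first."""
    dictionary = set(words)
    max_len = build_length_index(dictionary)
    tokens = []
    i = 0
    while i < len(text):
        # The window depends on the current character, so characters that
        # start no dictionary word cost a single step instead of a full scan.
        limit = min(max_len.get(text[i], 1), len(text) - i)
        # Try the longest candidate first, then shrink one character at a time.
        for size in range(limit, 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback token.
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


if __name__ == "__main__":
    toy_dictionary = ["研究", "研究生", "生命", "起源", "的"]
    print(forward_maximum_match("研究生命的起源", toy_dictionary))
```

With this toy dictionary the sketch returns `['研究生', '命', '的', '起源']`. The greedy choice of the longest word "研究生" leaves "命" stranded, which is the kind of overlap ambiguity that a separate ambiguity elimination stage has to correct.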
Keywords/Search Tags: automatic Chinese word segmentation, ambiguous word segmentation, maximum matching, name identification