Corpus construction is fundamental work in Chinese information processing. Processing a Chinese corpus includes Chinese word segmentation and part-of-speech tagging. Both are widely used in research areas such as automatic retrieval of Chinese text, machine translation, and Chinese character recognition, and they provide important resources for that research.

The effective use of a corpus depends strongly on the level and quality of its processing. A great deal of software has been written for Chinese corpus processing, with considerable success, but its output still does not fully meet practical needs and requires further improvement.

This paper aims to improve the accuracy of Chinese word segmentation and part-of-speech tagging, and studies the two phases separately:

1. It reviews the current state of Chinese word segmentation and describes a rule-based approach to correcting segmentation automatically. The approach compares the machine-segmented corpus with the correctly segmented corpus, acquires correction rules from the differences, and then corrects the corpus automatically using these rules.

2. It reviews the current state of Chinese part-of-speech tagging and describes an approach to correcting the tagging automatically. The approach mines rules from a correctly tagged corpus using rough set methods, and then applies them to correct the results of part-of-speech tagging automatically.

3. An experimental system for correcting Chinese word segmentation and part-of-speech tagging has been designed and implemented. For word segmentation correction, the closed-test and open-test results are 93.75% and 81.05% respectively; for part-of-speech tagging correction, the closed-test and open-test results are 90.40% and 84.85% respectively.
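
The rule-acquisition idea in item 1 (compare the machine-segmented corpus with the correct one, extract rules, then apply them) can be illustrated with a minimal Python sketch. This is not the system described in the paper: the function names `diff_chunks`, `learn_rules`, and `apply_rules` are hypothetical, and the rules here are simple context-free rewrites of one word sequence into another, which is an assumption made for brevity. The paper's actual rule format is not given in this abstract.

```python
from collections import Counter


def diff_chunks(machine, gold):
    """Align two segmentations of the same sentence and return the
    regions where they disagree as (machine_words, gold_words) pairs."""
    if "".join(machine) != "".join(gold):
        raise ValueError("segmentations must cover the same characters")
    i = j = 0
    chunks = []
    while i < len(machine) and j < len(gold):
        m_chunk, g_chunk = [machine[i]], [gold[j]]
        m_len, g_len = len(machine[i]), len(gold[j])
        i += 1
        j += 1
        # Extend the shorter side until both chunks end on a shared boundary.
        while m_len != g_len:
            if m_len < g_len:
                m_chunk.append(machine[i])
                m_len += len(machine[i])
                i += 1
            else:
                g_chunk.append(gold[j])
                g_len += len(gold[j])
                j += 1
        if m_chunk != g_chunk:
            chunks.append((tuple(m_chunk), tuple(g_chunk)))
    return chunks


def learn_rules(machine_sents, gold_sents):
    """Map each wrongly segmented word sequence to its most frequent
    correction (assumed rule format: context-free rewrite)."""
    counts = {}
    for machine, gold in zip(machine_sents, gold_sents):
        for wrong, right in diff_chunks(machine, gold):
            counts.setdefault(wrong, Counter())[right] += 1
    return {wrong: c.most_common(1)[0][0] for wrong, c in counts.items()}


def apply_rules(words, rules):
    """Rewrite a segmented sentence left to right, preferring the
    rule that matches the longest run of words."""
    max_len = max((len(k) for k in rules), default=0)
    out = []
    i = 0
    while i < len(words):
        for n in range(min(max_len, len(words) - i), 0, -1):
            key = tuple(words[i:i + n])
            if key in rules:
                out.extend(rules[key])
                i += n
                break
        else:
            out.append(words[i])
            i += 1
    return out


if __name__ == "__main__":
    machine = [["研究生", "命", "的", "起源"]]
    gold = [["研究", "生命", "的", "起源"]]
    rules = learn_rules(machine, gold)
    print(apply_rules(["他", "研究生", "命", "科学"], rules))
    # prints ['他', '研究', '生命', '科学']
```

Rules of this kind ignore the surrounding context, so they can over-correct sequences that were segmented properly; a practical system would likely condition each rule on neighbouring words or tags.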