Research And Implementation Of Chinese Auto-segmentation System

Posted on: 2006-05-10
Degree: Master
Type: Thesis
Country: China
Candidate: J Y Dai
GTID: 2168360155472930
Subject: Computer application technology
Abstract/Summary:
Natural language processing (NLP) is an important branch of artificial intelligence. Chinese auto-segmentation, the automatic recognition of words in Chinese text, is both a foundation of Chinese NLP and one of its crucial problems. Despite existing research in this field, problems remain in practical applications that require further study. The main objective of this research is to design and implement a Chinese auto-segmentation system. After analyzing the main difficulties, and in order to reduce the difficulty of segmentation and improve its precision, this research designs and implements a system based on a multi-step processing strategy. The main work of this paper is as follows. First, it introduces the language models and algorithms used in Chinese auto-segmentation systems and presents an algorithm for resolving the ambiguity of Chinese time words, that is, expressions that denote either exact times or periods of time. The algorithm achieves 90% accuracy, which indicates that it is effective. Second, it collects, organizes, and builds the natural language resources the study requires: mainly a manually segmented and tagged corpus, a raw corpus, a dictionary, and a knowledge base. The handling of non-Chinese characters and Chinese numeral strings in text is also studied. The core of the paper is the design and implementation of a Chinese auto-segmentation system based on a multi-step processing strategy. The system comprises modules for initial segmentation, POS tagging, ambiguity processing, model smoothing, and unknown word recognition. Initial segmentation finds the candidate segmentation paths in each sentence.
Ambiguity processing eliminates ambiguities using a bigram model or a POS-tagging model, and resolves combinational ambiguities with an SVM. Unknown word recognition is implemented with a POS-based detection method. Model smoothing is applied within both POS tagging and ambiguity processing. Finally, the paper validates the system's performance experimentally. Measured against manual segmentation, the system reaches a precision of 96.94%, at a speed of 1,000 to 1,400 words per second. Although its overall results and precision are not as good as those of ICTCLAS, a Chinese auto-segmentation system developed by the Chinese Academy of Sciences (CAS), some of the new methods should be helpful for future study. The paper closes by summarizing the work and proposing directions for further research.
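As a rough illustration of two of the steps described in the abstract, finding the candidate routes through a sentence and then choosing among them with a bigram model, the following minimal Python sketch may help. It is not the thesis's implementation. The toy corpus, the dictionary, and all function names are hypothetical, and add-one smoothing is used only as a stand-in because the abstract does not name the smoothing technique.

```python
import math
from collections import Counter

# Toy hand-segmented corpus; hypothetical data, not taken from the thesis.
CORPUS = [
    ["研究", "生命", "的", "起源"],
    ["研究生", "的", "生活"],
    ["生命", "的", "意义"],
]
# Dictionary: every corpus word plus one extra single-character word.
DICTIONARY = {w for sent in CORPUS for w in sent} | {"命"}
MAX_WORD_LEN = max(len(w) for w in DICTIONARY)
BOS, EOS = "<s>", "</s>"  # sentence boundary markers


def train_bigrams(corpus):
    """Count history words and adjacent word pairs, including boundaries."""
    contexts, bigrams = Counter(), Counter()
    for sent in corpus:
        tokens = [BOS] + sent + [EOS]
        contexts.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))
    return contexts, bigrams


def candidate_paths(sentence, start=0):
    """Yield every segmentation built from dictionary words.

    A single character is always accepted as a fallback, so each
    sentence has at least one path.
    """
    if start == len(sentence):
        yield []
        return
    longest = min(len(sentence), start + MAX_WORD_LEN)
    for end in range(start + 1, longest + 1):
        word = sentence[start:end]
        if word in DICTIONARY or end == start + 1:
            for rest in candidate_paths(sentence, end):
                yield [word] + rest


def log_prob(path, contexts, bigrams, vocab_size):
    """Log-probability of a path under an add-one-smoothed bigram model."""
    tokens = [BOS] + path + [EOS]
    return sum(
        math.log((bigrams[(prev, word)] + 1) / (contexts[prev] + vocab_size))
        for prev, word in zip(tokens, tokens[1:])
    )


def segment(sentence):
    """Return the candidate path with the highest bigram score."""
    contexts, bigrams = train_bigrams(CORPUS)
    vocab_size = len(DICTIONARY) + 1  # dictionary words plus the end marker
    return max(
        candidate_paths(sentence),
        key=lambda path: log_prob(path, contexts, bigrams, vocab_size),
    )


if __name__ == "__main__":
    print("/".join(segment("研究生命的起源")))
```

The sentence 研究生命的起源 can be split as 研究/生命/的/起源 or as 研究生/命/的/起源. Both appear among the candidate paths, and the bigram counts from the toy corpus favor the first. Enumerating every path is exponential in sentence length, so a practical system would search the word lattice with dynamic programming instead.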
Keywords/Search Tags: natural language processing, Chinese auto-segmentation, statistical language model, time word