Research And Implementation Of Chinese Auto-segmentation System

Posted on:2006-05-10

Degree:Master

Type:Thesis

Country:China

Candidate:J Y Dai

GTID:2168360155472930

Subject:Computer application technology

Abstract/Summary:

Natural language processing (NLP) is an important branch of artificial intelligence, and Chinese automatic word segmentation is both a foundation of NLP for Chinese and one of its central problems. A Chinese auto-segmentation system automatically identifies the words in Chinese text. Although this field has been studied for some time, problems remain in practical applications that require further research. The main objective of this research is to design and implement a Chinese auto-segmentation system. After analyzing the main difficulties, and in order to reduce the difficulty of segmentation and improve its precision, this research designs and implements a Chinese auto-segmentation system based on a multi-step processing strategy.

The main work of this thesis is as follows. First, it introduces the language models and algorithms used in Chinese auto-segmentation systems and presents an algorithm for resolving the ambiguity of Chinese time words, that is, expressions denoting either an exact point in time or a period of time. The algorithm achieves 90% accuracy, which indicates that it is effective. Second, it collects, organizes, and builds the natural language resources the study requires: the collection, processing, and curation of a manually segmented and tagged corpus; the collection and processing of a raw corpus; and the construction of the dictionary and the knowledge base. The handling of non-Chinese characters and Chinese numeral strings in text is also studied. The core of the thesis is the design and implementation of the multi-step segmentation system. The system comprises modules for initial segmentation, POS tagging, ambiguity processing, model smoothing, and unknown word recognition. Initial segmentation finds the candidate segmentation paths in a sentence. Ambiguity processing eliminates ambiguities using a bigram model or a POS-tagging model, and resolves combinational ambiguities with an SVM. Unknown word recognition is implemented with a POS-based detection method. Model smoothing is applied within both POS tagging and ambiguity processing.

Finally, the thesis validates the system's performance experimentally. Measured against manual segmentation, the system reaches a precision of 96.94%, at a speed of 1,000 to 1,400 words per second. Although its results and precision are not as good as those of ICTCLAS, a Chinese auto-segmentation system developed by CAS, some of the new methods should be useful for future study. The thesis closes with a summary of the work and proposals for further work.
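
As a rough illustration of two of the steps named in the abstract (finding candidate segmentation paths, then choosing among them with a bigram model), the following Python sketch may help. It is not the thesis's implementation. The toy lexicon, the bigram counts, the function names, and the choice of add-one smoothing are all assumptions made for this example; the abstract does not say which smoothing method or data the system uses.

```python
import math
from collections import Counter

# Hypothetical toy lexicon and bigram counts, invented for illustration only.
LEXICON = {"研究", "研究生", "生命", "的", "起源"}
BIGRAMS = Counter({
    ("<s>", "研究"): 3,
    ("研究", "生命"): 2,
    ("生命", "的"): 4,
    ("的", "起源"): 3,
    ("<s>", "研究生"): 2,
    ("研究生", "的"): 2,
})


def candidate_paths(sentence, lexicon, max_len=4):
    """Yield every segmentation whose multi-character words are in the lexicon."""
    if not sentence:
        yield []
        return
    for end in range(1, min(max_len, len(sentence)) + 1):
        word = sentence[:end]
        # Single characters are always allowed so at least one path exists.
        if end == 1 or word in lexicon:
            for rest in candidate_paths(sentence[end:], lexicon, max_len):
                yield [word] + rest


def bigram_log_prob(path, bigrams, vocab_size):
    """Score a path with add-one smoothed bigram probabilities (an assumption)."""
    # Total count of each context word, derived from the bigram table.
    context_totals = Counter()
    for (prev, _), count in bigrams.items():
        context_totals[prev] += count
    log_prob = 0.0
    prev = "<s>"  # sentence-start marker
    for word in path:
        numerator = bigrams[(prev, word)] + 1  # unseen bigrams count as 1
        denominator = context_totals[prev] + vocab_size
        log_prob += math.log(numerator / denominator)
        prev = word
    return log_prob


def segment(sentence, lexicon=LEXICON, bigrams=BIGRAMS):
    """Return the candidate path with the highest smoothed bigram score."""
    return max(
        candidate_paths(sentence, lexicon),
        key=lambda path: bigram_log_prob(path, bigrams, len(lexicon)),
    )


if __name__ == "__main__":
    print("/".join(segment("研究生命的起源")))
```

With these toy counts, the sentence 研究生命的起源 has two plausible readings, 研究/生命/的/起源 and 研究生/命/的/起源, and the smoothed bigram score favors the first. Enumerating every path is exponential in sentence length, so a practical system would more likely search a word lattice with dynamic programming; the exhaustive version is used here only to keep the sketch short.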

Keywords/Search Tags:

natural language processing, Chinese auto-segmentation, statistical language model, time word

Related items

1	Study On Chinese Word Segmentation Based On Recurrent Neural Network Language Model
2	Study On Chinese Named Entity Recognition
3	The Methodology And Implementation Of Chinese Natural Language Query In Databases
4	Research And Implementation Of Chinese Word Segmentation Algorithm
5	Chinese Word Segmentation Model Based On Improved Bidirectional LSTM-CRF
6	Research On Chinese Word Segmentation Integrating Pinyin And Tone Information
7	Research On Chinese Word Segmentation Based On Text And Audio
8	The Study And Analysis Of Oracle Bone Inscriptions Based On Statistical Natural Language Processing
9	Natural Language Processing-A Study Of Vectorization Of Chinese Words And Short Texts
10	Research On Chinese Word Segmentation Methods Based On Deep Learning