
Study On Disambiguation Algorithm For Chinese Word Segmentation

Posted on: 2006-04-02 | Degree: Master | Type: Thesis
Country: China | Candidate: Y Z Liu | Full Text: PDF
GTID: 2168360155972930 | Subject: Computer software and theory
Abstract/Summary:
Natural Language Processing (NLP) is both an important direction in computer science and an important branch of Artificial Intelligence, and Chinese word segmentation is the foundation of Chinese NLP. Among the many factors that have hindered the development of Chinese word segmentation, the ambiguity problem is the most important. Focusing on ambiguity in Chinese word segmentation, this thesis examines the merits and shortcomings of existing disambiguation approaches and finds that they do not meet the needs of practical applications. Building on this analysis, two disambiguation algorithms are proposed; together they form an independent module that is applied in an "Automatic Chinese Word Segmentation System".

The thesis first gives a brief introduction to NLP. It then points out that, unlike English, written Chinese has no natural segmentation marks between words, which is one of the biggest differences between Chinese and English in NLP research; an input text must therefore pass through an automatic segmentation stage before any further processing. With the development of automatic segmentation, more and more researchers have focused on the ambiguity problem. The main part of the thesis studies disambiguation algorithms. Methods for recognizing ambiguity are introduced first; after analyzing the advantages and insufficiencies of these methods, two algorithms are presented. The first, denoted HB, is based on Hidden Markov Models and a word bigram model and handles crossing ambiguity. Its key idea is to combine the word bigram with a part-of-speech (POS) bigram, so that POS information provides a new way to resolve crossing ambiguity. The second, denoted SR, is based on Support Vector Machines (SVM) and rules and handles combinatorial ambiguity. Its key idea is to resolve combinatorial ambiguity by combining SVM classification with part-of-speech rules.

In the final part of the thesis, the two algorithms are applied in the "Automatic Chinese Word Segmentation System" developed by our team. In open and closed tests on various types of Chinese corpora, the two algorithms are compared with the ICTCLAS system developed by the Chinese Academy of Sciences. The experimental results show that segmentation accuracy on ambiguous strings is greatly improved, which demonstrates that the two algorithms are feasible for resolving ambiguity. The thesis closes with a conclusion and suggestions for future research.
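To make the HB idea concrete, the sketch below scores competing segmentations of an ambiguous span by interpolating word-bigram and POS-bigram log-probabilities. This is only an illustration of the general technique named in the abstract: the probability tables, the back-off value, the interpolation weight `lam`, and the function names are hypothetical placeholders, not the author's actual HB implementation, whose details are not given in this record.

```python
# Illustrative sketch only: probability values, back-off, and the linear
# interpolation in log space are assumptions, not the thesis's HB algorithm.

# Toy word-bigram and POS-bigram log-probability tables (hypothetical values).
WORD_BIGRAM_LOGP = {
    ("<s>", "w1"): -1.2, ("w1", "w3"): -2.0, ("w3", "</s>"): -0.8,
    ("<s>", "w2"): -1.5, ("w2", "w4"): -3.1, ("w4", "</s>"): -0.9,
}
POS_BIGRAM_LOGP = {
    ("<s>", "n"): -0.7, ("n", "v"): -1.0, ("v", "</s>"): -0.6,
    ("<s>", "v"): -1.1, ("v", "n"): -1.3, ("n", "</s>"): -0.5,
}
UNSEEN_LOGP = -10.0  # crude back-off score for unseen bigrams


def score_candidate(words, tags, lam=0.6):
    """Score one segmentation candidate by interpolating word-bigram and
    POS-bigram log-probabilities; lam weights the word model."""
    padded_w = ["<s>"] + list(words) + ["</s>"]
    padded_t = ["<s>"] + list(tags) + ["</s>"]
    word_lp = sum(WORD_BIGRAM_LOGP.get(bg, UNSEEN_LOGP)
                  for bg in zip(padded_w, padded_w[1:]))
    pos_lp = sum(POS_BIGRAM_LOGP.get(bg, UNSEEN_LOGP)
                 for bg in zip(padded_t, padded_t[1:]))
    return lam * word_lp + (1.0 - lam) * pos_lp


def resolve_crossing_ambiguity(candidates, lam=0.6):
    """Return the highest-scoring (words, tags) candidate for an ambiguous span."""
    return max(candidates, key=lambda c: score_candidate(c[0], c[1], lam))


if __name__ == "__main__":
    # Two competing segmentations of the same ambiguous fragment.
    candidates = [
        (("w1", "w3"), ("n", "v")),
        (("w2", "w4"), ("v", "n")),
    ]
    print(resolve_crossing_ambiguity(candidates))
```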
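Similarly, the SR algorithm is described only as combining SVM classification with POS rules to decide combinatorial ambiguity (whether an ambiguous span is one word or should be split). The sketch below shows one plausible shape of such a decision, assuming scikit-learn's SVC; the feature encoding, the training data, and the hand-written rule are invented for illustration and are not the thesis's actual SR method.

```python
# Illustrative sketch only: features, rule, and training data are hypothetical.
from sklearn.svm import SVC

# Each ambiguous occurrence is encoded as a small numeric feature vector,
# e.g. (left-context POS id, right-context POS id, in-dictionary flag).
X_train = [
    [1, 2, 1],
    [1, 3, 1],
    [2, 2, 0],
    [3, 1, 0],
]
# Label 1 = keep the span as a single word, 0 = split it into two words.
y_train = [1, 1, 0, 0]

clf = SVC(kernel="rbf", gamma="scale")
clf.fit(X_train, y_train)


def keep_as_one_word(features, left_pos, right_pos):
    """Combine a hand-written POS rule with the SVM decision (hypothetical rule)."""
    # Example rule: if both neighbours are nouns, always split the span.
    if left_pos == "n" and right_pos == "n":
        return False
    return bool(clf.predict([features])[0])


print(keep_as_one_word([1, 2, 1], left_pos="v", right_pos="n"))
```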
Keywords/Search Tags: Natural Language Processing, Word Segmentation, Crossing ambiguity, Combinatorial ambiguity, Hidden Markov Model, Part of Speech Tagging, Support Vector Machine