
Study On Disambiguation Algorithm For Chinese Word Segmentation

Posted on: 2006-04-02 | Degree: Master | Type: Thesis
Country: China | Candidate: Y Z Liu | Full Text: PDF
GTID: 2168360155972930 | Subject: Computer software and theory
Abstract/Summary:
Natural Language Processing (NLP) is both an important direction in computer science and an important branch of Artificial Intelligence, and Chinese word segmentation is the foundation of Chinese NLP. Among the many factors that have hindered the development of Chinese word segmentation, the ambiguity problem is the most important. Focusing on ambiguity in Chinese word segmentation, this thesis examines the merits and shortcomings of existing disambiguation approaches and finds that they do not meet the needs of practical applications. Building on this analysis, two disambiguation algorithms are proposed; together they form an independent module that is applied in an "Automatic Chinese Word Segmentation System".

The thesis first gives a brief introduction to NLP. It then points out that, unlike English, written Chinese has no natural segmentation marks between words, which is one of the biggest differences between Chinese and English in NLP research; an input text must therefore pass through an automatic segmentation stage before any further processing. With the development of automatic segmentation, more and more researchers have focused on the ambiguity problem. The main part of the thesis studies disambiguation algorithms. Methods for recognizing ambiguity are introduced first; after analyzing the advantages and insufficiencies of these methods, two algorithms are presented. The first, denoted HB, is based on Hidden Markov Models and a word bigram model and handles crossing ambiguity. Its key idea is to combine the word bigram with a part-of-speech (POS) bigram, so that POS information provides a new way to resolve crossing ambiguity. The second, denoted SR, is based on Support Vector Machines (SVM) and rules and handles combinatorial ambiguity. Its key idea is to resolve combinatorial ambiguity by combining SVM classification with part-of-speech rules.

In the final part of the thesis, the two algorithms are applied in the "Automatic Chinese Word Segmentation System" developed by our team. In open and closed tests on various types of Chinese corpora, the two algorithms are compared with the ICTCLAS system developed by the Chinese Academy of Sciences. The experimental results show that segmentation accuracy on ambiguous strings is greatly improved, which demonstrates that the two algorithms are feasible for resolving ambiguity. The thesis closes with a conclusion and suggestions for future research.
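To make the HB idea concrete, the sketch below scores competing segmentations of an ambiguous span by interpolating word-bigram and POS-bigram log-probabilities. This is only an illustration of the general technique named in the abstract: the probability tables, the back-off value, the interpolation weight `lam`, and the function names are hypothetical placeholders, not the author's actual HB implementation, whose details are not given in this record.

```python
# Illustrative sketch only: probability values, back-off, and the linear
# interpolation in log space are assumptions, not the thesis's HB algorithm.

# Toy word-bigram and POS-bigram log-probability tables (hypothetical values).
WORD_BIGRAM_LOGP = {
    ("<s>", "w1"): -1.2, ("w1", "w3"): -2.0, ("w3", "</s>"): -0.8,
    ("<s>", "w2"): -1.5, ("w2", "w4"): -3.1, ("w4", "</s>"): -0.9,
}
POS_BIGRAM_LOGP = {
    ("<s>", "n"): -0.7, ("n", "v"): -1.0, ("v", "</s>"): -0.6,
    ("<s>", "v"): -1.1, ("v", "n"): -1.3, ("n", "</s>"): -0.5,
}
UNSEEN_LOGP = -10.0  # crude back-off score for unseen bigrams


def score_candidate(words, tags, lam=0.6):
    """Score one segmentation candidate by interpolating word-bigram and
    POS-bigram log-probabilities; lam weights the word model."""
    padded_w = ["<s>"] + list(words) + ["</s>"]
    padded_t = ["<s>"] + list(tags) + ["</s>"]
    word_lp = sum(WORD_BIGRAM_LOGP.get(bg, UNSEEN_LOGP)
                  for bg in zip(padded_w, padded_w[1:]))
    pos_lp = sum(POS_BIGRAM_LOGP.get(bg, UNSEEN_LOGP)
                 for bg in zip(padded_t, padded_t[1:]))
    return lam * word_lp + (1.0 - lam) * pos_lp


def resolve_crossing_ambiguity(candidates, lam=0.6):
    """Return the highest-scoring (words, tags) candidate for an ambiguous span."""
    return max(candidates, key=lambda c: score_candidate(c[0], c[1], lam))


if __name__ == "__main__":
    # Two competing segmentations of the same ambiguous fragment.
    candidates = [
        (("w1", "w3"), ("n", "v")),
        (("w2", "w4"), ("v", "n")),
    ]
    print(resolve_crossing_ambiguity(candidates))
```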
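Similarly, the SR algorithm is described only as combining SVM classification with POS rules to decide combinatorial ambiguity (whether an ambiguous span is one word or should be split). The sketch below shows one plausible shape of such a decision, assuming scikit-learn's SVC; the feature encoding, the training data, and the hand-written rule are invented for illustration and are not the thesis's actual SR method.

```python
# Illustrative sketch only: features, rule, and training data are hypothetical.
from sklearn.svm import SVC

# Each ambiguous occurrence is encoded as a small numeric feature vector,
# e.g. (left-context POS id, right-context POS id, in-dictionary flag).
X_train = [
    [1, 2, 1],
    [1, 3, 1],
    [2, 2, 0],
    [3, 1, 0],
]
# Label 1 = keep the span as a single word, 0 = split it into two words.
y_train = [1, 1, 0, 0]

clf = SVC(kernel="rbf", gamma="scale")
clf.fit(X_train, y_train)


def keep_as_one_word(features, left_pos, right_pos):
    """Combine a hand-written POS rule with the SVM decision (hypothetical rule)."""
    # Example rule: if both neighbours are nouns, always split the span.
    if left_pos == "n" and right_pos == "n":
        return False
    return bool(clf.predict([features])[0])


print(keep_as_one_word([1, 2, 1], left_pos="v", right_pos="n"))
```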
Keywords/Search Tags: Natural Language Processing, Word Segmentation, Crossing ambiguity, Combinatorial ambiguity, Hidden Markov Model, Part of Speech Tagging, Support Vector Machine