The Design And Implementation Of Chinese Word Segmentation System

Posted on: 2013-07-02  Degree: Master  Type: Thesis
Country: China  Candidate: G Yu  Full Text: PDF
GTID: 2248330374985664  Subject: Computer software and theory
Abstract/Summary:
The rapid development of computer technology and the emergence of application requirements at many levels have led to a sharp increase in the amount of data on the network. Chinese has a very large number of users, but how can we extract the information we want from a flood of Chinese text? The first step is clearly to let the computer "understand" human language. In Chinese, the word is the smallest language unit that carries independent meaning. Correct word segmentation is therefore the first step of Chinese natural language processing, and also a crucial one: only after this difficulty is overcome can we discuss deeper levels of Chinese information processing.

Current segmentation methods fall into three kinds: rule-based segmentation, statistics-based segmentation, and understanding-based segmentation. Each has its own advantages and disadvantages. This thesis analyzes the pros and cons of the existing techniques and then adopts a method that combines statistics-based and rule-based segmentation, so that the strengths of each make up for the weaknesses of the other. Combining the advantages of the Hidden Markov Model with dictionary rules improves the efficiency and accuracy of word segmentation. The two important and difficult points, however, are disambiguation and unknown word recognition.

In the first processing step, this thesis uses an improved shortest path algorithm. It follows the principle "Say yes if you know; say no if you don't know": a word is split off only when the system is very sure of it, and when it cannot decide, it does no processing at all. This step retains as many possibilities as possible and passes the current result on to the follow-up steps, which work at different levels to solve the problem gradually.
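
As a rough illustration of shortest-path segmentation, the sketch below picks the split that uses the fewest words from a dictionary, with single characters as a fallback. It is not the thesis's improved algorithm, whose details the abstract does not give. The function name `shortest_path_segment`, the `max_word_len` parameter, and the toy dictionary are all assumptions made for this example.

```python
def shortest_path_segment(sentence, dictionary, max_word_len=4):
    """Segment `sentence` into the fewest words (plain fewest-words baseline)."""
    n = len(sentence)
    # best[i] = minimum number of words needed to cover sentence[:i]
    best = [0] + [float("inf")] * n
    # back[i] = start index of the last word on the best path ending at i
    back = [0] * (n + 1)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            word = sentence[start:end]
            # Single characters are always allowed, so every sentence has a path.
            if len(word) == 1 or word in dictionary:
                if best[start] + 1 < best[end]:
                    best[end] = best[start] + 1
                    back[end] = start
    # Walk the back-pointers from the end of the sentence to recover the words.
    words = []
    i = n
    while i > 0:
        words.append(sentence[back[i]:i])
        i = back[i]
    return words[::-1]


# Hypothetical toy dictionary, for illustration only.
toy_dictionary = {"研究", "研究生", "生命", "起源", "的"}
print(shortest_path_segment("研究生命的起源", toy_dictionary))
```

When two paths use the same number of words, this sketch keeps the first one it finds. A system following the "say no if you don't know" principle would instead leave such ties undecided and hand them to later stages.
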
The aim of the initial step is to make the final segmentation result the optimal one, which is also an embodiment of the Maximum Entropy Principle.

Unknown word recognition covers Chinese personal names, place names, and Chinese transliterations of foreign names. Chinese is rich and names are highly varied: although a Chinese name consists of a family name and a given name, which can be regarded as a stable pattern, the choice of characters is still very arbitrary, and it is difficult to identify all names with traditional rules or methods. Place names are relatively fixed, and the most commonly used transliterations of foreign names can be collected by survey, so adding them to the dictionary before processing is enough to identify them. The thesis therefore focuses on the difficulty of identifying personal names and presents a statistical model based on context. It stems from the observation that Chinese names often play a particular role when they appear in a sentence, so the author uses this additional information, namely how strongly a candidate adheres to its prefix and suffix, to further determine whether it should be recognized as a name.

In ambiguity elimination, ambiguity is divided into two kinds, semantic ambiguity and interpretive ambiguity, and the two main problems to solve are crossed ambiguity and combined ambiguity. In general, crossed ambiguity can be segmented well on the basis of the ambiguous fields alone. Compared with crossed ambiguity, combined ambiguity needs more contextual information, and sometimes the decision must be based on the entire sentence. The maximum entropy model is a probabilistic model that combines contextual information, so it is well suited to eliminating combined ambiguity.

The thesis also introduces the overall architecture of the system and the functions of each part.
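
To make the notion of crossed (overlapping) ambiguity concrete, the sketch below scans a sentence for pairs of dictionary words whose spans overlap, which marks the ambiguous fields a disambiguation step would then have to resolve. It only detects candidates and does not resolve them. The function name `find_crossed_ambiguities`, the `max_word_len` parameter, and the toy dictionary are assumptions for illustration, not part of the thesis.

```python
def find_crossed_ambiguities(sentence, dictionary, max_word_len=4):
    """Return (left_word, right_word, start) for each pair of overlapping words."""
    n = len(sentence)
    # Collect every multi-character dictionary word occurrence as a (start, end) span.
    spans = []
    for start in range(n):
        for end in range(start + 2, min(n, start + max_word_len) + 1):
            if sentence[start:end] in dictionary:
                spans.append((start, end))
    hits = []
    for a_start, a_end in spans:
        for b_start, b_end in spans:
            # Crossed: b starts strictly inside a and ends strictly after a.
            if a_start < b_start < a_end < b_end:
                hits.append((sentence[a_start:a_end], sentence[b_start:b_end], a_start))
    return hits


# Hypothetical toy dictionary, for illustration only.
toy_words = {"结合", "合成", "成分", "分子"}
for left, right, pos in find_crossed_ambiguities("结合成分子", toy_words):
    print(pos, left, right)
```

Combined ambiguity, where a string is both a word and a sequence of shorter words, cannot be found by span overlap alone. As the abstract notes, it needs wider context.
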
The experimental results show that the algorithm of the initial segmentation stage achieves good results, but the unknown word recognition stage still needs further experiments, because a well-annotated unknown word dictionary could not be obtained. In general, the system can complete normal segmentation and achieves the desired effect.
Keywords/Search Tags: Chinese word segmentation, segmentation method, unknown word recognition, ambiguity elimination