Chinese Word Segmentation Based On Statistic And Dictionary

Posted on: 2006-07-10
Degree: Master
Type: Thesis
Country: China
Candidate: F W Di
GTID: 2168360155954639
Subject: Software engineering
Abstract/Summary:
Chinese word segmentation is a basic part of Chinese information processing, and how well it is solved directly affects the prospects of the whole field. Chinese information processing should be studied at three levels: the character level, the word level and the sentence level. Chinese characters have been studied extensively and can by and large be handled. At the word level much work has also been done and many theories have been developed, but words still cannot be handled accurately. At the sentence level some studies have been attempted and a few theories developed, so a great deal remains to be done. In Chinese, the word is the smallest unit of the language. Sentence-level research cannot be carried out properly without a good solution at the word level, so going further at the sentence level requires doing better at the word level first: the word level is the foundation of the sentence level. The most important task at the word level is to make the computer recognize every word in a sentence. In English, texts are delimited by spaces, so it is easy for the computer to process them word by word. In Chinese, however, a sentence is a sequence of individual characters with no separator between them. We must therefore find a way to locate the words in a text, which is what is called "Chinese word segmentation". Research on Chinese word segmentation has been going on for more than 20 years and has produced many theories and methods, but none of them can accurately segment all kinds of text; all that can be done is to keep improving on them. Automatic segmentation methods fall into two categories: dictionary-based methods and statistics-based methods, possibly combined with rules.
Building on earlier research, this thesis combines the two methods with a small number of rules. The method can identify person names and place names, and can resolve overlapping (intersection-type) ambiguity and, under some conditions, ambiguity caused by the multiple meanings of natural language. The thesis adopts the forward and reverse Maximal Matching algorithms. First, the text is processed using the dictionary. Second, a statistical method is used to handle the single-character words that remain. Third, the results of forward and reverse Maximal Matching are compared to decide, following earlier research, whether an ambiguity is present. Statistics and rules are then used to eliminate the ambiguity. Because reverse Maximal Matching is generally more accurate, its result is given first priority. The thesis makes two improvements. First, we restructured the dictionary's data structure so that dictionary matching, which is a key problem of the Maximal Matching algorithm, becomes much faster. Some research has already been done on restructuring the dictionary's data structure, such as "A Chinese parsing method based on word", which proposed a tree-based data structure for the dictionary. Compared with that structure, ours is faster to process and easier to implement. Second, we use statistics to handle runs of single-character words, which may be person names or new words. Meanwhile we combine...
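The comparison of forward and reverse Maximal Matching described above can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function names, the plain `set` used as the dictionary, and the toy word list are all assumptions made for illustration, and the statistical and rule-based disambiguation steps are not reproduced. When the two passes disagree, the sketch only flags the sentence as ambiguous and returns the reverse result, reflecting the stated priority.

```python
def forward_max_match(text, dictionary, max_len):
    """Scan left to right, always taking the longest dictionary word."""
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is accepted even if it is not in the dictionary.
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


def reverse_max_match(text, dictionary, max_len):
    """Scan right to left, always taking the longest dictionary word."""
    words = []
    j = len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in dictionary:
                words.append(piece)
                j -= size
                break
    words.reverse()  # words were collected from the end of the text
    return words


def segment(text, dictionary):
    """Return (segmentation, ambiguous). Prefers the reverse result.

    Assumption: disagreement between the two passes is treated as the
    ambiguity signal; no statistics or rules are applied here.
    """
    max_len = max((len(w) for w in dictionary), default=1)
    fwd = forward_max_match(text, dictionary, max_len)
    rev = reverse_max_match(text, dictionary, max_len)
    return rev, fwd != rev


# Hypothetical toy dictionary, not taken from the thesis.
toy_dictionary = {"研究", "研究生", "生命", "起源"}
result, ambiguous = segment("研究生命起源", toy_dictionary)
print(result, ambiguous)
```

With this toy dictionary the forward pass takes "研究生" first and leaves "命" stranded as a single character, while the reverse pass yields "研究 / 生命 / 起源"; the mismatch is what marks the sentence as containing an overlapping ambiguity.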
Keywords/Search Tags: Segmentation