Research on Key Technologies of Chinese Word Segmentation

Posted on: 2010-07-26
Degree: Master
Type: Thesis
Country: China
Candidate: W F Cao
GTID: 2208360275498894
Subject: Computer application technology
Abstract/Summary:
Chinese Word Segmentation (CWS) is the process of converting a sequence of Chinese characters into a sequence of Chinese words according to a set of rules. As a fundamental component of Chinese information processing, CWS is widely used in related areas, so research on CWS has both theoretical and practical significance. This thesis studies dictionary organization, ambiguity resolution, and the segmentation algorithm.

Based on the statistical characteristics of Chinese words, the algorithm in this thesis combines a dictionary with statistics. The core dictionary is organized with both time and space efficiency in mind: the first two characters of each word are indexed through a hash table, forming a trie of depth 2, while the remaining characters of each word are stored in sorted order. Ambiguity resolution and unknown word identification are the two main difficulties in CWS. This thesis focuses on resolving overlapping ambiguity and presents a new way to locate overlapping ambiguous segmentation positions: all candidate words are organized as a two-dimensional segmentation graph, and if there are words both above and to the right of an atomic character, an overlapping ambiguity exists at that position. Double-character coupling and the difference of t-test values are then used to decide the segmentation position. Finally, all candidate words and their distances are converted into a weighted directed acyclic graph, and the best segmentation is obtained by computing the shortest path between the start node and the end node of the graph.

The experiments were run on a PC with a 2.0 GHz Pentium 4 CPU and 256 MB of RAM. The results show that the algorithm reaches a speed of 35,000 characters per second, a precision of 97.2%, and a recall of 93.7%. This performance can meet the demands of most application systems.
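
The dictionary layout and the shortest-path step described above can be illustrated with a minimal Python sketch. It is not the thesis implementation: the class name TwoCharIndexDict, the function segment, the toy word frequencies, the max_len limit, and the use of negative log relative frequency as the edge length are all assumptions made for illustration, and the overlapping-ambiguity step (double-character coupling and t-test difference) is not shown.

```python
import bisect
import math
from collections import defaultdict


class TwoCharIndexDict:
    """Toy dictionary (hypothetical): hash on the first two characters,
    with the remaining characters of each word kept in a sorted list."""

    def __init__(self, word_freqs):
        self.single = {}                # single-character words -> frequency
        self.index = defaultdict(list)  # first two chars -> [(rest, frequency)]
        for word, freq in word_freqs.items():
            if len(word) == 1:
                self.single[word] = freq
            else:
                self.index[word[:2]].append((word[2:], freq))
        for bucket in self.index.values():
            bucket.sort()               # sorted order enables binary search
        self.total = sum(word_freqs.values())

    def freq(self, word):
        """Return the stored frequency of word, or 0 if it is not listed."""
        if len(word) == 1:
            return self.single.get(word, 0)
        bucket = self.index.get(word[:2])
        if not bucket:
            return 0
        rest = word[2:]
        i = bisect.bisect_left(bucket, (rest,))
        if i < len(bucket) and bucket[i][0] == rest:
            return bucket[i][1]
        return 0


def segment(text, dictionary, max_len=4):
    """Shortest path over the DAG whose nodes are character boundaries and
    whose edges are candidate words (assumed cost: -log relative frequency)."""
    n = len(text)
    best = [math.inf] * (n + 1)  # best[j]: cheapest cost to reach boundary j
    back = [0] * (n + 1)         # back[j]: start boundary of the last word
    best[0] = 0.0
    for i in range(n):           # boundaries are already in topological order
        for j in range(i + 1, min(n, i + max_len) + 1):
            f = dictionary.freq(text[i:j])
            if f == 0:
                if j - i > 1:
                    continue     # unknown multi-character string: no edge
                f = 0.5          # assumed fallback so single chars stay connected
            cost = -math.log(f / dictionary.total)
            if best[i] + cost < best[j]:
                best[j] = best[i] + cost
                back[j] = i
    words = []
    j = n
    while j > 0:                 # walk the back-pointers from the end node
        words.append(text[back[j]:j])
        j = back[j]
    return words[::-1]


# Toy frequencies, invented for this example only.
toy = TwoCharIndexDict({
    "研究": 50, "研究生": 20, "生命": 40, "起源": 30,
    "研": 2, "究": 2, "生": 10, "命": 5, "起": 3, "源": 3,
})
print(segment("研究生命起源", toy))
```

With these invented frequencies, the path 研究 / 生命 / 起源 is cheaper than 研究生 / 命 / 起源, so the overlapping string 研究生命 is split after 研究. Because the graph is acyclic and its nodes are already in left-to-right order, a single forward pass is enough to find the shortest path.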
Keywords/Search Tags: Chinese Word Segmentation, Hash Index, Probability and Statistics, Shortest Path Algorithm