
Research Of Combined Chinese Word Segmentation Method

Posted on: 2015-01-07
Degree: Master
Type: Thesis
Country: China
Candidate: H Li
Full Text: PDF
GTID: 2268330428997157
Subject: Computer Science and Technology
Abstract/Summary:
With the development of computer technology, society has entered an information age centered on the network. As the volume of information grows rapidly, how to obtain and make use of relevant information has become a concern for individuals, businesses, and governments. In this environment, Chinese information processing has become an active area of research and development, and one of its most important components is Chinese word segmentation. Chinese word segmentation is the process of dividing a sequence of Chinese characters, which has no explicit delimiters, into a sequence of meaningful words appropriate to the context. It is the foundation of Chinese information processing and also a bottleneck restricting its development.
Ambiguity resolution and unknown word recognition are the main difficulties of Chinese word segmentation, and they are important factors affecting its speed and precision. In recent years, many segmentation methods have been proposed to improve speed and accuracy. The improvements fall mainly into two categories. The first is improvement of the segmentation dictionary, which raises segmentation speed mainly by reducing the number of matches between the text to be segmented and the dictionary. The second is improvement of the segmentation algorithm itself, which strengthens the system's ability to handle ambiguity and recognize unknown words.
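To make "the number of matches between the text and the dictionary" concrete, here is a minimal sketch of plain forward maximum matching that counts dictionary lookups. It is a generic baseline, not the method proposed in the thesis; the function name, the `max_len` parameter, and the four-word dictionary are assumptions chosen for illustration.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy forward maximum matching.

    Returns (words, lookups), where lookups counts how many multi-character
    candidates were checked against the dictionary.
    """
    words = []
    lookups = 0
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink one character at a time.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1:
                matched = True  # a single character is always accepted
            else:
                lookups += 1
                matched = candidate in dictionary
            if matched:
                words.append(candidate)
                i += size
                break
    return words, lookups


# Hypothetical toy dictionary, for illustration only.
toy_dictionary = {"研究", "研究生", "生命", "起源"}
words, lookups = forward_max_match("研究生命起源", toy_dictionary, max_len=3)
print(words, lookups)
```

On this input the greedy match takes 研究生 first and leaves 命 stranded, although 研究 / 生命 / 起源 is the intended reading. This is a typical crossing ambiguity, and the lookup counter shows the cost that dictionary improvements try to reduce.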
This paper combines the two kinds of improvement. Based on the current state of the key technologies, it designs a Chinese word segmentation method that combines dictionary, statistical, and rule-based techniques and that can detect and resolve ambiguity and recognize unknown words. The paper studies the dictionary mechanism and the segmentation algorithm in detail and proposes a solution covering three aspects.
The first is an improved dictionary mechanism. The improved dictionary targets two characteristics of Chinese: the center of a word tends to fall toward its end, and two-character words account for a large proportion of text. It uses a double-character hash table structure (a first-character hash table and a last-character hash table) to achieve fast entry matching without increasing the space or maintenance complexity of existing typical dictionaries.
The second is the detection and resolution of ambiguity. At present, ambiguity detection mostly uses the bidirectional maximum matching algorithm, but because bidirectional matching requires many matching operations, the backtracking-forward maximum matching algorithm was proposed. That algorithm backtracks by one character after a match to detect crossing ambiguity of chain length 1 involving two-character words, which reduces the number of matches needed for detection. It has two defects, however. First, it detects only crossing ambiguity of chain length 1 involving two-character words and cannot identify other types of chain-length-1 crossing ambiguity or crossing ambiguity of chain length 2, so its detection ability is limited. Second, it also performs disambiguation on fields in which no crossing ambiguity occurs, which causes repeated matching.
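The two ideas in this passage, a dictionary indexed by first and last character and a one-character backtrack that spots crossing ambiguity, can be sketched roughly as follows. This is an illustration under stated assumptions, not the data structure or algorithm of the thesis: the class `DoubleHashDictionary`, its method names, the function `find_crossing_ambiguity`, and the toy word list are all hypothetical.

```python
from collections import defaultdict


class DoubleHashDictionary:
    """Toy dictionary indexed by each word's first and last character."""

    def __init__(self, words):
        self.by_first = defaultdict(set)
        self.by_last = defaultdict(set)
        for word in words:
            self.by_first[word[0]].add(word)
            self.by_last[word[-1]].add(word)

    def longest_prefix_word(self, text, start):
        """Longest entry beginning at text[start]; falls back to one character."""
        best = text[start]
        for word in self.by_first.get(text[start], ()):
            if len(word) > len(best) and text.startswith(word, start):
                best = word
        return best

    def longest_suffix_word(self, text, end):
        """Longest entry ending just before index end; falls back to one character."""
        best = text[end - 1]
        for word in self.by_last.get(text[end - 1], ()):
            if len(word) > len(best) and text.endswith(word, 0, end):
                best = word
        return best


def find_crossing_ambiguity(text, dictionary):
    """Forward matching with a one-character backtrack after each match.

    Returns (matched_word, rival_word) pairs where the rival starts on the
    last character of the matched word and runs past its end.
    """
    found = []
    i = 0
    while i < len(text):
        word = dictionary.longest_prefix_word(text, i)
        end = i + len(word)
        if len(word) > 1 and end < len(text):
            # Back up one character: does another entry straddle the boundary?
            rival = dictionary.longest_prefix_word(text, end - 1)
            if len(rival) > 1:
                found.append((word, rival))
        i = end
    return found


# Hypothetical toy word list, for illustration only.
toy = DoubleHashDictionary({"研究", "研究生", "生命", "起源"})
print(find_crossing_ambiguity("研究生命起源", toy))
```

Because only entries sharing the current first (or last) character are inspected, each lookup touches a small bucket instead of the whole word list. The detector here backs up exactly one character, so it sees only overlaps of a single character, which mirrors the kind of limited coverage the passage criticizes.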
To address these two defects, the paper adds to this algorithm a detection module for crossing ambiguity of chain length 1 involving three-character words. The improved algorithm can identify crossing ambiguity of chain length 1 and also of chain length 2. It additionally uses a counting method, together with a combination of rules and statistics, to perform centralized disambiguation on fields in which crossing ambiguities occur consecutively. Centralized disambiguation avoids repeated matching on fields in which no crossing ambiguity occurs, and so reduces the time complexity of the algorithm.
The third is the recognition of unknown words. Building on the improved algorithm, the paper identifies unknown words by combining the probability model of an existing recognition mechanism with rules.
Test results on a large corpus show that the proposed combined Chinese word segmentation algorithm not only improves segmentation accuracy but also identifies unknown words. The system achieved satisfactory overall performance.
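The abstract does not say which statistic is used when rules and statistics are combined to resolve a crossing ambiguity. As a stand-in, the sketch below scores each candidate segmentation by summed log relative word frequency, a common unigram heuristic. The function names, the `unseen` pseudo-count, and the frequency table are assumptions made up for illustration, not data or methods from the thesis.

```python
import math


def score(words, counts, total, unseen=0.5):
    """Sum of log relative frequencies; unseen words get a small pseudo-count."""
    return sum(math.log(counts.get(word, unseen) / total) for word in words)


def pick_segmentation(candidates, counts):
    """Return the candidate segmentation with the highest unigram score."""
    total = sum(counts.values())
    return max(candidates, key=lambda words: score(words, counts, total))


# Hypothetical frequency counts, invented for this example.
toy_counts = {"研究": 50, "研究生": 20, "生命": 40, "命": 2, "起源": 10}
candidates = [
    ["研究生", "命", "起源"],
    ["研究", "生命", "起源"],
]
print(pick_segmentation(candidates, toy_counts))
```

With these invented counts the second candidate wins, because the rare single character 命 drags down the score of the first. A real system would estimate the counts from a corpus and would add rules on top of the score.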
Keywords/Search Tags: Chinese word segmentation, chain length, backtracking-forward maximum matching algorithm, crossing ambiguity, unknown word