Research And Implementation Of Chinese Segmentation System Based On Conditional Random Fields Model

Posted on: 2016-09-27  Degree: Master  Type: Thesis
Country: China  Candidate: G Yang  Full Text: PDF
GTID: 2308330479484741  Subject: Computer technology
Abstract/Summary:
With the arrival of the big data era, searching and processing all kinds of information efficiently has become an important topic in the army's current information construction. The foundation of all such processing is Chinese word segmentation, whose main purpose is to convert a Chinese sentence into the correct sequence of words that a computer can recognize and understand. This provides the most basic material for subsequent information processing and completes the conversion from natural language into a form the computer can handle. Without accurate and efficient segmentation results, accurate and automatic information processing is impossible. Research on Chinese word segmentation that improves segmentation accuracy and speed is therefore of great significance.

This thesis first introduces the state of research on Chinese word segmentation and analyzes the development, advantages, and disadvantages of three kinds of segmentation techniques. Guided by the goal of automating information processing, it proposes a segmentation method that combines the advantages of rule-based and statistical approaches. It then introduces and analyzes statistical segmentation models, with conditional random fields (CRF) as the representative. The proposed method combines the speed of mechanical (dictionary-based) word segmentation with the ability of the CRF statistical model to recognize unknown words. The decoding process of the CRF model is also improved: it is combined with the segmentation dictionary, and the omni-word segmentation method is introduced, which improves decoding efficiency.
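The abstract does not spell out the mechanical segmentation algorithm, so the following is only a minimal sketch of one common form, forward maximum matching. The function name `forward_max_match`, the `max_len` parameter, and the toy dictionary are assumptions made for illustration and are not taken from the thesis.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy dictionary segmentation (a hypothetical helper, not the thesis code).

    At each position, take the longest dictionary word that starts there;
    fall back to a single character when nothing matches.
    """
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until something matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Toy dictionary, invented for this example.
toy_dictionary = {"研究", "研究生", "生命", "起源"}
print(forward_max_match("研究生命的起源", toy_dictionary))
# prints ['研究生', '命', '的', '起源']
```

The greedy match is fast but picks 研究生 ("graduate student") where 研究 / 生命 ("study" / "life") is intended, and it can only split unknown words into single characters. These are the kinds of ambiguity and out-of-vocabulary weaknesses that motivate pairing a dictionary method with a statistical model such as a CRF.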
The word segmentation module is the core of the segmentation system. To address the weakness of mechanical word segmentation in handling ambiguity, the segmentation algorithm is improved by introducing a "nibble" strategy, which resolves some ambiguities and improves segmentation accuracy. The system as a whole allows both the dictionary and the statistical model to evolve as the corpus grows richer. In this way, automatic and intelligent information processing obtains a preliminary implementation at its first stage, Chinese word segmentation.

Finally, a test environment with real data is built on the Bakeoff 2005 standard corpus, and a closed test is carried out. The experiments verify the effectiveness of the segmentation system and compare it with other systems. The results show that the system segments quickly and with high accuracy, meets its design goals, and can be put into practical use.
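The abstract does not state which accuracy measures the closed test reports. Segmentation output is commonly scored by word-level precision, recall, and F1, and the sketch below assumes that convention. The helper names `to_spans` and `score` and the example sentences are hypothetical.

```python
def to_spans(words):
    """Convert a word list into a set of (start, end) character spans."""
    spans = set()
    start = 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def score(gold_words, predicted_words):
    """Word-level precision, recall and F1 (assumed metric, not from the thesis).

    A predicted word counts as correct only if both of its boundaries
    match a word in the gold segmentation.
    """
    gold = to_spans(gold_words)
    pred = to_spans(predicted_words)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


# Invented example: a gold segmentation and a system output with one boundary error.
gold = ["研究", "生命", "的", "起源"]
predicted = ["研究生", "命", "的", "起源"]
print(score(gold, predicted))
# prints (0.5, 0.5, 0.5)
```

Comparing spans instead of raw strings means a single misplaced boundary penalizes both words it touches, which is why one error in a four-word sentence halves every score here.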
Keywords/Search Tags:Chinese word segmentation, Conditional random fields, Out Of Vocabulary, Ambiguous processing