
Design And Implementation Of Chinese Word Segmentation Model Based On Combination Of Statistics And Rules

Posted on: 2014-01-29
Degree: Master
Type: Thesis
Country: China
Candidate: H He
Full Text: PDF
GTID: 2248330398475397
Subject: Computer software and theory
Abstract/Summary:
With the popularization of information technology, people's work and learning have become inseparable from network information. As the scale of that information keeps growing, how to obtain relevant Chinese information efficiently and accurately has become an issue of concern. Chinese word segmentation is an important part of Chinese information processing, and the precision of a segmentation system directly affects the efficiency of Chinese information processing and understanding. Research in this area is therefore of great significance.

First, this thesis describes the research background and significance of Chinese word segmentation, analyzes the basic principles, advantages, and disadvantages of three commonly used segmentation methods, and discusses two technical difficulties: ambiguity recognition and unknown word recognition. It then explains in detail the causes and classification of ambiguity and the popular methods for detecting and resolving it. The classification of unknown words and the main methods for identifying them are also examined, and the statistical models used in this thesis are briefly introduced.

Next, Chinese word segmentation based on the Cascaded Hidden Markov Model (CHMM) and the Augmented Transition Network (ATN) is studied in depth. A segmentation framework that combines CHMM with ATN syntactic analysis is proposed, and a prototype system based on this framework is implemented.
Specifically, the system combines a statistics-based N-shortest-path pre-segmentation model with ATN syntactic analysis to resolve ambiguous segmentations, applies simple rules to identify numerals and time words, and uses a role-based unknown word recognition method to identify Chinese person names and place names. The identified unknown words then compete with the other candidate words, and a class-based HMM segmentation model is built to obtain the globally optimal segmentation sequence and tag it with parts of speech (POS).

Finally, three sets of segmentation experiments are carried out. For the ambiguity experiment, 100 sentences were collected as an ambiguity corpus and used in a comparison: the proposed system analyzes 83 of them correctly, whereas the chosen domestic segmentation system analyzes 75 correctly. For the open test, randomly selected test corpora from six different fields are used. For the comparative experiment, a small portion of the 1998 People's Daily corpus is used as the test corpus, and the segmentation results of the proposed system and the chosen segmentation system are compared and analyzed.

The experimental results indicate that the proposed system resolves ambiguity more precisely than the chosen segmentation system. On the randomly selected test corpora from six different fields, the average segmentation accuracy rate, recall rate, and F-measure are 94.28%, 96.25%, and 95.25%, respectively. The comparative experiment shows that the recall rate of the proposed system is slightly higher than that of the chosen segmentation system, while the overall segmentation accuracy rate is on par with it.
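
The abstract does not say how the three scores are computed. The sketch below assumes the usual convention for word segmentation: "accuracy rate" is taken to mean precision (correct words / words output by the system), recall is correct words / words in the reference, and the F-measure is their harmonic mean. The function names and the example sentence are illustrative only and are not taken from the thesis.

```python
def to_spans(words):
    """Turn a word list into a set of (start, end) character offsets."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def f_measure(precision, recall):
    """Balanced F-measure (harmonic mean); assumed, not stated in the abstract."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def evaluate(gold_words, pred_words):
    """Return (precision, recall, F) for one segmented sentence.

    A predicted word counts as correct only if both of its boundaries
    match a word in the reference segmentation.
    """
    gold, pred = to_spans(gold_words), to_spans(pred_words)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall, f_measure(precision, recall)


# Hypothetical example: an overlapping ambiguity segmented incorrectly.
gold = ["研究", "生命", "的", "起源"]
pred = ["研究生", "命", "的", "起源"]
precision, recall, f = evaluate(gold, pred)
print(f"P={precision:.2%} R={recall:.2%} F={f:.2%}")
```

Under this assumption, `f_measure(94.28, 96.25)` rounds to 95.25, which agrees with the reported average F-measure. Strictly, an average of per-corpus F values need not equal the F-measure of the averaged precision and recall, so the agreement is a plausibility check only.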
Keywords/Search Tags: Chinese Word Segmentation, Augmented Transition Network (ATN), Hidden Markov Model (HMM), Ambiguity Recognition, Unknown Word Recognition