Design And Implementation Of Chinese Word Segmentation Model Based On Combination Of Statistics And Rules

Posted on:2014-01-29

Degree:Master

Type:Thesis

Country:China

Candidate:H He

Full Text:PDF

GTID:2248330398475397

Subject:Computer software and theory

Abstract/Summary:


With the popularization of information technology, people's work and learning have become inseparable from network information. As the scale of that information keeps expanding, obtaining relevant Chinese information efficiently and accurately has become a growing concern. Chinese word segmentation is an important part of Chinese information processing, and the precision of a segmentation system directly affects the efficiency of Chinese information processing and understanding. Research in this area is therefore of great significance.

First, this paper describes the research background and significance of Chinese word segmentation, analyzes the basic principles, advantages, and disadvantages of three commonly used segmentation methods, and discusses two technical difficulties: ambiguity recognition and unknown word recognition. The causes and classification of ambiguity, and the popular methods for detecting and resolving it, are expounded in detail. The classification of unknown words and the main methods for identifying them are also examined, and the statistical models used in this paper are introduced briefly.

Next, Chinese word segmentation based on the Cascaded Hidden Markov Model (CHMM) and the Augmented Transition Network (ATN) is studied in depth. A segmentation framework that combines CHMM with ATN syntactic analysis is proposed, and a prototype segmentation system based on this framework is implemented. Specifically, the system combines a statistics-based N-shortest-path pre-segmentation model with ATN syntactic analysis to resolve ambiguous segmentations, applies simple rules to identify numerals and time words, and uses a role-based unknown word recognition method to identify Chinese person names and place names. The system then lets the identified unknown words compete with the other candidate words, builds a class-based HMM segmentation model to obtain the globally optimal segmentation sequence, and tags that sequence with parts of speech (POS).

Finally, the system is evaluated in three segmentation experiments. For the ambiguity experiment, 100 sentences were collected as an ambiguity corpus and used to compare the proposed system with a domestic segmentation system chosen as the baseline. The proposed system correctly analyzes 83 of the sentences, whereas the baseline system correctly analyzes 75. For the open test, randomly selected test corpora from six different fields are used. For the comparative experiment, a small portion of the 1998 People's Daily corpus is used as the test corpus, and the segmentation results of the proposed system and the baseline system are compared and analyzed.

The experimental results indicate that the proposed system recognizes ambiguity more precisely than the baseline system. On the randomly selected test corpora from six different fields, the average segmentation accuracy rate, recall rate, and F-measure are 94.28%, 96.25%, and 95.25%, respectively. The comparative experiment shows that the recall rate of the proposed system is slightly higher than that of the baseline system, and that the overall segmentation accuracy rate is on par with it.
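As an illustration of the evaluation measures reported above, the following is a minimal sketch of how segmentation precision, recall, and F-measure are commonly computed. It is not taken from the thesis. It assumes that the "accuracy rate" in the abstract corresponds to precision and that F is the harmonic mean of precision and recall; the function names `spans`, `prf`, and `f_measure` and the example sentence are hypothetical.

```python
def spans(words):
    """Convert a list of words into a set of (start, end) character spans."""
    result, pos = set(), 0
    for word in words:
        result.add((pos, pos + len(word)))
        pos += len(word)
    return result


def f_measure(precision, recall):
    """Harmonic mean of precision and recall (assumed definition of F)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def prf(gold, pred):
    """Score a predicted segmentation against a gold segmentation.

    A predicted word counts as correct only when both of its
    boundaries match a word in the gold segmentation.
    """
    gold_spans, pred_spans = spans(gold), spans(pred)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans)
    recall = correct / len(gold_spans)
    return precision, recall, f_measure(precision, recall)


# Hypothetical ambiguous sentence: only the last word is segmented correctly.
gold = ["研究", "生命", "起源"]
pred = ["研究生", "命", "起源"]
p, r, f = prf(gold, pred)

# Consistency check on the averages quoted in the abstract.
reported_f = round(f_measure(94.28, 96.25), 2)
```

Under these assumptions, applying `f_measure` to the reported averages of 94.28% and 96.25% gives a value that rounds to the reported 95.25%. In general the F-measure of averaged scores need not equal the average of per-corpus F-measures, so this is only a plausibility check.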

Keywords/Search Tags:

Chinese Word Segmentation, Augmented Transition Network (ATN), Hidden Markov Model (HMM), Ambiguity Recognition, Unknown Words Recognition


Related items

1	Research Into Chinese Word Segmentation Based On Statistic And Regulation
2	Comparative Research On Open-Source Chinese Word Segmentation Machines
3	Research And Implementation Of Chinese Word Segmentation Algorithm
4	Research And Application On Chinese Automatic Word Segmentation In Full Text Retrieval
5	The Study Of Maximum-Match-Based Written Chinese Automatic Segmentation
6	Based On Dictionary And Word Frequency Analysis Of The Unknown Words From The Bbs Of Corpus Recognition Research
7	The Research Of Unknown Chinese Work Recognition And Its Application To Chinese Input Method
8	Research On Chinese Word Segmentation Of Search Engine
9	The Research And Implementation Of The Intelligent Words Segmentation In Domain Chinese Understanding And Its Application In Products Designing
10	Research And Implementation Of Chinese Word Segmentation System For Enterprise Information Retrieval