
Research On Chinese Text Chunking Based On Co-training

Posted on: 2006-11-24
Degree: Master
Type: Thesis
Country: China
Candidate: S Y Liu
GTID: 2168360155958180
Subject: Computer system architecture
Abstract/Summary:
Syntactic analysis is a fundamental task in natural language processing. Partial parsing, also called shallow parsing or chunk identification, has become an active research topic in the field. Chunk identification is now widely used in many areas of natural language processing, especially in example-based machine translation (EBMT), where it is one of the major techniques.

With the development of machine learning theory, machine learning methods have become increasingly attractive in natural language processing, especially unsupervised and semi-supervised methods. There are two reasons. First, a labeled training set is the basis of most NLP methods, but labeling a training set by hand is costly and requires many people with strong expert knowledge working intensively. Second, with the advent of the information era and the development of the Internet, the amount of content on the Internet is growing exponentially; this raw data can be obtained freely and used in NLP research.

In this thesis we study the recognition of Chinese chunks with the Co-training method. We give a definition of the Chinese chunk and then discuss the formal definition of the Co-training algorithm under the PAC framework. First, we define the two "views" of the examples by choosing two classifiers based on different algorithmic theories. We then propose an example selection method based on consistency. Two classifiers, Transductive HMM and fnTBL, are combined into a classification system that performs the Chinese text chunking task using a small-scale labeled Chinese treebank and a large-scale unlabeled Chinese corpus. The results are compared with the self-training results, that is, the results of the non-Co-training experiment in which only the small-scale Chinese treebank is used as training data and a single classifier (Transductive HMM or fnTBL) recognizes the Chinese chunks.

The improvement is significant: the F1 values of the two classifiers reach 83.41% and 85.34%, improvements of 2.13 and 7.12 points respectively.
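
The Co-training loop with consistency-based example selection described above might be sketched as follows. This is only a minimal illustration under stated assumptions, not the thesis implementation: "consistency" is assumed here to mean that both classifiers assign the same tag to an unlabeled example, the toy `MajorityByFeature` classifier is a hypothetical stand-in for Transductive HMM and fnTBL, and the `(word, POS)` tokens with `B-NP`/`I-NP`/`O` tags are invented for the example.

```python
from collections import Counter, defaultdict


class MajorityByFeature:
    """Toy classifier (hypothetical stand-in, NOT Transductive HMM or fnTBL):
    predicts the most frequent label seen for one feature of the example."""

    def __init__(self, feature_index):
        self.feature_index = feature_index
        self.table = {}
        self.default = None

    def fit(self, examples, labels):
        counts = defaultdict(Counter)
        for x, y in zip(examples, labels):
            counts[x[self.feature_index]][y] += 1
        self.table = {k: c.most_common(1)[0][0] for k, c in counts.items()}
        # Fall back to the overall most frequent label for unseen feature values.
        self.default = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, x):
        return self.table.get(x[self.feature_index], self.default)


def co_train(clf_a, clf_b, labeled, unlabeled, rounds=5):
    """Grow the labeled set with unlabeled examples on which both classifiers
    agree (the assumed reading of 'consistency'), then retrain."""
    examples = [x for x, _ in labeled]
    labels = [y for _, y in labeled]
    pool = list(unlabeled)
    for _ in range(rounds):
        clf_a.fit(examples, labels)
        clf_b.fit(examples, labels)
        agreed, remaining = [], []
        for x in pool:
            y_a, y_b = clf_a.predict(x), clf_b.predict(x)
            if y_a == y_b:
                agreed.append((x, y_a))
            else:
                remaining.append(x)
        if not agreed:
            break  # no new consistent examples, so further rounds change nothing
        for x, y in agreed:
            examples.append(x)
            labels.append(y)
        pool = remaining
    # Final retraining on the enlarged labeled set.
    clf_a.fit(examples, labels)
    clf_b.fit(examples, labels)
    return examples, labels, pool


# Invented toy data: (word, POS) tokens with chunk tags.
labeled = [(("the", "DT"), "B-NP"), (("dog", "NN"), "I-NP"), (("runs", "VB"), "O")]
unlabeled = [("a", "DT"), ("dog", "NN"), ("the", "NN")]

by_word = MajorityByFeature(0)  # looks only at the word
by_pos = MajorityByFeature(1)   # looks only at the POS tag
examples, labels, leftover = co_train(by_word, by_pos, labeled, unlabeled)
print(len(examples), leftover)
```

In this toy run the two classifiers agree on `("a", "DT")` and `("dog", "NN")`, which are added to the labeled set, while `("the", "NN")` stays in the unlabeled pool because the word-based and POS-based predictions conflict. A real chunker would label whole sentences rather than isolated tokens and would typically also weigh classifier confidence.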
Keywords/Search Tags: partial parsing, example-based machine translation, machine learning, text chunking, self-training, Co-training, consistency, example selection strategy