Research On Chinese Text Chunking Based On Co-training

Posted on:2006-11-24

Degree:Master

Type:Thesis

Country:China

Candidate:S Y Liu

Full Text:PDF

GTID:2168360155958180

Subject:Computer system architecture

Abstract/Summary:

Syntactic analysis has always been a basic task in natural language processing (NLP), and partial parsing, also called shallow parsing or chunk identification, has become a research hotspot. Chunk identification is now widely used in many areas of NLP, especially in example-based machine translation (EBMT), where it is one of the major techniques.

With the development of machine learning theory, machine learning methods have become increasingly attractive in NLP, especially unsupervised and semi-supervised methods. There are two reasons for this. First, a labeled training set is the basis of most NLP methods, but labeling a training set by hand is costly: it requires many people with strong expert knowledge working hard. Second, with the advent of the information era and the development of the Internet, the content of the Internet is growing exponentially, and this raw data can be obtained freely and used in NLP research.

In this paper we study the recognition of Chinese chunks with the Co-training method. We give a definition of the Chinese chunk and then discuss the formal definition of the Co-training algorithm under the PAC framework. First, we define the two "views" of the examples by choosing two classifiers based on different algorithmic theories. We then propose an example selection method based on consistency, combining two classifiers, Transductive HMM and fnTBL, into a classification system that performs the Chinese text chunking task with a small-scale labeled Chinese treebank and a large-scale unlabeled Chinese corpus. The results were compared with the self-training results, that is, the results of the non-Co-training experiments in which only the small-scale Chinese treebank was used as training data and a single classifier (Transductive HMM or fnTBL) was used to recognize Chinese chunks. The improvement is significant: the F1 values of the two classifiers reached 83.41% and 85.34%, improvements of 2.13 and 7.12 points respectively.
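
The general shape of a co-training loop with agreement-based example selection can be sketched as follows. This is only an illustration under stated assumptions, not the method of the thesis: the abstract does not spell out the selection criterion, so "consistency" is read here as "both classifiers assign the same tag sequence to a sentence". The class `MajorityTagger`, the function `co_train`, the two key functions, and the toy romanized data are all hypothetical; the two toy taggers merely stand in for Transductive HMM and fnTBL.

```python
from collections import Counter, defaultdict


class MajorityTagger:
    """Toy stand-in for a real chunker (hypothetical, not Transductive HMM or fnTBL).

    Predicts the most frequent chunk tag seen in training for a key
    derived from the token at each position.
    """

    def __init__(self, key_fn):
        self.key_fn = key_fn
        self.counts = defaultdict(Counter)
        self.default = "O"

    def fit(self, sentences):
        self.counts.clear()
        all_tags = Counter()
        for tokens, tags in sentences:
            for i, tag in enumerate(tags):
                self.counts[self.key_fn(tokens, i)][tag] += 1
                all_tags[tag] += 1
        if all_tags:
            # Fall back to the globally most frequent tag for unseen keys.
            self.default = all_tags.most_common(1)[0][0]
        return self

    def predict(self, tokens):
        tags = []
        for i in range(len(tokens)):
            seen = self.counts.get(self.key_fn(tokens, i))
            tags.append(seen.most_common(1)[0][0] if seen else self.default)
        return tags


def co_train(labeled, unlabeled, clf_a, clf_b, rounds=5):
    """Grow the labeled set with sentences on which both classifiers agree."""
    labeled = list(labeled)
    pool = list(unlabeled)
    for _ in range(rounds):
        clf_a.fit(labeled)
        clf_b.fit(labeled)
        agreed, remaining = [], []
        for tokens in pool:
            tags_a = clf_a.predict(tokens)
            if tags_a == clf_b.predict(tokens):
                agreed.append((tokens, tags_a))  # assumed selection rule: full agreement
            else:
                remaining.append(tokens)
        if not agreed:
            break  # nothing new to learn from
        labeled.extend(agreed)
        pool = remaining
    clf_a.fit(labeled)
    clf_b.fit(labeled)
    return clf_a, clf_b, labeled


# Tokens are (word, POS) pairs; the two "views" key on the word and on the POS tag.
by_word = MajorityTagger(lambda toks, i: toks[i][0])
by_pos = MajorityTagger(lambda toks, i: toks[i][1])

labeled = [([("wo", "PN"), ("xihuan", "VV"), ("shu", "NN")], ["B-NP", "B-VP", "B-NP"])]
unlabeled = [
    [("ta", "PN"), ("xihuan", "VV"), ("shu", "NN")],
    [("ta", "PN"), ("kan", "VV"), ("shu", "NN")],
]

by_word, by_pos, grown = co_train(labeled, unlabeled, by_word, by_pos)
```

In this toy run only the first unlabeled sentence is added, because the two views disagree on the unseen verb in the second one. A real system would more likely select by confidence or partial agreement and would evaluate F1 on held-out data after each round.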

Keywords/Search Tags:

partial parsing, example-based machine translation, machine learning, text chunking, self-training, Co-training, consistency, example selection strategy

Related items

1	Research On Discriminative Training Methods For Statistical Machine Translation
2	Optimization On Translation Knowledge In Statistical Machine Translation
3	Reinforcement Learning-Based Neural Machine Translation Models
4	Study On Several Key Problems In The Training Process Of Phrase-based Statistical Machine Translation
5	Research On Support Vector Machine Accelerated Training Algorithm
6	Research On Semi-supervised Mongolian-Chinese Neural Machine Translation Based On Cooperative Training
7	Machine learning approaches for dealing with limited bilingual training data in statistical machine translation
8	Training Large-Scale Statistical Machine Translation Models On Spark
9	Implementation Of Indonesian Machine Translation System Based On Deep Learning
10	Identification Of English Functional Noun Phrases For Machine Translation