
Research On Chinese Chunking Based On Statistical Learning Method

Posted on: 2009-09-19
Degree: Doctor
Type: Dissertation
Country: China
Candidate: G L Sun
Full Text: PDF
GTID: 1118360278461933
Subject: Computer application technology
Abstract/Summary:
With the advent of the Internet era, language processing technologies are widely used to handle the vast amount of electronic text on the web. For much research in natural language processing and related applications, studying high-performance automatic Chinese chunking methods, which represent the trend of shallow parsing technology, is valuable in both theory and practice. As techniques for acquiring large-scale natural-language text, machine learning methods, and corpus linguistics have matured, it has become possible to obtain large amounts of meaningfully tagged data and to process and tag text automatically by building analysis models. Based on statistical machine learning methods and Chinese chunking corpora, this thesis presents chunking methods: new features that help effective chunking are incorporated into the chunking model to improve its performance, and an integrated chunking model combining lexical analysis and chunk analysis is built. The major contents of this thesis comprise the following four parts.

First, the definition of Chinese chunks and the construction of chunking corpora were studied. Three types of chunking corpora were presented, based on different requirements and construction methods. The first method extracted the lowest non-terminals of the parse tree as chunks; the lowest non-terminals are constituents whose children are all preterminals, so the result can be used as the first step of full parsing. The second method extracted and transformed chunks from the UPenn Chinese Parsing Treebank, with rules designed for extraction, transformation, and pruning; the corpus construction algorithm, named Chinese Chunklink, was established. The third method was based on manual tagging.
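The first extraction method above can be sketched as follows. This is an illustrative toy, not the thesis's actual implementation: the tuple-based tree encoding and the example sentence are assumptions made for the sketch.

```python
def is_preterminal(node):
    # A preterminal is a (POS-tag, word) pair dominating exactly one leaf.
    return (isinstance(node, tuple)
            and len(node) == 2
            and isinstance(node[1], str))

def lowest_nonterminals(node, chunks):
    """Collect constituents whose children are all preterminals."""
    if isinstance(node, str) or is_preterminal(node):
        return
    label, children = node[0], node[1:]
    if all(is_preterminal(c) for c in children):
        chunks.append((label, [c[1] for c in children]))
    else:
        for c in children:
            lowest_nonterminals(c, chunks)

# Toy parse: (IP (NP (NN 研究)) (VP (VV 进行) (NP (NN 分析))))
tree = ("IP",
        ("NP", ("NN", "研究")),
        ("VP", ("VV", "进行"),
               ("NP", ("NN", "分析"))))
chunks = []
lowest_nonterminals(tree, chunks)
print(chunks)  # [('NP', ['研究']), ('NP', ['分析'])]
```

Only the two NPs qualify here: the VP node has a non-preterminal child, so it is not itself extracted as a chunk.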
Drawing on the knowledge of linguists, a proper specification for Chinese chunking was compiled from the viewpoint of chunk annotation of real texts. Given the differing requirements of applications, designing a chunking corpus suited to the machine learning model proved to be a good approach.

Second, the chunking model was studied. The chunking task was transformed into a sequence labeling problem by means of boundary tags and type tags for chunks. Chunking methods based on a generative model and on statistical rules were proposed. The maximum entropy Markov model was adapted to the chunking task, producing a discriminative chunking model. Conditional random fields, which overcome the shortcomings of both of these model types, were then applied to the chunking task; the key issues were the training method, feature engineering, and the integrated conditional random field model. The analysis showed that conditional random fields can merge different types of features and label chunks sequentially. An N-fold template correction post-processing algorithm was introduced to further improve performance.

Third, feature selection in the chunking model was studied, raising several important issues. The effect of common features was examined through an analysis of feature types and feature extraction methods in the discriminative learning model. New features were investigated and merged into the chunking model to improve performance. To address the performance bottleneck, an algorithm that automatically extracts semantic classes was designed, based on the Minimum Description Length principle combined with concept similarity computation; experimental results show that semantic features improve chunking performance effectively. To address cascaded chunking errors caused by automatic part-of-speech tagging, two new chunking features were designed.
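The reduction of chunking to sequence labeling via boundary and type tags can be illustrated with the common BIO scheme; the scheme choice and the example spans below are assumptions for the sketch, not necessarily the thesis's exact tag set.

```python
def chunks_to_bio(tokens, chunks):
    """Map (start, end, type) chunk spans onto per-token BIO labels.

    `chunks` holds half-open spans (start, end, chunk_type);
    tokens outside every chunk get the label 'O'.
    """
    labels = ["O"] * len(tokens)
    for start, end, ctype in chunks:
        labels[start] = "B-" + ctype          # chunk-initial boundary tag
        for i in range(start + 1, end):
            labels[i] = "I-" + ctype          # chunk-internal tag
    return labels

tokens = ["中国", "经济", "发展", "迅速"]
spans = [(0, 2, "NP"), (2, 3, "VP")]
print(chunks_to_bio(tokens, spans))
# ['B-NP', 'I-NP', 'B-VP', 'O']
```

Once chunks are encoded this way, any sequence labeler (MEMM, CRF, and so on) can be trained to predict one label per token.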
One new feature was based on entropy from information theory and a hierarchical clustering algorithm; the other was based on a class prior algorithm. These two task-oriented features, generated from the chunking corpus itself, showed stronger predictive power for chunking and effectively avoided the negative impact of automatic part-of-speech tagging.

Fourth, the integrated chunking model was studied. On the basis of the constructed chunking corpora and models, an integrated chunking model based on cascaded conditional random fields was proposed. The N-best results of part-of-speech tagging were used as the input of the chunking model, partially restraining the spread and effect of cascaded errors on performance. The key issue is the construction of the new integrated model, in which task-oriented features, together with named entity and factoid recognition, replace the automatic part-of-speech features. The new model avoids cascaded errors and improves performance, in a new pattern that shortens the chunking procedure and saves chunking time. The reliability of chunking results was estimated with a constrained forward-backward decoding algorithm.
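The constrained forward-backward idea behind the reliability estimate can be sketched on a toy linear-chain model: the confidence of a labeled span is the partition function computed with that span's labels clamped, divided by the unconstrained partition function. The scores below are invented for illustration and do not come from the thesis.

```python
import math

def log_z(emis, trans, clamp=None):
    """Log partition over label sequences; `clamp` maps position -> forced label."""
    n, num_labels = len(emis), len(emis[0])
    def allowed(t, y):
        return clamp is None or clamp.get(t, y) == y
    # Forward pass in log space; disallowed labels get score -inf.
    alpha = [emis[0][y] if allowed(0, y) else -math.inf for y in range(num_labels)]
    for t in range(1, n):
        alpha = [
            (math.log(sum(math.exp(alpha[yp] + trans[yp][y])
                          for yp in range(num_labels))) + emis[t][y])
            if allowed(t, y) else -math.inf
            for y in range(num_labels)]
    return math.log(sum(math.exp(a) for a in alpha))

# Toy model: 3 labels (0=O, 1=B-NP, 2=I-NP), 3 tokens; all scores illustrative.
emis = [[0.2, 1.0, -1.0], [0.1, -0.5, 1.2], [0.9, 0.3, -0.2]]
trans = [[0.3, 0.2, -2.0], [-1.0, -0.5, 1.0], [0.0, 0.1, 0.8]]

# Confidence that tokens 0-1 form an NP chunk (labels B-NP, I-NP):
p = math.exp(log_z(emis, trans, {0: 1, 1: 2}) - log_z(emis, trans))
print(0.0 < p < 1.0)  # True
```

Clamping every label position in turn yields marginals that sum to one, which is a handy sanity check on such an implementation.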
Keywords/Search Tags:Chinese Chunking, Conditional Random Fields, Semantic Features, Word Clustering, Integrated Model