Exploring Method Of Domain Adaptation For Chinese Segmentation Based On Active Learning

Posted on: 2016-09-12  Degree: Master  Type: Thesis
Country: China  Candidate: H T Xu
GTID: 2308330470455817  Subject: Computer technology
Abstract/Summary:
Chinese word segmentation is the task of automatically inserting spaces or other boundary markers between words in Chinese text. It is an important branch of natural language processing. The three main approaches are dictionary-based, rule-based, and statistical methods. All of them have a solid theoretical basis and mature segmentation models, and each can be applied to different natural language processing tasks according to its characteristics.

However, when a segmentation system trained on text from one domain is applied to a new domain, its accuracy usually drops significantly. This is because text in the new domain contains many words that are not covered by the system, and because different domains follow different word-formation rules. Building a separate segmentation system for every domain would require a large training corpus for each one, and the labor and time costs make this difficult in practice.

Because large-scale manually annotated training data are difficult to obtain, this thesis proposes a domain-adaptive Chinese word segmentation method based on active learning. The main idea is to analyze the target-domain text, select a small portion of the corpus that carries the most linguistic knowledge, annotate it manually, and then train a segmentation model for the target domain on the annotated corpus. In this way, manual annotation of a small corpus guides the segmentation of a large corpus. The main work of this thesis covers the following four aspects:

(1) A domain-adaptive Chinese word segmentation method based on active learning is proposed, which effectively improves segmentation accuracy with a small amount of annotation. A Chinese segmentation system for specific domains is designed and implemented, extending existing domain-adaptive Chinese word segmentation methods.
(2) Manual annotation standards for Chinese word segmentation in science and technology are developed, with the science and technology domain serving as the test case for domain adaptation. Taking the CTB segmentation standard as a basis, representative sentences from the science and technology literature are analyzed, and segmentation rules for proper nouns in this domain are added to the standard.
(3) The segmentation technique is applied to domain text corpora to verify the effectiveness of the proposed method. Performance is evaluated quantitatively using accuracy (precision), recall, and F-value. In addition, experiments are designed to analyze the relationship between the amount of manual annotation and the performance of the segmentation system, providing data to support the construction of domain-adaptive language models.
(4) To verify the practical effect of domain word segmentation, proper nouns are identified in a Chinese-English parallel corpus of science and technology literature, and a Chinese-English translation dictionary is constructed. Specifically, the Chinese word segmentation system is combined with GIZA++, Moses, and other tools to extract phrases and build the domain translation dictionary.

In summary, this dissertation uses an active learning algorithm to extend the adaptive segmentation system to formal domains, in order to enhance the domain-adaptive ability of Chinese word segmentation systems. The experimental results in the science and technology domain show that the proposed method enhances the domain adaptation of Chinese word segmentation and improves segmentation accuracy.
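
The evaluation measures named in the abstract can be illustrated with a short sketch. It assumes the usual word-level definitions for segmentation scoring, with the "accuracy rate" read as precision: a predicted word counts as correct only if both of its boundaries match a word in the manual segmentation. The function names `word_spans` and `segmentation_prf` are hypothetical and are not taken from the thesis, which does not give its scoring code.

```python
def word_spans(words):
    """Convert a list of words into a set of (start, end) character spans."""
    spans = set()
    start = 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def segmentation_prf(gold, predicted):
    """Return (precision, recall, F-value) for one segmented sentence.

    Assumption: a predicted word is correct only when its exact span
    also appears in the gold (manually annotated) segmentation.
    """
    if "".join(gold) != "".join(predicted):
        raise ValueError("gold and predicted must segment the same text")
    gold_spans = word_spans(gold)
    pred_spans = word_spans(predicted)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_value = 2 * precision * recall / (precision + recall)
    return precision, recall, f_value


# Illustrative sentence (not from the thesis corpus): the system merges
# the last two gold words into one, so only the first word is correct.
gold = ["自然", "语言", "处理"]
predicted = ["自然", "语言处理"]
p, r, f = segmentation_prf(gold, predicted)
print(f"precision={p:.3f} recall={r:.3f} F={f:.3f}")
```

Scoring on character spans rather than on the word strings themselves avoids crediting a word that appears in both segmentations at different positions. For a whole corpus, the correct, predicted, and gold counts would normally be summed over all sentences before computing the three values.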
Keywords/Search Tags: Chinese word segmentation, domain adaptation, active learning, natural language processing