Statistics-based Chinese Automatic Segmentation System

Posted on: 2006-09-24
Degree: Master
Type: Thesis
Country: China
Candidate: E Z Duan
Full Text: PDF
GTID: 2208360152498495
Subject: Computer software and theory
Abstract/Summary:
This paper discusses the theory and implementation of a statistics-based automatic Chinese word segmentation system. It first reviews the history of word segmentation, summarizes its objectives, analyzes its open problems, and surveys the theoretical results of earlier work on corpus-based statistical segmentation. Then, based on an analysis of the shortcomings of the existing processing model for statistical Chinese word segmentation, the paper proposes an iterative processing model and puts forward the view that the language model should be trained and optimized together with the lexicon. The lexicon is the basis of word segmentation. To construct a lexicon that is stable yet can still change dynamically, the statistics-based method needs a large amount of text to build an initial lexicon. A PAT tree is used to construct the initial lexicon in order to overcome memory and efficiency limits. The paper formally defines the PAT tree, studies its structure, working principle, and possible improvements, and discusses its construction algorithm in detail. It then studies the structure of the initial lexicon, formally defines its internal structure, examines its construction process (text processing and PAT tree construction), discusses its construction algorithm, and analyzes the performance of the PAT tree and the initial lexicon. The initial lexicon is then processed to reduce its size and to improve the efficiency and accuracy of segmentation. The paper discusses the principles behind this processing, the composition of the resulting lexicon, and the processing steps in detail. Next, the paper segments texts using the lexicon.
It establishes the principles and approach of word segmentation and mainly analyzes the method's advantages in resolving overlapping (intersection-type) ambiguity and in recognizing words that are not in the lexicon. Because memory is limited, the lexicon and the language model are optimized continually. The paper studies this iterative optimization, focusing on how iteratively optimizing the lexicon and the language model resolves problems that arise during segmentation. It then walks through the whole segmentation process using texts from a specific domain as an example, covering construction of the PAT tree, construction and processing of the initial lexicon and the lexicon, and analysis of the results. Finally, the paper analyzes the shortcomings of the system and outlines future work.
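The pipeline summarized above (collect substring statistics from a corpus, keep frequent strings as an initial lexicon, then segment text against that lexicon) can be illustrated with a minimal sketch. This is not the thesis's implementation: a plain `Counter` stands in for the PAT tree, forward maximum matching is assumed as the segmentation step because the abstract does not name one, and the function names and the `max_len` and `min_count` parameters are hypothetical.

```python
from collections import Counter


def build_initial_lexicon(texts, max_len=4, min_count=2):
    """Count every substring of 2..max_len characters and keep the frequent ones.

    Assumption: a Counter replaces the PAT tree used in the thesis, and
    the length and frequency thresholds are illustrative only.
    """
    counts = Counter()
    for text in texts:
        for start in range(len(text)):
            for length in range(2, max_len + 1):
                if start + length <= len(text):
                    counts[text[start:start + length]] += 1
    return {s for s, c in counts.items() if c >= min_count}


def segment(text, lexicon, max_len=4):
    """Forward maximum matching: take the longest lexicon entry at each position."""
    words = []
    pos = 0
    while pos < len(text):
        for length in range(min(max_len, len(text) - pos), 0, -1):
            candidate = text[pos:pos + length]
            # A single character is always accepted as a fallback,
            # so the loop is guaranteed to advance.
            if length == 1 or candidate in lexicon:
                words.append(candidate)
                pos += length
                break
    return words
```

Calling `segment(text, build_initial_lexicon(corpus))` returns a list of words whose concatenation equals `text`. A purely frequency-based lexicon like this one also keeps many non-word fragments, which is the kind of noise a lexicon processing step would need to remove.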
Keywords/Search Tags:Corpus, Statistic, PAT tree, Lexicon, Word segmentation