Statistics-based Chinese Automatic Segmentation System

Posted on:2006-09-24

Degree:Master

Type:Thesis

Country:China

Candidate:E Z Duan

Full Text:PDF

GTID:2208360152498495

Subject:Computer software and theory

Abstract/Summary:

This thesis discusses the theory and workflow of a statistics-based automatic Chinese word segmentation system. It first reviews the history of word segmentation, summarizes its objectives, analyses its open problems, and surveys the theoretical results earlier researchers obtained on corpus-based statistical word segmentation. Then, based on an analysis of the shortcomings of the existing processing model for statistical Chinese word segmentation, it studies an iterative processing model and proposes that the language model be trained and optimized at the same time as the lexicon.

The lexicon is the basis of word segmentation. To build a lexicon that is stable yet can change dynamically, the statistical method needs a large amount of text from which to construct an initial lexicon. A PAT tree is used to construct the initial lexicon in order to overcome memory and efficiency limits. The thesis defines the PAT tree formally, studies its structure, working principle, and possible improvements, and discusses its construction algorithm in detail. It then studies the structure of the initial lexicon, formally defines its internal structure, and studies its construction process, including text preprocessing and construction of the PAT tree; it also discusses the construction algorithm for the initial lexicon and analyses the performance of both the PAT tree and the initial lexicon. The initial lexicon is then processed to reduce its size and to improve the efficiency and accuracy of word segmentation. The thesis discusses the principles for processing the initial lexicon and for composing the final lexicon, and examines both steps in detail. Next, the thesis segments text using the lexicon.
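
The initial-lexicon step described above can be illustrated with a small sketch. It is not the PAT tree the thesis defines: a PAT tree is a compressed (Patricia) trie over the semi-infinite strings of a corpus, whereas the code below uses a plain, depth-bounded suffix trie as a simpler stand-in that yields the same substring frequencies. The names `build_suffix_trie`, `frequency`, and `candidate_words`, and the thresholds `max_depth`, `min_count`, and `min_len`, are hypothetical and do not come from the thesis.

```python
class Node:
    """One trie node: child nodes by character, plus an occurrence count."""
    __slots__ = ("children", "count")

    def __init__(self):
        self.children = {}
        self.count = 0


def build_suffix_trie(text, max_depth=4):
    """Insert the first max_depth characters of every suffix of text.

    After construction, the count at the node reached by spelling out s
    equals the number of occurrences of s in text (for len(s) <= max_depth).
    """
    root = Node()
    for start in range(len(text)):
        node = root
        for ch in text[start:start + max_depth]:
            node = node.children.setdefault(ch, Node())
            node.count += 1
    return root


def frequency(root, s):
    """Occurrences of s in the indexed text; 0 if absent or longer than max_depth."""
    node = root
    for ch in s:
        node = node.children.get(ch)
        if node is None:
            return 0
    return node.count


def candidate_words(root, min_count=2, min_len=2):
    """Collect substrings seen at least min_count times as initial-lexicon candidates."""
    found = {}
    stack = [(root, "")]
    while stack:
        node, prefix = stack.pop()
        for ch, child in node.children.items():
            # A longer string can never be more frequent than its prefix,
            # so rare branches are pruned here.
            if child.count >= min_count:
                s = prefix + ch
                if len(s) >= min_len:
                    found[s] = child.count
                stack.append((child, s))
    return found


corpus = "中文分词系统研究中文分词方法"  # toy corpus, assumed already stripped of punctuation
trie = build_suffix_trie(corpus, max_depth=4)
print(frequency(trie, "中文分词"))  # 2
print(candidate_words(trie))
```

Raw frequency alone over-generates: every fragment of a frequent word is also frequent, which is one reason an initial lexicon built this way needs the further processing the abstract mentions.
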
The thesis establishes the principles and approach of word segmentation and analyses mainly the method's strengths in resolving overlapping (intersection-type) ambiguity and in recognizing words that are not in the lexicon. Because of memory limits, the lexicon and the language model are optimized continually. The thesis studies this iterative optimization, focusing on how repeatedly refining the lexicon and the language model addresses the problems that arise during segmentation. It then walks through the whole segmentation process using texts from a specific domain as an example, covering construction of the PAT tree, construction and processing of the initial lexicon and of the final lexicon, and analysis of the results. Finally, the thesis analyses the shortcomings of the system and outlines future work.
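
To make the idea of lexicon-driven statistical segmentation concrete, here is a minimal sketch that picks the most probable segmentation under a unigram word model using dynamic programming. The abstract does not say which language model or search procedure the thesis uses, so the unigram model, the function name `segment`, the parameters `max_word_len` and `unknown_count`, and the toy counts are all assumptions made for illustration.

```python
import math


def segment(text, word_counts, max_word_len=4, unknown_count=0.5):
    """Return the segmentation of text with the highest unigram log-probability.

    word_counts maps lexicon words to corpus frequencies. A single character
    missing from the lexicon is allowed as a fallback word with a small
    pseudo-count (an assumed, very crude treatment of unknown words).
    """
    total = sum(word_counts.values()) or 1

    def log_prob(word):
        return math.log(word_counts.get(word, unknown_count) / total)

    n = len(text)
    best = [0.0] + [-math.inf] * n   # best[i]: best score for text[:i]
    back = [0] * (n + 1)             # back[i]: start index of the last word in that best path
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            word = text[start:end]
            if len(word) > 1 and word not in word_counts:
                continue  # multi-character strings must come from the lexicon
            score = best[start] + log_prob(word)
            if score > best[end]:
                best[end] = score
                back[end] = start

    # Follow the back-pointers from the end of the text to recover the words.
    words = []
    end = n
    while end > 0:
        start = back[end]
        words.append(text[start:end])
        end = start
    return words[::-1]


# Toy counts chosen to show an overlapping ambiguity:
# "研究生" + "命" competes with "研究" + "生命".
counts = {"研究": 20, "研究生": 5, "生命": 15, "命": 2, "起源": 8}
print(segment("研究生命起源", counts))  # ['研究', '生命', '起源']
```

In this toy case the frequencies alone decide the overlapping ambiguity, because the dynamic program compares whole-sentence scores rather than committing to the longest match at each position.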

Keywords/Search Tags:

Corpus, Statistics, PAT tree, Lexicon, Word segmentation

Related items

1	The Research And Implementation Of The Chinese Word Segmentation System Combining Omni-Segmentation With Statistics
2	The Research And Implementation Of The Chinese Word Segmentation System Combining Omni-segmentation With Statistics
3	Research On Word Segmentation Based On Probabilistic Model Of Dynamic Lexicon
4	A Word Distributed Representation Approach For Bilingual Lexicon Extraction From Comparable Corpora
5	Emotion Analysis Of Chinese Microblogs Using Extended Emotion Lexicon
6	Comparative Research On Open-Source Chinese Word Segmentation Machines
7	A New Method For Automatic Recognition Of Unknown Words Based On A Forum Corpus
8	Chinese Word Segmentation Using Rule And Statistic
9	The Research On Chinese Word Segmentation System Based On SVM
10	Cascade Consistency Check Of Segmentation Of The Chinese Corpus