Corrected-Word-Frequency-based Approach To Word Segmentation Using Unsupervised Learning

Posted on:2007-05-15

Degree:Master

Type:Thesis

Country:China

Candidate:X Huang

GTID:2178360185974720

Subject:Computer software and theory

Abstract/Summary:

Automatic Chinese word segmentation is a fundamental task in Chinese information processing, and it plays a key role in intelligent Chinese information processing. At the same time, with the rapid growth of the WWW and e-journals, automatic Chinese word segmentation faces a series of formidable challenges. One of the leading research goals is to improve the adaptability and robustness of Chinese word segmentation systems under open conditions. To this end, a corrected-word-frequency-based approach to word segmentation using unsupervised learning is presented. This dissertation studies the problem from the following aspects.

First, by introducing the corrected-word-frequency, an automatic unsupervised learning mechanism for word frequency is proposed. It avoids the limitations of traditional methods based on a dictionary or training corpus and improves the robustness of the system. It also achieves static regrouping of knowledge together with a dynamic learning mechanism, so that the system optimizes itself adaptively. In addition, a new framework for the word segmentation system model is presented.

Second, the thesis presents a new context-based N-maximum-probability method, which adopts the learning mechanism based on corrected-word-frequency and incorporates the word frequencies of both the training corpus and real texts. The statistical model avoids the limitations of the dictionary and training corpus, improves the robustness of the system, and tries to cover the correct segmentation with as few candidates as possible.

Third, for unknown word recognition, the thesis presents a new statistical measure, mt (combination statistic), defined as a linear and non-linear combination of two common statistical measures, mutual information and difference of t-test, together with a subgraph extraction technique.
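As a rough illustration of one ingredient of mt, the sketch below computes pointwise mutual information for adjacent character pairs from raw counts. It is only a minimal example under stated assumptions: the function name `adjacent_mutual_information` and the sample string are hypothetical, and the abstract does not give the formula for the difference of t-test or for the mt combination itself, so neither is reproduced here.

```python
import math
from collections import Counter


def adjacent_mutual_information(text):
    """Return {(x, y): log2(p(xy) / (p(x) * p(y)))} for adjacent characters.

    Probabilities are plain relative frequencies; no smoothing is applied.
    """
    char_counts = Counter(text)
    pair_counts = Counter(zip(text, text[1:]))
    n_chars = sum(char_counts.values())
    n_pairs = sum(pair_counts.values())

    scores = {}
    for (x, y), count in pair_counts.items():
        p_xy = count / n_pairs
        p_x = char_counts[x] / n_chars
        p_y = char_counts[y] / n_chars
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores


# Hypothetical toy text, not taken from the thesis corpora.
sample = "中国人民银行和中国人民"
mi = adjacent_mutual_information(sample)
```

In this toy text the word-internal pair ("银", "行") scores higher than the pair ("民", "银"), which straddles a word boundary. With so little data the scores are unreliable, which is one reason a second statistic is usually combined with mutual information.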
The experimental results show that mt and the subgraph extraction technique benefit unknown word recognition. Considering these characteristics, a recognition method for unknown words that combines mt and subgraph extraction is proposed.

Finally, a new word segmentation algorithm that combines the unknown word recognition method with the context-based N-maximum-probability method is proposed. The result of a preliminary experiment using the corpus (PD9801 and PD970310) which is widely...
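The idea of keeping the N most probable segmentations can be sketched with a small dynamic program over unigram word frequencies. This is a simplified illustration, not the thesis's algorithm: it ignores context and the corrected-word-frequency update, and the function name `n_best_segmentations`, the frequency table, and the `unk_count` smoothing value are all assumptions made for the example.

```python
import heapq
import math


def n_best_segmentations(text, freq, n=2, max_len=4, unk_count=0.5):
    """Return up to n (log_probability, words) pairs, best first.

    freq maps known words to counts. Unknown single characters get the
    pseudo-count unk_count so that every text can be segmented.
    """
    total = sum(freq.values())
    # best[i] holds the n highest-scoring segmentations of text[:i].
    best = [[] for _ in range(len(text) + 1)]
    best[0] = [(0.0, [])]

    for end in range(1, len(text) + 1):
        candidates = []
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word in freq:
                count = freq[word]
            elif len(word) == 1:
                count = unk_count  # assumed smoothing for unseen characters
            else:
                continue
            log_p = math.log(count / total)
            for prev_log_p, prev_words in best[start]:
                candidates.append((prev_log_p + log_p, prev_words + [word]))
        best[end] = heapq.nlargest(n, candidates, key=lambda c: c[0])

    return best[len(text)]


# Hypothetical word counts, not taken from PD9801 or PD970310.
toy_freq = {"研究": 5, "研究生": 2, "生命": 4, "生": 1, "命": 1, "起源": 3}
results = n_best_segmentations("研究生命起源", toy_freq, n=2)
```

With these toy counts the top candidate is 研究 / 生命 / 起源 and the runner-up is 研究生 / 命 / 起源. Keeping a short candidate list rather than a single answer leaves room for a later step, such as unknown word recognition, to choose among them.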

Keywords/Search Tags:

unsupervised learning, corrected-word-frequency, the context, mt, subgraph extraction

Related items

1	Unsupervised Relation Extraction Based On Matrix Factorization
2	Research On Unsupervised Cross-lingual Mappings Of Word Embeddings
3	Research On Graph-based Keyphrase Extraction Integrating Multiple Attributes
4	A Chinese Unsupervised Word Sense Disambiguation Method Based On Semantic Vector
5	Context Computing Applications, Word Disambiguation
6	Research On Exact Subgraph Search Technology In Graph Database
7	On Learning Classifier System Clustering And Backbone Extraction Methods Under Unsupervised Learning Framework
8	An Unsupervised Approach To Word Sense Disambiguation Based On Second-order Context
9	Research On Enhancement Of Isomorphism Of Word Embeddings For Dictionary Extraction
10	Unsupervised Cross-lingual Word Representation Learning Method Based On Co-training