
Corrected-Word-Frequency-based Approach To Word Segmentation Using Unsupervised Learning

Posted on: 2007-05-15
Degree: Master
Type: Thesis
Country: China
Candidate: X Huang
Full Text: PDF
GTID: 2178360185974720
Subject: Computer software and theory
Abstract/Summary:
Automatic Chinese word segmentation is a fundamental task in Chinese information processing and plays a key role in intelligent Chinese information processing. At the same time, with the rapid growth of the World Wide Web and electronic journals, it faces a series of formidable challenges. One of the leading research goals is to improve the adaptability and robustness of Chinese word segmentation systems under open conditions. To that end, this thesis presents a corrected-word-frequency-based approach to word segmentation using unsupervised learning. The work covers the following aspects.

First, by introducing the corrected word frequency, an automatic unsupervised learning mechanism for word frequencies is proposed. It avoids the limitations of traditional methods that depend on a dictionary or a training corpus and improves the robustness of the system. It also achieves static reorganization of knowledge together with a dynamic learning mechanism, so that the system optimizes itself adaptively. In addition, a new framework for the word segmentation system model is presented.

Second, the thesis presents a new context-based N-maximum-probability method, which adopts the learning mechanism based on corrected word frequency and uses word frequencies from both the training corpus and real texts. The statistical model avoids the limitations of the dictionary and the training corpus, improves the robustness of the system, and tries to cover the correct segmentation with as few candidates as possible.

Third, for unknown word recognition, the thesis presents a new statistical measure, mt (combination statistic), defined as a linear and non-linear combination of two common statistical measures, mutual information and difference of t-test, together with a subgraph extraction technique.
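The abstract does not give the formula for mt, so the following is only a minimal sketch of how mutual information and difference of t-test can be computed for adjacent characters and combined linearly. The function names, the variance approximation inside `t_score`, the weight `lam`, and the toy string are all assumptions made for illustration; the non-linear combination and the subgraph extraction step are not shown.

```python
import math
from collections import Counter


def count_ngrams(text):
    """Count single characters and adjacent character pairs."""
    return Counter(text), Counter(zip(text, text[1:]))


def mutual_information(x, y, uni, bi):
    """Pointwise mutual information (base 2) of the adjacent pair x y."""
    if bi[(x, y)] == 0:
        return float("-inf")
    n_uni = sum(uni.values())
    n_bi = sum(bi.values())
    p_xy = bi[(x, y)] / n_bi
    return math.log2(p_xy / ((uni[x] / n_uni) * (uni[y] / n_uni)))


def t_score(left, mid, right, uni, bi):
    """Positive if `mid` binds more to `right`, negative if more to `left`."""
    if uni[left] == 0 or uni[mid] == 0:
        return 0.0
    p_right = bi[(mid, right)] / uni[mid]   # estimate of p(right | mid)
    p_mid = bi[(left, mid)] / uni[left]     # estimate of p(mid | left)
    # Assumed variance approximation: count / (context count squared).
    var = bi[(mid, right)] / uni[mid] ** 2 + bi[(left, mid)] / uni[left] ** 2
    return (p_right - p_mid) / math.sqrt(var) if var > 0 else 0.0


def dts(v, x, y, w, uni, bi):
    """Difference of t-test for the gap between x and y in context v x y w."""
    return t_score(v, x, y, uni, bi) - t_score(x, y, w, uni, bi)


def mt_linear(v, x, y, w, uni, bi, lam=0.5):
    """Hypothetical linear combination: mutual information + lam * dts."""
    return mutual_information(x, y, uni, bi) + lam * dts(v, x, y, w, uni, bi)


# Toy data: "ab" recurs as a unit, while what follows "b" varies.
uni, bi = count_ngrams("abxyabyxabxxabyy")
inside = mt_linear("x", "a", "b", "x", uni, bi)   # gap inside "ab"
across = mt_linear("a", "b", "x", "y", uni, bi)   # gap between "b" and "x"
```

In this toy example `inside` comes out larger than `across`, which matches the intuition that a high combined score marks two characters as likely to belong to the same word.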
The experimental results show that mt and the subgraph extraction technique benefit unknown word recognition. Based on these characteristics, an unknown word recognition method that combines mt and subgraph extraction is proposed.

Finally, a new word segmentation algorithm is proposed that combines the unknown word recognition method with the context-based N-maximum-probability method. The result of a preliminary experiment using the corpora (PD9801 and PD970310), which are widely...
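As a rough illustration of the N-maximum-probability idea, the sketch below keeps the N most probable segmentations of each prefix under a plain unigram word model. It is a generic N-best dynamic program, not the thesis's context-based model: the function name, the `max_word_len` limit, and the word probabilities are hypothetical, and no corrected word frequency or unknown word recognition is involved.

```python
import heapq
import math


def n_best_segmentations(text, word_prob, n=2, max_word_len=4):
    """Return up to n (log_probability, words) pairs, most probable first."""
    # best[i] holds the top-n candidate segmentations of text[:i].
    best = [[] for _ in range(len(text) + 1)]
    best[0] = [(0.0, [])]
    for end in range(1, len(text) + 1):
        candidates = []
        for start in range(max(0, end - max_word_len), end):
            word = text[start:end]
            p = word_prob.get(word)
            if p is None:
                continue  # not a known word, so no candidate ends with it
            for score, words in best[start]:
                candidates.append((score + math.log(p), words + [word]))
        best[end] = heapq.nlargest(n, candidates, key=lambda c: c[0])
    return best[len(text)]


# Hypothetical word probabilities, chosen only for illustration.
word_prob = {
    "研究": 0.02, "研究生": 0.005, "生命": 0.01,
    "研": 0.001, "究": 0.001, "生": 0.003, "命": 0.002,
}
top = n_best_segmentations("研究生命", word_prob, n=2)
segmentations = [words for _, words in top]
```

With these made-up probabilities the two candidates kept for the ambiguous string are 研究/生命 and 研究生/命, so a later stage only has to choose between a small set of plausible segmentations.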
Keywords/Search Tags:unsupervised learning, corrected-word-frequency, the context, mt, subgraph extraction