Research And Implementation Of Chinese Word Segmentation Algorithm

Posted on:2012-10-10

Degree:Master

Type:Thesis

Country:China

Candidate:D S Lin

Full Text:PDF

GTID:2178330332493927

Subject:Computer software and theory

Abstract/Summary:

Chinese word segmentation is the process of dividing a sequence of Chinese characters into a reasonable sequence of words according to a specific standard. As a fundamental task of natural language processing, it is widely used in related fields, so research on Chinese word segmentation algorithms has both theoretical and practical significance.

To meet the practical requirements of upper-layer applications, this thesis proposes a method that combines mechanical word segmentation with statistical word segmentation. First, mechanical word segmentation, which is much faster than statistical word segmentation, produces the initial segmentation. Ambiguities are detected with an improved forward-backward maximum matching method and resolved by omni-segmentation based on a bigram model. Second, unregistered words are recognized with a role-based named entity recognition method. Finally, a rule library is introduced to further correct the segmentation results.

The work of this thesis is as follows:

1) A dictionary structure with a secondary index is adopted to speed up access, and Java object serialization is used both to serialize the dictionary object to a file and to load (deserialize) it.

2) For ambiguity detection, an improved FBMM (forward-backward maximum matching) detection algorithm is proposed, which detects not only ambiguities with an odd chain length but also all even-length chains whose crossing section has length one. Omni-segmentation is adopted to resolve the ambiguities.

3) For unregistered word recognition, the forward Viterbi algorithm is first used to solve the decoding problem of the Hidden Markov Model (HMM), which performs POS (part-of-speech) tagging and role labeling. Then, within the set of role patterns, maximum matching of pattern strings is applied to identify Chinese proper nouns. Finally, to further improve segmentation accuracy, a small correction rule library is introduced to repair "fragmentation" in the segmentation results.

4) Most current Chinese word segmentation packages are developed in C++. Java, although one of the mainstream programming languages, is used comparatively rarely. On the basis of the proposed algorithm, a Java-based Chinese word segmentation system is therefore designed and implemented in this thesis.

Experimental results show that the algorithm segments about 2,100 words per second and reaches an F1 score of around 95% on a machine with a 3.0 GHz CPU and 2 GB of memory, which is sufficient for most upper-layer applications.
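
As an illustration of the ambiguity detection idea, the following is a minimal Java sketch of plain bidirectional maximum matching. It is not the improved FBMM algorithm of the thesis, whose details are not given in this abstract. The class name MaxMatchDemo, the four-word dictionary, and the maximum word length of 3 are all assumptions made for the example. The text is scanned once left to right and once right to left, always taking the longest dictionary word, and a disagreement between the two results is treated as an ambiguity. For the input 研究生命起源, the forward scan gives 研究生 / 命 / 起源 and the backward scan gives 研究 / 生命 / 起源, so the string is flagged.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy bidirectional maximum matching; all names are illustrative, not from the thesis. */
class MaxMatchDemo {
    // Hypothetical four-word dictionary, written with Unicode escapes so the
    // file compiles under any source encoding.
    static final Set<String> DICT = new HashSet<>(Arrays.asList(
            "\u7814\u7a76",        // yanjiu: "research"
            "\u7814\u7a76\u751f",  // yanjiusheng: "graduate student"
            "\u751f\u547d",        // shengming: "life"
            "\u8d77\u6e90"));      // qiyuan: "origin"
    static final int MAX_LEN = 3;  // longest dictionary word, in characters

    /** Forward maximum matching: scan left to right, longest match first. */
    static List<String> forward(String text) {
        List<String> words = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int len = Math.min(MAX_LEN, text.length() - i);
            // Shrink the window until it is a dictionary word or a single character.
            while (len > 1 && !DICT.contains(text.substring(i, i + len))) {
                len--;
            }
            words.add(text.substring(i, i + len));
            i += len;
        }
        return words;
    }

    /** Backward maximum matching: scan right to left, longest match first. */
    static List<String> backward(String text) {
        List<String> words = new ArrayList<>();
        int end = text.length();
        while (end > 0) {
            int len = Math.min(MAX_LEN, end);
            while (len > 1 && !DICT.contains(text.substring(end - len, end))) {
                len--;
            }
            words.add(text.substring(end - len, end));
            end -= len;
        }
        Collections.reverse(words); // words were collected from right to left
        return words;
    }

    /** Flags the text as ambiguous when the two scans disagree. */
    static boolean isAmbiguous(String text) {
        return !forward(text).equals(backward(text));
    }

    public static void main(String[] args) {
        String text = "\u7814\u7a76\u751f\u547d\u8d77\u6e90"; // "research the origin of life"
        System.out.println("FMM: " + String.join(" / ", forward(text)));
        System.out.println("BMM: " + String.join(" / ", backward(text)));
        System.out.println("ambiguous: " + isAmbiguous(text));
    }
}
```

In a pipeline like the one described above, only the spans where the two scans disagree would need to be passed on to the slower statistical disambiguation step, which is one plausible reason for using mechanical segmentation as the first pass.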

Keywords/Search Tags:

word segmentation, dictionary, ambiguity reduction, unregistered words, rule library

Related items

1	Improvement And Implementation Of Chinese Word Segmentation Algorithm Based On Dictionary
2	The Research And Implementation Of The System For Chinese Word Segmentation Base On Dictionary And Statistic
3	Research And Implementation Of Mobile Word Segmentation
4	Reverse Backtracking Research Of Chinese Segmentation Based On Last Word Dictionary
5	Research And Implementation Of Chinese Word Segmentation Based On The Combination Of Statistics And Dictionary
6	Chinese Word Segmentation Method Based On Dictionary And Statistics Of The Words
7	Based On Dictionary And Word Frequency Analysis Of The Unknown Words From The Bbs Of Corpus Recognition Research
8	Research Of Chinese Word Segmentation Technology Applied In Police Information System
9	Research Into Chinese Word Segmentation Based On Statistic And Regulation
10	The Research Of Chinese Word Segmentation Algorithm Based On Dictionary And Probability Statistics