
Research And Implementation Of Chinese Word Segmentation Algorithm

Posted on: 2012-10-10
Degree: Master
Type: Thesis
Country: China
Candidate: D S Lin
Full Text: PDF
GTID: 2178330332493927
Subject: Computer software and theory
Abstract/Summary:
Chinese word segmentation is the process of dividing a sequence of Chinese characters into a reasonable sequence of words according to a specific standard. As a fundamental task in natural language processing, it is widely used in related fields, so research on Chinese word segmentation algorithms has both theoretical and practical significance.

To meet the practical requirements of upper-layer applications, this thesis proposes a method that combines mechanical (dictionary-based) word segmentation with statistical word segmentation. First, mechanical word segmentation, which is much faster than statistical word segmentation, produces the initial segmentation. Ambiguities are detected with an improved forward-backward maximum matching method and resolved by omni-segmentation based on a bigram model. Second, unregistered (out-of-vocabulary) words are recognized with a role-based named entity recognition method. Finally, a rule library is introduced to further correct the segmentation results.

The research work of this thesis includes the following:

1) A dictionary structure with a secondary index is adopted to speed up access, and Java Object Serialization is used to serialize the dictionary object to a file and to load (deserialize) it.

2) For ambiguity detection, an improved FBMM (forward-backward maximum matching) algorithm is proposed. It detects not only overlapping ambiguities with an odd chain length but also all even-length chains whose intersection string has length one. Omni-segmentation is adopted to resolve the detected ambiguities.

3) For unregistered word recognition, the forward Viterbi algorithm is first used to solve the decoding problem of the Hidden Markov Model (HMM), which performs POS (part-of-speech) tagging and role labeling. Then, within the set of role patterns, pattern-string maximum matching is applied to identify Chinese proper nouns. Finally, to further improve segmentation accuracy, a small correction rule library is introduced to repair "fragmentation" in the segmentation results.

4) Most current Chinese word segmentation packages are developed in C++. Java, although one of the mainstream programming languages, is used comparatively rarely for this task. Therefore, a Java-based Chinese word segmentation system is designed and implemented on the basis of the proposed algorithm.

Experimental results show that, on a machine with a 3.0 GHz CPU and 2 GB of memory, the algorithm segments about 2,100 words per second and reaches an F1 score of around 95%, which is sufficient for most upper-layer applications.
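
The forward and backward maximum matching that underlies the ambiguity detection step described above can be sketched in Java as follows. This is a minimal illustration of the plain technique only, not the improved FBMM algorithm of the thesis: the class name MaxMatchSketch, its method names, the four-word toy dictionary, and the maximum word length of 3 are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch: plain forward/backward maximum matching (not the thesis's improved FBMM). */
final class MaxMatchSketch {

    /** Forward maximum matching: scanning left to right, take the longest dictionary word at each position. */
    static List<String> forwardMaxMatch(String text, Set<String> dict, int maxLen) {
        List<String> words = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int end = Math.min(text.length(), i + maxLen);
            // Shrink the candidate until it is a dictionary word or a single character.
            while (end > i + 1 && !dict.contains(text.substring(i, end))) {
                end--;
            }
            words.add(text.substring(i, end));
            i = end;
        }
        return words;
    }

    /** Backward maximum matching: the same idea, scanning right to left. */
    static List<String> backwardMaxMatch(String text, Set<String> dict, int maxLen) {
        List<String> words = new ArrayList<>();
        int j = text.length();
        while (j > 0) {
            int start = Math.max(0, j - maxLen);
            // Shrink from the left until the candidate is a dictionary word or a single character.
            while (start < j - 1 && !dict.contains(text.substring(start, j))) {
                start++;
            }
            words.add(text.substring(start, j));
            j = start;
        }
        Collections.reverse(words); // words were collected right to left
        return words;
    }

    /** Flags the text as a candidate ambiguity when the two scans disagree. */
    static boolean isAmbiguous(String text, Set<String> dict, int maxLen) {
        return !forwardMaxMatch(text, dict, maxLen).equals(backwardMaxMatch(text, dict, maxLen));
    }

    public static void main(String[] args) {
        // Toy dictionary, assumed for illustration only.
        Set<String> dict = new HashSet<>(Arrays.asList("研究", "研究生", "生命", "起源"));
        String text = "研究生命起源";
        System.out.println("forward:   " + forwardMaxMatch(text, dict, 3));
        System.out.println("backward:  " + backwardMaxMatch(text, dict, 3));
        System.out.println("ambiguous: " + isAmbiguous(text, dict, 3));
    }
}
```

With this toy dictionary, the forward scan yields [研究生, 命, 起源] and the backward scan yields [研究, 生命, 起源]. The disagreement marks the span as a candidate ambiguity, which a later stage (in the thesis, bigram-based omni-segmentation) would then resolve. The sketch indexes by Java char, so it assumes text without surrogate pairs.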
Keywords/Search Tags: word segmentation, dictionary, ambiguity reduction, unregistered words, rule library