Improvement And Implementation Of Chinese Word Segmentation Algorithm Based On Dictionary

Posted on: 2017-06-10; Degree: Master; Type: Thesis
Country: China; Candidate: J Y Gu
GTID: 2428330488976109; Subject: Software engineering
Abstract/Summary:
Chinese word segmentation is the process of dividing a sequence of Chinese characters into a reasonable sequence of words according to a specific standard. As a fundamental task in natural language processing, it is widely used in related fields, so research on Chinese word segmentation algorithms has both theoretical and practical significance.

To meet the practical requirements of upper-layer applications, this thesis proposes a method that combines mechanical word segmentation with statistical word segmentation. First, mechanical word segmentation, which is much faster than statistical word segmentation, produces the initial segmentation. Ambiguities are detected with an improved forward-backward maximum matching ambiguity detection method and resolved by omni-segmentation based on a bigram model. Second, unregistered words are recognized by a method based on role labeling for named entity recognition. Finally, a rule library is introduced to further correct the segmentation results.

The research work of this thesis is as follows.

A dictionary structure with a secondary index is adopted to speed up access, and Java object serialization is used both to serialize the dictionary object and to load (deserialize) the dictionary file.

For unregistered word recognition, the forward Viterbi algorithm, which solves the decoding problem of the Hidden Markov Model (HMM), is first used for POS tagging and role labeling. Then, within the set of role models, pattern-string maximum matching is applied to identify Chinese proper nouns. Finally, to further improve segmentation accuracy, a small correction rule library is introduced to repair word segmentation "fragmentation".

For ambiguity detection, an improved FBMM (forward-backward maximum matching) detection algorithm is proposed. It detects not only ambiguities with an odd chain length but also all even-chain ambiguities whose crossing section has length one. Omni-segmentation is adopted to resolve the detected ambiguities.

Most current Chinese word segmentation packages are developed in C++. Java, although one of the mainstream programming languages, is used comparatively rarely. On the basis of the word segmentation algorithm above, a Java-based Chinese word segmentation system is therefore designed and implemented in this thesis.
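
The ambiguity detection step builds on forward and backward maximum matching. The sketch below illustrates only the plain, textbook form of that idea: segment the text in both directions and flag it as ambiguous when the two results differ. It is not the improved FBMM algorithm of the thesis, whose details the abstract does not give. The class name `MaxMatchSketch`, the method names, the flat `Set<String>` dictionary (instead of the secondary-index structure) and the `maxLen` parameter are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

// Hypothetical illustration class; not taken from the thesis.
class MaxMatchSketch {

    // Forward maximum matching: scan left to right and take the longest
    // dictionary word (up to maxLen characters) starting at each position.
    public static List<String> forward(String text, Set<String> dict, int maxLen) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int end = Math.min(text.length(), i + maxLen);
            // Shrink the window until it is a dictionary word or one character.
            while (end > i + 1 && !dict.contains(text.substring(i, end))) {
                end--;
            }
            out.add(text.substring(i, end));
            i = end;
        }
        return out;
    }

    // Backward maximum matching: the same idea, scanning right to left.
    public static List<String> backward(String text, Set<String> dict, int maxLen) {
        List<String> out = new ArrayList<>();
        int j = text.length();
        while (j > 0) {
            int start = Math.max(0, j - maxLen);
            while (start < j - 1 && !dict.contains(text.substring(start, j))) {
                start++;
            }
            out.add(text.substring(start, j));
            j = start;
        }
        Collections.reverse(out); // words were collected from the end
        return out;
    }

    // Simplest possible detector: the two directions disagree.
    public static boolean isAmbiguous(String text, Set<String> dict, int maxLen) {
        return !forward(text, dict, maxLen).equals(backward(text, dict, maxLen));
    }

    public static void main(String[] args) {
        // The literals are written as Unicode escapes so the file compiles
        // under any source encoding; the note after the block spells them out.
        Set<String> dict = Set.of("\u7814\u7a76", "\u7814\u7a76\u751f",
                                  "\u751f\u547d", "\u8d77\u6e90");
        String text = "\u7814\u7a76\u751f\u547d\u8d77\u6e90";
        System.out.println(forward(text, dict, 3));
        System.out.println(backward(text, dict, 3));
        System.out.println(isAmbiguous(text, dict, 3));
    }
}
```

In `main`, the text is 研究生命起源 and the toy dictionary holds 研究, 研究生, 生命 and 起源. The forward pass yields 研究生 / 命 / 起源, the backward pass yields 研究 / 生命 / 起源, and the disagreement marks the span as a crossing ambiguity. A detector of this simple kind misses ambiguities on which both directions happen to agree, which is the sort of gap an improved detection method would need to close.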
Keywords/Search Tags: dictionary, word segmentation, unregistered words, ambiguity reduction, rule library