Research And Implementation Of Chinese Word Segmentation Based On The Combination Of Statistics And Dictionary

Posted on: 2016-05-31 | Degree: Master | Type: Thesis
Country: China | Candidate: Q Zhou | Full Text: PDF
GTID: 2308330503451180 | Subject: Information and Communication Engineering
Abstract/Summary:
Chinese word segmentation is an important research direction in Chinese natural language processing. Segmentation speed directly affects downstream applications, and segmentation accuracy directly affects the research built on top of it, so a successful Chinese word segmentation method should be both accurate and fast. Because of the complexity of the Chinese language itself, achieving accurate and fast segmentation has long been an open problem in Chinese natural language processing. This thesis first introduces the current state of research on Chinese word segmentation and its range of applications, compares several common segmentation methods and analyzes their advantages and disadvantages, and examines the main difficulties in current research. Dictionary-based mechanical word segmentation can segment Chinese text quickly, but its shortcoming is obvious: it is limited by the dictionary, so its segmentation accuracy, its ability to handle ambiguity, and its ability to deal with new words cannot reach satisfactory results.
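To make the limitation of dictionary-based mechanical segmentation concrete, here is a minimal sketch of forward maximum matching, one common mechanical method. The function name `forward_max_match`, the `max_len` parameter, and the toy dictionary are assumptions made for illustration; they are not taken from the thesis.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy left-to-right segmentation: at each position, take the
    longest dictionary word; fall back to a single character."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Toy dictionary (hypothetical, for illustration only).
toy_dict = {"研究", "研究生", "生命", "起源", "的"}
result = forward_max_match("研究生命的起源", toy_dict)
print(result)
```

On this input the greedy match takes 研究生 first and leaves 命 stranded, although the intended reading is 研究 / 生命 / 的 / 起源. Matching is fast, but the output is only as good as the dictionary and the matching direction allow.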
A conditional random field (CRF) model can discover and identify new words, while bidirectional maximum matching and the t-test handle ambiguous segmentations well. This thesis therefore proposes a Chinese word segmentation method that combines statistics with a dictionary, bringing together the CRF's strength in new word recognition with the ambiguity resolution and segmentation speed of the t-test and bidirectional maximum matching. The decoding part of the CRF algorithm is improved: mechanical word segmentation is first applied to the full text, and a threshold value and a standard template are then used to judge whether a candidate is a new word, which also speeds up decoding on average. Because the dictionary is an important part of statistical word segmentation, this thesis also proposes a hash-based dictionary organization. The hash-based dictionary needs less space to store the same number of words, is a more efficient structure, and supports both word lookup during mechanical segmentation and quick insertion of newly discovered words. The Bakeoff international Chinese word segmentation evaluation corpus is used for training and experiments. The experiments show that the method is effective and can carry out the new word discovery task. A side-by-side comparison of the evaluation metrics with those of other methods shows that the method is feasible and effective, with faster segmentation and higher segmentation accuracy.
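The following is a minimal sketch of bidirectional maximum matching over a hash-based word store, under stated assumptions. Python's built-in `set` (a hash table) stands in for the thesis's hash dictionary, whose actual layout is not described in the abstract. The tie-breaking rule used here (fewer tokens, then fewer single-character tokens, then the backward result) is a common heuristic, not the t-test procedure the thesis uses. All function and variable names are hypothetical.

```python
def max_match(text, dictionary, max_len=4, reverse=False):
    """Maximum matching, scanning left-to-right or right-to-left."""
    tokens = []
    if not reverse:
        i = 0
        while i < len(text):
            for size in range(min(max_len, len(text) - i), 0, -1):
                word = text[i:i + size]
                if size == 1 or word in dictionary:
                    tokens.append(word)
                    i += size
                    break
    else:
        j = len(text)
        while j > 0:
            for size in range(min(max_len, j), 0, -1):
                word = text[j - size:j]
                if size == 1 or word in dictionary:
                    tokens.append(word)
                    j -= size
                    break
        tokens.reverse()  # restore reading order
    return tokens


def bidirectional_max_match(text, dictionary, max_len=4):
    """Return (tokens, ambiguous). A disagreement between the two scan
    directions flags the span as ambiguous."""
    fwd = max_match(text, dictionary, max_len)
    bwd = max_match(text, dictionary, max_len, reverse=True)
    if fwd == bwd:
        return fwd, False

    def cost(tokens):
        # Assumed heuristic: fewer tokens, then fewer single characters.
        return (len(tokens), sum(1 for t in tokens if len(t) == 1))

    best = fwd if cost(fwd) < cost(bwd) else bwd
    return best, True


# Toy hash-based dictionary (hypothetical entries).
words = {"研究", "研究生", "生命", "起源", "的"}
segmentation, ambiguous = bidirectional_max_match("研究生命的起源", words)

# A newly discovered word can be added in place and used immediately.
words.add("区块链")
with_new_word, ambiguous_after_add = bidirectional_max_match("研究区块链", words)
```

In this sketch the two scan directions disagree on the first sentence, so the span is flagged as ambiguous and the backward result is preferred. In the combined method described above, such flagged spans are where a statistical test would be applied instead of a fixed heuristic.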
Keywords/Search Tags: Chinese word segmentation, new word discovery, ambiguity processing, conditional random fields