Research On Chinese Word Segmentation Algorithm Based On Hash And CRF

Posted on: 2018-07-11
Degree: Master
Type: Thesis
Country: China
Candidate: F Cao
GTID: 2348330533959279
Subject: Software engineering
Abstract/Summary:
With the rapid development of modern information technology and its applications, information and data are growing at an unprecedented rate, and in China the volume of Chinese-language information and literature is growing just as quickly. This places higher demands on the processing and use of Chinese information. In information processing, for English and Chinese alike, word segmentation is one of the most basic and important tasks. In a search engine, for example, segmenting the content is often the most critical basic step, and whether the search terms are segmented correctly directly affects the search results. Chinese and English segmentation differ: the word is the basic unit of English, whereas the character is the basic unit of Chinese and Chinese word boundaries are not explicitly marked, which makes the study of Chinese word segmentation all the more important. In addition, Chinese word segmentation is characterized by complexity and multiple ambiguity, so many scholars are committed to improving its quality.

This thesis analyzes the characteristics of Chinese word segmentation algorithms and proposes two methods: a hash-based forward backtracking algorithm, which addresses the high complexity caused by using backtracking to resolve ambiguity, and a combination method that improves named entity recognition by addressing the nested-entity problem in CRF-based named entity recognition as well as the problems of translated foreign words and new internet words. The main work is as follows.

(1) A forward backtracking algorithm based on hashing is proposed. Building on a hash dictionary and a backtracking mechanism, the algorithm uses a new scanning method to look up words and solves the longest-match problem with the hash dictionary. For detecting and handling ambiguity, an end-mark check is added to address the doubling of match operations caused by the backtracking mechanism; compared with other backtracking methods, this reduces the time complexity.

(2) A combination algorithm is proposed that combines CRF-based named entity recognition with forward maximum matching to improve the recognition of named entities. Based on the CRF model, the algorithm builds feature templates from basic features, entity-list features, boundary features and combined features, and selects the template according to the experimental results. To address inaccurate recognition of English names, new internet terms, and organization names that are nested because of the small observation window, a dictionary of common foreign names, a dictionary of new internet words and a 5-word organization name dictionary are built so that dictionary matching can be combined with the CRF model. Rules are then used to correct the segmentation results, improving accuracy and recall.

(3) To verify the feasibility of the proposed algorithms, a prototype Chinese word segmentation system was designed and implemented in Java, following object-oriented design, on the Eclipse development platform. The final experimental results show that the segmentation results are satisfactory.
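
The general idea behind forward maximum matching over a hash dictionary, with a backtracking step for ambiguity, can be illustrated with a minimal sketch. This is not the thesis's algorithm: the abstract does not specify its scanning method or its end-mark check, and the prototype was written in Java. The sketch uses Python for brevity, and the function names (`matches_at`, `forward_max_match`, `segment_with_backtrack`), the toy dictionary and the "stranded character" test are all assumptions made for illustration.

```python
from typing import List, Set


def matches_at(text: str, start: int, dictionary: Set[str], max_len: int) -> List[str]:
    """Dictionary words starting at `start`, longest first, then the single character as fallback."""
    longest = min(max_len, len(text) - start)
    # A Python set is a hash table, so each membership test is O(1) on average.
    found = [text[start:start + n] for n in range(longest, 1, -1)
             if text[start:start + n] in dictionary]
    return found + [text[start]]


def forward_max_match(text: str, dictionary: Set[str], max_len: int) -> List[str]:
    """Plain forward maximum matching: always take the longest match."""
    tokens, i = [], 0
    while i < len(text):
        word = matches_at(text, i, dictionary, max_len)[0]
        tokens.append(word)
        i += len(word)
    return tokens


def is_stranded(text: str, pos: int, dictionary: Set[str], max_len: int) -> bool:
    """True if `pos` can only be covered by a single character that is not a dictionary entry."""
    if pos >= len(text):
        return False
    best = matches_at(text, pos, dictionary, max_len)[0]
    return len(best) == 1 and best not in dictionary


def segment_with_backtrack(text: str, dictionary: Set[str], max_len: int) -> List[str]:
    """Forward matching that backs off to a shorter match when the longest one strands the next character."""
    tokens, i = [], 0
    while i < len(text):
        candidates = matches_at(text, i, dictionary, max_len)
        word = candidates[0]  # default: keep the longest match
        for cand in candidates:
            if not is_stranded(text, i + len(cand), dictionary, max_len):
                word = cand
                break
        tokens.append(word)
        i += len(word)
    return tokens


# Toy dictionary (hypothetical); max_len is the length of its longest entry.
toy_dict = {"研究", "研究生", "生命", "起源"}
print(forward_max_match("研究生命起源", toy_dict, 3))       # ['研究生', '命', '起源']
print(segment_with_backtrack("研究生命起源", toy_dict, 3))  # ['研究', '生命', '起源']
```

In this example the greedy matcher takes "研究生" and leaves "命" stranded, while the backtracking variant falls back to "研究" so that "生命" can match. A real system would need a more careful ambiguity test than this one-character lookahead.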
Keywords/Search Tags: Word Segmentation, Forward Maximum Matching, Hash, Backtracking, CRF