The Research And Implementation Of The System For Chinese Word Segmentation Base On Dictionary And Statistic

Posted on: 2011-02-03 | Degree: Master | Type: Thesis
Country: China | Candidate: H B Li | Full Text: PDF
GTID: 2178360305981874 | Subject: Computer application technology
Abstract/Summary:
Chinese word segmentation technology follows three directions: segmentation based on dictionaries, segmentation based on statistics, and segmentation based on understanding. Because the last direction is not yet mature, most systems adopt a strategy that combines dictionaries and statistics. In most of these systems, however, the two are isolated: the dictionary is the basis of mechanical word segmentation, and statistics are used only to resolve ambiguous words and unregistered words.

The method in this thesis integrates the dictionary and the statistics: the results of the statistics become inputs to the dictionary. For ambiguous words and unregistered words, the essence of the algorithm is to first extend the dictionary by statistical methods and then apply mechanical word segmentation. Because our time and expertise are limited, we restrict the research to Chinese word segmentation for the domain of computer science.

Overall, the system has three features. It is specific, targeting the domain of computer science. It is efficient, because the core of the algorithm is based on string matching. It is highly accurate, because we combine a simple statistical model with mechanical segmentation to handle ambiguous words and unregistered words.

The key technologies comprise three parts.

First, the design of the dictionary. In the overall structure, the dictionaries are divided into two: the core dictionary and the temporary dictionary. The temporary dictionary is a container that holds new words identified by statistical methods before they are transferred to the core dictionary. The core dictionary is the system's only standard and uses a double hash structure.

Second, the statistical strategy. Identification of ambiguous words and new words relies on a statistics-based approach; we use the principle of mutual information to compute word frequency statistics. The statistical model is simple, easy to implement, and of strong practical value.

Third, the application of the mechanical method. To simplify the system structure and improve efficiency, the core module uses the reverse maximum matching algorithm, chosen according to the characteristic of Chinese that the focus of a sentence falls toward the end and the rule of "long words first".

Overall, once the system has been initialized, its accuracy remains at 97%. After a certain amount of statistical learning, accuracy can improve by nearly 1 percentage point, while efficiency does not change significantly.
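The two-stage idea described above (extend the dictionary statistically, then segment mechanically) can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function names `pmi_candidates` and `reverse_max_match`, the thresholds `min_count` and `min_pmi`, the restriction to two-character candidates, and the use of a plain Python set in place of the double hash dictionary are all assumptions made for illustration.

```python
import math
from collections import Counter


def pmi_candidates(corpus, min_count=2, min_pmi=1.0):
    """Return adjacent character pairs with high pointwise mutual information.

    Hypothetical helper: the thresholds and the two-character limit are
    illustrative assumptions, not values taken from the thesis.
    """
    chars = Counter()
    pairs = Counter()
    for sentence in corpus:
        chars.update(sentence)
        pairs.update(sentence[i:i + 2] for i in range(len(sentence) - 1))
    n_chars = sum(chars.values())
    n_pairs = sum(pairs.values())
    found = set()
    for pair, count in pairs.items():
        if count < min_count:
            continue  # too rare to trust the estimate
        p_xy = count / n_pairs
        p_x = chars[pair[0]] / n_chars
        p_y = chars[pair[1]] / n_chars
        # PMI = log2( P(xy) / (P(x) * P(y)) )
        if math.log2(p_xy / (p_x * p_y)) >= min_pmi:
            found.add(pair)
    return found


def reverse_max_match(text, dictionary, max_len=4):
    """Segment text by scanning from the end and preferring the longest match."""
    words = []
    end = len(text)
    while end > 0:
        # Try the longest window first ("long words first").
        for size in range(min(max_len, end), 0, -1):
            piece = text[end - size:end]
            # A single character is always accepted as a fallback.
            if size == 1 or piece in dictionary:
                words.append(piece)
                end -= size
                break
    words.reverse()  # words were collected from right to left
    return words
```

For example, with the dictionary {"研究", "研究生", "生命", "的", "起源"} and `max_len=3`, `reverse_max_match("研究生命的起源", ...)` yields 研究 / 生命 / 的 / 起源, whereas a forward scan would first take 研究生. Merging the output of `pmi_candidates` into the dictionary before segmenting plays the role the abstract assigns to the temporary dictionary.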
Keywords/Search Tags:dictionary, statistic, unregistered words, ambiguous words