The Research And Implementation Of The System For Chinese Word Segmentation Base On Dictionary And Statistic

Posted on:2011-02-03

Degree:Master

Type:Thesis

Country:China

Candidate:H B Li

Full Text:PDF

GTID:2178360305981874

Subject:Computer application technology

Abstract/Summary:

Chinese word segmentation techniques fall into three families: dictionary-based, statistics-based, and understanding-based. Because the understanding-based approach is not yet mature, most systems combine a dictionary with statistics. In most of these systems, however, the two are kept separate: the dictionary underpins mechanical word segmentation, while statistics handle the difficult cases of ambiguous words and unregistered (out-of-dictionary) words.

The method in this thesis integrates the dictionary and the statistics. The results of the statistics become inputs to the dictionary. For ambiguous and unregistered words, the essence of the algorithm is to first extend the dictionary by statistical means and then apply mechanical word segmentation. Because of limited time and expertise, we restrict the research to Chinese word segmentation in the domain of computer science.

Overall, the system has three features. It is specific: it targets the domain of computer science. It is efficient: the core algorithm is based on string matching. It is accurate: we combine a simple statistical model with mechanical word segmentation to handle ambiguous and unknown words.

The key technologies comprise three parts.

First, the design of the dictionary. The dictionary is divided into two structures, the core dictionary and the temporary dictionary. The temporary dictionary is the container through which new words found by statistical methods are passed to the core dictionary. The core dictionary is the system's only standard for segmentation and uses a double hash structure.

Second, the statistical strategy. Identification of ambiguous words and new words relies on a statistics-based approach; we use the principle of mutual information to compute word-frequency statistics. The statistical model is simple, easy to implement, and of strong practical value.

Third, the application of the mechanical method. To simplify the system structure and improve efficiency, the core module uses the reverse maximum matching algorithm, chosen because the focus of a Chinese phrase tends to fall toward its end and because of the "long words first" rule.

Overall, once the system has been initialized, accuracy stays at about 97%. After a certain amount of statistical learning, accuracy improves by nearly 1 percentage point, while efficiency does not change significantly.
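The pipeline described in the abstract (extend the dictionary with mutual-information statistics, then segment with reverse, i.e. "converse", maximum matching) can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names, the restriction to two-character candidates, the thresholds `min_count` and `min_pmi`, the `max_len` limit, and the use of a Python `set` in place of the double hash dictionary are all assumptions made for illustration.

```python
import math
from collections import Counter


def reverse_maximum_match(text, dictionary, max_len=4):
    """Segment text from right to left, trying the longest candidate first."""
    words = []
    end = len(text)
    while end > 0:
        # "Long words first": start from the longest window ending at `end`.
        for size in range(min(max_len, end), 0, -1):
            candidate = text[end - size:end]
            # A single character is always accepted as the fallback.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                end -= size
                break
    words.reverse()
    return words


def mutual_information(corpus):
    """Return (PMI score, raw count) tables for adjacent character pairs."""
    chars = Counter()
    pairs = Counter()
    for sentence in corpus:
        chars.update(sentence)
        pairs.update(sentence[i:i + 2] for i in range(len(sentence) - 1))
    n_chars = sum(chars.values())
    n_pairs = sum(pairs.values())
    scores = {}
    for pair, count in pairs.items():
        p_xy = count / n_pairs
        p_x = chars[pair[0]] / n_chars
        p_y = chars[pair[1]] / n_chars
        scores[pair] = math.log2(p_xy / (p_x * p_y))
    return scores, pairs


def extend_dictionary(core, corpus, min_count=2, min_pmi=1.0):
    """Add frequent, strongly associated pairs to a copy of the core dictionary.

    The thresholds are hypothetical; the abstract does not state any.
    """
    scores, pairs = mutual_information(corpus)
    temporary = {
        pair for pair, score in scores.items()
        if pairs[pair] >= min_count and score >= min_pmi and pair not in core
    }
    return core | temporary
```

With a core dictionary of `{"研究", "研究生", "生命", "起源"}`, `reverse_maximum_match("研究生命起源", ...)` yields `["研究", "生命", "起源"]`, because matching from the right picks up "生命" before "研究生" can be considered. With a core dictionary of `{"计算", "存储", "服务"}` and the toy corpus `["云端计算", "云端存储", "云端服务"]`, `extend_dictionary` adds the unregistered word "云端", so "云端计算" is then segmented as `["云端", "计算"]` instead of `["云", "端", "计算"]`.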

Keywords/Search Tags:

dictionary, statistic, unregistered words, ambiguous words

Related items

1	Improvement And Implementation Of Chinese Word Segmentation Algorithm Based On Dictionary
2	Chinese Word Segmentation Method Based On Dictionary And Statistics Of The Words
3	Research And Implementation Of Chinese Word Segmentation Algorithm
4	Preliminary Study On Statistic Of The Kazak Word Based On Corpus
5	Research On Scene Classification Of LDA Based On Visual Dictionary Capacity Automatic Obtaining
6	Construction Of Visual Dictionary Based On Bag-of-Words
7	Research On Multi-modal Data Processing Methods Of Network Public Opinion Involving Unregistered Words
8	Word sense disambiguation and context
9	Research And Implement Of Chinese Word Segment Techniques Based On The Conditional Random Field
10	Analysis Of The Expansion Of Private Words To Public Words Space From Emotion Report In The Media