A Chinese Word Level Segmentation Algorithm Based On Document Category

Posted on:2013-04-07

Degree:Master

Type:Thesis

Country:China

Candidate:Y Zhao

Full Text:PDF

GTID:2248330374997910

Subject:Computer technology

Abstract/Summary:

With the informatization of society, government organizations, companies, and schools all prefer to store and back up their documents electronically. As the volume of electronic information grows rapidly, it becomes difficult for people to extract useful information from it, so information processing is necessary, and word segmentation is the foundation of Chinese information processing. After 10 years of research and development, Chinese word segmentation technology has made great progress, but each existing algorithm has its own disadvantages, such as a low accuracy rate. Moreover, because of the complexity of Chinese, no existing algorithm strikes a good balance between speed and accuracy.

To improve the accuracy of Chinese word segmentation systems, this paper studies the current state of Chinese word segmentation, the commonly used segmentation algorithms, and a variety of dictionary structures, and on that basis introduces an improved algorithm: a reverse matching segmentation algorithm based on a dual-array dictionary. The dual-array dictionary structure not only inherits the character-by-character matching of the TRIE index tree, but also saves space and improves query efficiency. In addition, statistics show that, under the same conditions, reverse maximum matching achieves a higher accuracy rate than forward maximum matching. The improved algorithm therefore combines the advantages of the dual-array and reverse matching, and the experimental results show that it also achieves a higher speed.

To provide a good application environment for the improved algorithm, this paper designs a document category-based Chinese word-level segmentation system.
Segmentation models generally do not take document category into account, but knowledge management applications, whose categories are numerous, complex, and highly specialized, need a more targeted segmentation approach. The document category-based Chinese word segmentation system model in this paper consists of an input layer, a classification layer, a segmentation layer, and a data layer. The data layer contains four kinds of dictionaries, among them the basic information dictionary, the specialized dictionary, and the core dictionary; this work uses the specialized dictionary. The specialized dictionary is domain-specific, occupies little space, is highly flexible, and is easy to update, so unknown words can be added promptly. Based on the category information carried by the text, the system selects the corresponding specialized dictionary for segmentation, which effectively improves the segmentation accuracy for specialized vocabulary, as the experiments also confirm.
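
The two ideas in the abstract, reverse maximum matching and choosing a specialized dictionary by document category, can be illustrated with a minimal Python sketch. This is not the thesis implementation: a plain Python set stands in for the dual-array dictionary, merging the specialized entries with the core dictionary is an assumption, and the function names, the max_word_len parameter, and the sample data are all hypothetical.

```python
def reverse_maximum_match(text, dictionary, max_word_len=4):
    """Segment `text` right to left, preferring the longest dictionary word."""
    words = []
    end = len(text)
    while end > 0:
        # Try the longest candidate ending at `end` first, then shrink it.
        for size in range(min(max_word_len, end), 0, -1):
            candidate = text[end - size:end]
            # A single character is always accepted so the scan cannot stall.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                end -= size
                break
    words.reverse()  # words were collected from the end of the text
    return words


def segment_by_category(text, category, core_dictionary, specialized_dictionaries):
    """Pick the specialized dictionary for `category`, then segment."""
    # Assumption: specialized entries are merged with the core dictionary;
    # an unknown category falls back to the core dictionary alone.
    specialized = specialized_dictionaries.get(category, set())
    return reverse_maximum_match(text, core_dictionary | specialized)


# Hypothetical toy data, not taken from the thesis.
core = {"患者", "服用", "研究", "研究生", "生命", "起源", "的"}
specialized = {"medicine": {"阿司匹林"}}

print(reverse_maximum_match("研究生命的起源", core))
# ['研究', '生命', '的', '起源']
print(segment_by_category("患者服用阿司匹林", "medicine", core, specialized))
# ['患者', '服用', '阿司匹林']
print(segment_by_category("患者服用阿司匹林", "news", core, specialized))
# ['患者', '服用', '阿', '司', '匹', '林']
```

In the first call, scanning from the end of the sentence yields 研究 / 生命, whereas a forward longest-match scan over the same toy dictionary would take 研究生 first. The last two calls show how the specialized term is kept whole only when the matching category dictionary is selected.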

Keywords/Search Tags:

Chinese word segmentation, dual-array, reverse matching algorithm, word segmentation dictionary, document classification

Related items

1	Reverse Backtracking Research Of Chinese Segmentation Based On Last Word Dictionary
2	Research And Implementation Of Chinese Word Segmentation Algorithm
3	The Research And Implemenation Of The Chinese Word Segmentation System Combining Omini-Segmentation With Statistic
4	The Research And Implemenation Of The Chinese Word Segmentation System Combining Omini-segmentation With Statistic
5	A Dictionary And Statistics-based Chinese Word Segmentation Algorithm
6	Chemical Dictionary Of Structural Design And Development Of Chinese Word Segmentation System
7	The Research And Implementation Of Automatic Chinese Word Segmentation System
8	The Research Of Chinese Word Segmentation Algorithm Based On Dictionary And Probability Statistics
9	Improvement And Implementation Of Chinese Word Segmentation Algorithm Based On Dictionary
10	Chinese Word Segmentation Method Based On Dictionary And Statistics Of The Words