
A Chinese Word Level Segmentation Algorithm Based On Document Category

Posted on: 2013-04-07  Degree: Master  Type: Thesis
Country: China  Candidate: Y Zhao  Full Text: PDF
GTID: 2248330374997910  Subject: Computer technology
Abstract/Summary:
As society becomes increasingly digitized, government organizations, companies, and schools all prefer to store and back up their documents electronically. As the volume of electronic information grows rapidly, it becomes difficult to find useful information in the mass of data, so information processing is necessary, and word segmentation is the foundation of Chinese information processing. After 10 years of research and development, Chinese word segmentation technology has made great progress, but each existing algorithm has its own disadvantages, such as a low accuracy rate. Because of the complexity of Chinese, no existing algorithm achieves a good trade-off between speed and accuracy.
To improve the accuracy of Chinese word segmentation, this paper studies the current state of the field, the commonly used segmentation algorithms, and a variety of dictionary structures, and then introduces an improved algorithm: a reverse matching segmentation algorithm based on a dual-array. The dual-array dictionary serves as the dictionary structure; it inherits the character-by-character matching of the TRIE index tree while also saving space and improving query efficiency. In addition, statistics show that reverse maximum matching is more accurate than forward maximum matching under the same conditions. The improved algorithm therefore combines the advantages of the dual-array and reverse matching, and the experimental results show that it achieves a higher speed.
To provide a good application environment for the improved algorithm, this paper designs a document category-based Chinese word level segmentation system.
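The contrast between forward and reverse maximum matching mentioned above can be illustrated with a minimal sketch. It is not the thesis implementation: a plain Python `set` stands in for the dual-array dictionary, and the function names, the `max_len` parameter, and the tiny dictionary are all hypothetical. The sample sentence is one commonly cited ambiguous case, so it illustrates the idea rather than supporting the accuracy statistics.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Scan left to right, always taking the longest dictionary word."""
    words = []
    start = 0
    while start < len(text):
        # Try the longest candidate starting at `start`, then shrink it.
        for size in range(min(max_len, len(text) - start), 0, -1):
            candidate = text[start:start + size]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                start += size
                break
    return words


def reverse_max_match(text, dictionary, max_len=4):
    """Scan right to left, always taking the longest dictionary word."""
    words = []
    end = len(text)
    while end > 0:
        # Try the longest candidate ending at `end`, then shrink it.
        for size in range(min(max_len, end), 0, -1):
            candidate = text[end - size:end]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                end -= size
                break
    words.reverse()  # words were collected from the end of the text
    return words


# Hypothetical toy dictionary; a set replaces the dual-array structure here.
toy_dictionary = {"研究", "研究生", "生命", "的", "起源"}
sentence = "研究生命的起源"

forward_result = forward_max_match(sentence, toy_dictionary)
reverse_result = reverse_max_match(sentence, toy_dictionary)
print(forward_result)  # ['研究生', '命', '的', '起源']
print(reverse_result)  # ['研究', '生命', '的', '起源']
```

In this example the forward scan commits to 研究生 too early and strands 命 as a single character, while the reverse scan recovers 研究 / 生命. Membership tests on the set play the role that dictionary lookups in the dual-array would play in the described algorithm.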
In general, segmentation models do not take document category into account, but a knowledge management application system, whose categories are numerous, complex, and highly specialized, needs a more targeted segmentation style. The document category-based Chinese word segmentation system model in this paper is composed of an input layer, a classification layer, a segmentation layer, and a data layer. The data layer holds four kinds of dictionaries, including the basic information dictionary, the specialized dictionary, and the core dictionary; this work uses the specialized dictionary. The specialized dictionary is domain-specific, occupies little space, is highly flexible, and is easy to update, so unknown words can be added promptly. Using the category information carried by the text, the system chooses the corresponding specialized dictionary for segmentation, which effectively improves the accuracy of specialized vocabulary segmentation, as the experiments also confirm.
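The category-driven dictionary selection described above might look roughly like the following sketch. Everything in it is an assumption made for illustration: the category names, the dictionary contents, the `segment_by_category` function, and in particular the choice to merge the specialized dictionary with a core dictionary, which the abstract does not specify. The classification layer is not modeled; the category is simply passed in, as if it were carried by the text.

```python
# Hypothetical data layer: one shared core dictionary plus one small
# specialized dictionary per document category.
CORE_DICTIONARY = {"支持", "向量", "算法"}
SPECIALIZED_DICTIONARIES = {
    "computer": {"支持向量机"},
    "medicine": {"心电图"},
}


def reverse_max_match(text, dictionary, max_len=5):
    """Scan right to left, always taking the longest dictionary word."""
    words = []
    end = len(text)
    while end > 0:
        for size in range(min(max_len, end), 0, -1):
            candidate = text[end - size:end]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                end -= size
                break
    words.reverse()
    return words


def segment_by_category(text, category):
    """Pick the specialized dictionary for `category`, then segment."""
    # Unknown categories fall back to the core dictionary alone.
    specialized = SPECIALIZED_DICTIONARIES.get(category, set())
    # Assumption: specialized entries are merged with the core entries.
    return reverse_max_match(text, CORE_DICTIONARY | specialized)


with_category = segment_by_category("支持向量机算法", "computer")
without_category = segment_by_category("支持向量机算法", "unknown")
print(with_category)     # ['支持向量机', '算法']
print(without_category)  # ['支持', '向量', '机', '算法']
```

With the matching specialized dictionary the technical term 支持向量机 stays in one piece; without it the term is broken into general-purpose words. Because each specialized dictionary is a small separate set, adding a new term for one category does not touch the others, which mirrors the flexibility the abstract attributes to specialized dictionaries.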
Keywords/Search Tags: Chinese word segmentation, dual-array, reverse matching algorithm, word segmentation dictionary, document classification