
Chinese Word Segmentation System Based On Statistics

Posted on: 2011-07-22
Degree: Master
Type: Thesis
Country: China
Candidate: X L Li
Full Text: PDF
GTID: 2178360305988621
Subject: Computer application technology
Abstract/Summary:
With the rapid development of information technology, Chinese information processing has advanced significantly across many areas of computing. Chinese word segmentation is the foundation of Chinese information processing: it is the intermediate step that connects raw sentences to the information processing platform. The quality of segmentation therefore directly affects the accuracy of Chinese information processing, and it has become a bottleneck for the processing capability of such platforms.
This thesis reviews the state of automatic Chinese word segmentation, including its principles, processes, evaluation metrics, and development in China and abroad. After studying the main word segmentation algorithms and analyzing their advantages and disadvantages, the author proposes several improvements. A new Chinese Word Segmentation System Based on Statistics (CWSSBS) is built using support vector machines together with the vector space model, because a support vector machine can build a complex classification model from a limited number of training samples and achieve a strong self-learning ability. An inverted dictionary is used so that commonly used words and the newest unregistered (out-of-vocabulary) words are kept at the highest priority. As a result, the improved CWSSBS is considerably better at learning unregistered words automatically. The dictionary's self-learning function, driven by the support vector machine, also makes the system highly adaptable to unfamiliar environments and highly portable. A combined manual and machine monitoring mechanism allows errors in automatic learning to be corrected promptly. For ambiguity handling, the system uses an improved ambiguity detection method that combines forward matching with reverse matching.
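The combination of forward and reverse matching mentioned above can be illustrated with a minimal sketch. It assumes plain forward and backward maximum matching against a word list and flags a sentence as ambiguous when the two results disagree. The thesis's improved variant is not described in the abstract, so the function names, the toy `LEXICON`, and the `max_len` parameter are all hypothetical.

```python
# Toy lexicon, invented for illustration only (not taken from the thesis).
LEXICON = {"研究", "研究生", "生命", "起源"}


def forward_max_match(text, lexicon, max_len=4):
    """Scan left to right, always taking the longest lexicon word available."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or piece in lexicon:
                words.append(piece)
                i += size
                break
    return words


def backward_max_match(text, lexicon, max_len=4):
    """Scan right to left, always taking the longest lexicon word available."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in lexicon:
                words.append(piece)
                j -= size
                break
    words.reverse()  # words were collected from the end of the sentence
    return words


def is_ambiguous(text, lexicon, max_len=4):
    """Flag the text when the two scanning directions disagree."""
    return forward_max_match(text, lexicon, max_len) != backward_max_match(
        text, lexicon, max_len
    )


print(forward_max_match("研究生命起源", LEXICON))   # ['研究生', '命', '起源']
print(backward_max_match("研究生命起源", LEXICON))  # ['研究', '生命', '起源']
print(is_ambiguous("研究生命起源", LEXICON))        # True
```

Disagreement between the two directions is only a detection signal: it says that the sentence contains an ambiguity, not which segmentation is correct.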
During ambiguity resolution, a longest-word-first rule is applied so that ambiguities are eliminated as far as possible. Simulation results show that, compared with the original system, the improved CWSSBS is substantially better at resolving ambiguities and at dictionary self-learning. However, owing to limits of time and experimental conditions, further research and improvement are still needed.
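One possible reading of the longest-word rule is sketched below. It assumes that the candidate segmentations of an ambiguous sentence are already available and prefers the one with fewer words, breaking ties by fewer single-character words. This scoring is an assumption made for illustration; the abstract does not specify the exact rule used in the thesis, and the function names are hypothetical.

```python
def longest_word_score(words):
    """Lower is better: fewer words first, then fewer single-character words."""
    single_chars = sum(1 for w in words if len(w) == 1)
    return (len(words), single_chars)


def pick_segmentation(candidates):
    """Return the candidate segmentation favoured by the longest-word heuristic.

    When candidates tie, the first one in the list is returned.
    """
    return min(candidates, key=longest_word_score)


candidates = [
    ["研究生", "命", "起源"],  # e.g. the result of forward matching
    ["研究", "生命", "起源"],  # e.g. the result of reverse matching
]
print(pick_segmentation(candidates))  # ['研究', '生命', '起源']
```

Because all candidates cover the same characters, fewer words means a longer average word length, which is why the word count serves as the primary key here.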
Keywords/Search Tags:statistical word segmentation, support vector machines, intelligent learning