The Research On Chinese Word Segmentation System Based On SVM

Posted on: 2008-07-04
Degree: Master
Type: Thesis
Country: China
Candidate: X J Zhu
Full Text: PDF
GTID: 2178360215985900
Subject: Traffic Information Engineering & Control
Abstract/Summary:
Statistical learning theory (SLT) studies statistical laws and learning methods. It provides a sound theoretical framework and has produced a new general-purpose learning algorithm, the support vector machine (SVM). Building on the theory, methods, and applications of SVM, this dissertation focuses on applying SVM to Chinese word segmentation. The main contents and contributions are as follows:
1. Through a study of SVM and of SVM classification, the requirements on the simple input are determined and the kernel function and its parameters are selected. Based on an analysis of how misclassified samples are distributed during SVM classification, and by combining SVM with other classification methods such as KNN, a classifier with higher accuracy is proposed. This also addresses the problem of kernel function selection when SVM is used as a classifier, and the approach can be applied in various fields.
2. SVM is applied to Chinese word segmentation based on word frequency statistics. Input Chinese sentences are segmented into output character strings, usually two-character word strings, and a dictionary is created. The dictionary stores each word together with the frequency with which it appears in the processed texts. Mutual information is used as the statistic. Compared with traditional word segmentation methods, the SVM-based method improves segmentation accuracy and shows a certain degree of stability.
3. Building on SVM, the dissertation combines it with the KNN algorithm to further segment ambiguous words in Chinese segmentation, targeting the samples that are prone to errors. This improves classification efficiency. In addition, mutual information, N-Gram, and the t-test are used to represent ambiguous words. After analyzing their effect on segmentation accuracy, the better-performing representation is used to improve the accuracy of SVM.
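The mutual information statistic mentioned in the abstract can be illustrated with a small sketch. The abstract does not state the formula used in the thesis, so this assumes the common pointwise definition, MI(x, y) = log2(P(xy) / (P(x) P(y))), estimated from raw counts over adjacent characters. The function name and the toy corpus are hypothetical and are not taken from the thesis.

```python
import math
from collections import Counter


def bigram_mutual_information(corpus):
    """Return {two-character string: pointwise mutual information}.

    Assumption: probabilities are plain relative frequencies over the
    given sentences, with no smoothing.
    """
    char_counts = Counter()
    pair_counts = Counter()
    for sentence in corpus:
        char_counts.update(sentence)
        # Every pair of adjacent characters is a candidate two-character word.
        pair_counts.update(sentence[i:i + 2] for i in range(len(sentence) - 1))

    total_chars = sum(char_counts.values())
    total_pairs = sum(pair_counts.values())

    scores = {}
    for pair, count in pair_counts.items():
        p_xy = count / total_pairs
        p_x = char_counts[pair[0]] / total_chars
        p_y = char_counts[pair[1]] / total_chars
        scores[pair] = math.log2(p_xy / (p_x * p_y))
    return scores


# Hypothetical toy corpus: three short unsegmented strings.
toy_corpus = ["中国人民", "中国银行", "人民银行"]
scores = bigram_mutual_information(toy_corpus)
```

In this toy corpus, a pair that recurs as a unit (such as 中国) scores higher than a pair that only arises where two words meet (such as 国人). A score of this kind could serve as one input feature for a classifier that decides where word boundaries fall.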
Keywords/Search Tags: SVM, K-nearest neighbour algorithm, Chinese word segmentation, word frequency statistics, mutual information, N-Gram, t-test