Study On Recognition Of Chinese Proper Noun

Posted on: 2007-02-23 | Degree: Master | Type: Thesis
Country: China | Candidate: T T Mao | Full Text: PDF
GTID: 2178360212957107 | Subject: Computer software and theory
Abstract/Summary:
Chinese proper noun recognition is an important technique for improving the accuracy of word segmentation. The main task of this paper is to study and implement an effective approach to extracting proper nouns from Chinese texts.

Based on a review and analysis of current identification methods for Chinese proper nouns, this paper builds a model based on the support vector machine (SVM) to identify Chinese proper nouns and presents four methods for improving SVM performance: first, an algorithm combining SVM with a statistical method; second, a modified SVM and K-nearest-neighbors (KNN) algorithm; third, a modified SVM algorithm; and fourth, a cluster SVM algorithm.

An analysis of the classification results obtained by SVM alone shows that the test samples it misclassifies lie mostly near the decision plane. To increase the accuracy of SVM, a hybrid model combining SVM with a statistical approach is proposed: in the region near the decision plane, the statistical method classifies the samples instead of SVM, and in the region far from the decision plane, SVM is used.

A modified SVM-KNN classifier, which combines SVM with a modified KNN, is presented along the same lines: different classifiers are applied to test samples depending on where they lie in the feature space. To fit the unbalanced data, the classic KNN classifier is modified.

Because the training set is unbalanced (the negative samples are significantly outnumbered by the positive ones), which degrades the performance of SVM, a modified SVM classifier for identifying Chinese proper nouns is proposed. An algorithm called boundary movement is used to modify the SVM.

A cluster SVM algorithm is also proposed to reduce the classification errors caused by the unequal numbers of the two kinds of samples in the training set. In this algorithm, the training set is clustered using kernel-based K-means clustering, and a model is then trained with the SVM algorithm on the clustered training set.

In this paper, first, according to the characteristics of Chinese proper nouns, the words in the texts are segmented and assigned part-of-speech (POS) tags, and a training set is constructed by extracting feature vectors. Second, four Chinese proper noun recognition models are built based on the four methods above. Lastly, the final identification results of the testing...
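The hybrid scheme described in the abstract (a statistical method near the decision plane, SVM elsewhere) can be illustrated with a minimal sketch. This is not the thesis's implementation: the linear decision function, the margin threshold, and the stand-in statistical rule are all hypothetical, chosen only to show how a sample is routed to one classifier or the other.

```python
from typing import Callable, Sequence

Vector = Sequence[float]


def linear_decision_value(weights: Vector, bias: float, x: Vector) -> float:
    """Signed score w.x + b; a larger magnitude means farther from the plane."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def hybrid_classify(
    x: Vector,
    decision_value: Callable[[Vector], float],
    fallback: Callable[[Vector], int],
    margin: float = 1.0,
) -> int:
    """Return +1 or -1 for sample x.

    Samples whose score lies within `margin` of the decision plane are
    handed to `fallback` (standing in for the statistical method);
    all other samples take the sign of the SVM-style score.
    """
    score = decision_value(x)
    if abs(score) < margin:
        return fallback(x)
    return 1 if score > 0 else -1


# Hypothetical trained parameters (assumption, not taken from the thesis).
WEIGHTS = (1.0, -1.0)
BIAS = 0.0


def svm_score(x: Vector) -> float:
    return linear_decision_value(WEIGHTS, BIAS, x)


def statistical_rule(x: Vector) -> int:
    """Stand-in for a statistical classifier: thresholds the first feature."""
    return 1 if x[0] >= 0.5 else -1


far_positive = hybrid_classify((3.0, 0.0), svm_score, statistical_rule)
far_negative = hybrid_classify((0.0, 3.0), svm_score, statistical_rule)
near_plane_a = hybrid_classify((0.6, 0.5), svm_score, statistical_rule)
near_plane_b = hybrid_classify((0.4, 0.2), svm_score, statistical_rule)
```

In this toy setup the last sample has a small positive score, so an SVM used alone would label it +1, while the hybrid rule defers to the fallback and labels it -1. How the margin width and the statistical method are chosen is not stated in the abstract and would have to come from the full text.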
Keywords/Search Tags: Chinese Proper Noun, Statistical Method, Modified SVM-KNN, Modified SVM, Clustering