Study Of Text Classification Algorithm Base On Clustering Algorithm And Support Vector Machine Algorithm

Posted on: 2013-11-20
Degree: Master
Type: Thesis
Country: China
Candidate: W Liu
GTID: 2248330362971812
Subject: Computer application technology
Abstract/Summary:
With the rapid development of Internet technology, the amount of text information is increasing rapidly. Extracting this information effectively is a central problem in information processing, and the main approach at present is text classification. Designing an effective classification algorithm is the key to achieving both fast speed and good results. Research on text classification algorithms is therefore of great significance, and text classification is the main subject of this paper.

After reviewing domestic and international research on text classification, this paper introduces methods for word segmentation, text feature extraction, and text representation. Common clustering and classification algorithms are then analyzed in detail, with the focus on the k-nearest neighbor and support vector machine algorithms. The main work is as follows:

First, based on an in-depth study of the k-nearest neighbor algorithm, a new method for handling the classification boundary problem of k-nearest neighbor is proposed. The training set is first trimmed using the support vector data description algorithm to obtain a new data set. A standard deviation function is then used to judge whether the new training set is still imbalanced. If the imbalance persists, a shrinkage factor is introduced to shrink the class, and the same shrinkage factor is used to improve the decision function of k-nearest neighbor. Experiments show that the proposed method effectively solves the boundary problem in k-nearest neighbor text classification and achieves higher recall, precision, and F1 value.

Second, after a detailed study of multi-class support vector machines in text classification, a new method is proposed to solve the imbalance and dead-zone problems in multi-class support vector machines. The method has two advantages: it minimizes the region in which data cannot be classified correctly by the one-versus-rest support vector machine, and it addresses the imbalance of samples. First, the k-means method is used to cluster the training set. Second, for each text that has not been clustered correctly, the one-versus-rest method is used to generate two-class classifiers. Then, the data that cannot be classified by one-versus-rest are trained again using the one-versus-one method, which reduces the unclassifiable region and keeps the samples balanced. Experiments show that the new method is more effective than the traditional one-versus-rest method.

This paper studies the main algorithms for text classification and presents an improved k-nearest neighbor algorithm and an improved support vector machine algorithm. Experiments show that the new methods effectively improve the classification results and lay a good foundation for further study.
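
The dead zone mentioned in the second contribution can be illustrated with a minimal sketch. It assumes that the decision scores of the one-versus-rest and one-versus-one classifiers are already available as plain dictionaries; the function names, labels, and score values are hypothetical and are not taken from the thesis. The k-means step and the training of the classifiers are not shown. A sample counts as unclassifiable here when zero or more than one one-versus-rest classifier claims it, and only such samples are passed to one-versus-one voting.

```python
from collections import Counter
from itertools import combinations


def ovr_predict(scores):
    """Return the label claimed by exactly one one-versus-rest classifier.

    scores maps each label to that classifier's decision value.
    Returns None when no classifier or several classifiers give a
    positive value (the dead zone).
    """
    positive = [label for label, value in scores.items() if value > 0]
    return positive[0] if len(positive) == 1 else None


def ovo_predict(pairwise, labels):
    """Majority vote over one-versus-one classifiers.

    pairwise maps a pair (a, b), in the order produced by
    combinations(labels, 2), to a decision value: positive votes
    for a, anything else votes for b. Ties go to the label that
    received its first vote earliest.
    """
    votes = Counter()
    for a, b in combinations(labels, 2):
        votes[a if pairwise[(a, b)] > 0 else b] += 1
    return votes.most_common(1)[0][0]


def classify(scores, pairwise, labels):
    """Use one-versus-rest first; fall back to one-versus-one voting."""
    label = ovr_predict(scores)
    return label if label is not None else ovo_predict(pairwise, labels)


# Hypothetical labels and decision values, for illustration only.
labels = ["sports", "finance", "tech"]
pairwise = {
    ("sports", "finance"): -0.2,
    ("sports", "tech"): 0.7,
    ("finance", "tech"): 0.5,
}
clear = {"sports": 1.2, "finance": -0.8, "tech": -0.5}
ambiguous = {"sports": 0.4, "finance": 0.3, "tech": -0.9}

print(classify(clear, pairwise, labels))      # sports
print(classify(ambiguous, pairwise, labels))  # finance
```

In this sketch the second sample is claimed by two one-versus-rest classifiers, so it is resolved by pairwise voting instead of being left unclassified.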
Keywords/Search Tags: text classification, k-nearest neighbor algorithm, k-means method, support vector machine