Study Of Text Classification Algorithm Based On Clustering Algorithm And Support Vector Machine Algorithm

Posted on:2013-11-20

Degree:Master

Type:Thesis

Country:China

Candidate:W Liu

Full Text:PDF

GTID:2248330362971812

Subject:Computer application technology

Abstract/Summary:

With the rapid development of Internet technology, the amount of text information is increasing rapidly. Extracting useful information from this text effectively is a central problem in information processing, and text classification is currently the main approach. Designing an effective classification algorithm is the key to achieving both high speed and good results, so research on text classification algorithms is of great significance. Text classification is the main subject of this thesis.

After a review of domestic and international research on text classification, methods for word segmentation, text feature extraction and text representation are introduced. Common clustering and classification algorithms are then analyzed in detail, with a focus on the k-nearest neighbor and support vector machine algorithms. The main work is as follows:

First, based on an in-depth study of the k-nearest neighbor algorithm, a new method for handling the classification boundary problem of k-nearest neighbor is proposed. The training set is first trimmed using the support vector data description algorithm to obtain a new data set. A standard deviation function is then used to judge whether the new training set is still imbalanced. If it is, a shrinkage factor is introduced to shrink the class, and the same shrinkage factor is used to improve the k-nearest neighbor decision function. Experiments show that the proposed method effectively solves the boundary problem in k-nearest neighbor text classification and achieves higher recall, precision and F1 value.

Second, after a detailed study of multi-class support vector machines in text classification, a new method is proposed to address the sample imbalance and the dead zone in the one-versus-rest multi-class support vector machine. The method has two advantages: it minimizes the region in which data cannot be classified correctly by the one-versus-rest support vector machine, and it resolves the imbalance of samples. First, the k-means method is used to cluster the training set. Second, for each text that has not been clustered correctly, the one-versus-rest method is used to generate two-class classifiers. Then the data that cannot be classified by one-versus-rest are trained again using the one-versus-one method, which reduces the unclassifiable region and keeps the samples in balance. Experiments show that the new method is more effective than the traditional one-versus-rest method.

In summary, this thesis studies the main algorithms for text classification and presents an improved k-nearest neighbor algorithm and an improved support vector machine algorithm. Experiments show that the new methods effectively improve the classification results and lay a good foundation for further study.
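The fallback idea in the second contribution (decide with one-versus-rest first, and hand samples left in the dead zone to one-versus-one) can be illustrated with a minimal sketch. This is not the thesis's implementation: the function name `ovr_with_ovo_fallback`, the use of signed decision scores, the `ovo_decide` callback, and the choice to vote only among the classes that claimed the sample are all assumptions made for illustration. The k-means clustering step and the classifier training are omitted.

```python
from collections import Counter
from itertools import combinations


def ovr_with_ovo_fallback(ovr_scores, ovo_decide):
    """Pick a label from one-versus-rest scores, falling back to
    one-versus-one voting when the sample lies in the dead zone.

    ovr_scores: dict mapping each label to the signed decision value of
                that label's one-versus-rest classifier (hypothetical input).
    ovo_decide: callable (label_a, label_b) -> winning label, standing in
                for a trained pairwise classifier (hypothetical input).
    """
    # Classes whose one-versus-rest classifier accepts the sample.
    positives = [label for label, score in ovr_scores.items() if score > 0]

    # Exactly one class claims the sample: one-versus-rest is unambiguous.
    if len(positives) == 1:
        return positives[0]

    # Dead zone: no class, or several classes, claim the sample.
    # Assumption: vote among the claiming classes, or among all classes
    # if none claimed it.
    candidates = positives if positives else list(ovr_scores)
    votes = Counter(ovo_decide(a, b) for a, b in combinations(candidates, 2))
    return votes.most_common(1)[0][0]


# Toy pairwise rule used only for this example: the label with the
# higher made-up preference value wins each duel.
preference = {"sports": 1, "finance": 3, "tech": 2}


def toy_ovo(a, b):
    return a if preference[a] > preference[b] else b


clear_case = ovr_with_ovo_fallback(
    {"sports": 0.8, "finance": -0.3, "tech": -0.5}, toy_ovo)
rejected_by_all = ovr_with_ovo_fallback(
    {"sports": -0.2, "finance": -0.1, "tech": -0.4}, toy_ovo)
claimed_by_two = ovr_with_ovo_fallback(
    {"sports": 0.5, "finance": 0.2, "tech": -0.1}, toy_ovo)
```

In the first call only one classifier accepts the sample, so its label is returned directly. The other two calls fall into the dead zone and are resolved by pairwise voting.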

Keywords/Search Tags:

text classification, k-nearest neighbor algorithm, k-means method, support vector machine

Related items

1	Study On Generalized Nearest Neighbor Pattern Classification
2	Research On Improved K Neighbor Support Vector Machine Algorithm Faced Text Classification
3	Studies On Classification Method Of Celestial Spectral
4	Improved K- Nearest Neighbor Classification
5	Research On Text Representation And Classification Based On Machine Learning Algorithm
6	Nearest Neighbor Classification Improved Algorithm
7	Improved Word Embedding And K-nearest Neighbor Algorithm For Chinese Text Classification
8	Research On Robust Large Margin Classification Learning
9	Research On Text Classification Algorithms Based On Machine Learning
10	Improved K-nearest Neighbor Algorithm And Its Application In Text Analysis