
Study On Text Classification Algorithms Based On SVMs

Posted on: 2009-01-29
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y P Qin
GTID: 1118360272470431
Subject: Computer application technology
Abstract/Summary:
Support vector machines (SVMs), a machine learning method based on statistical learning theory, have attracted growing attention and become an active research topic in machine learning, because they handle practical problems such as nonlinearity, high dimensionality and local minima well. Text categorization is a key technique in content-based automatic information management. Text vectors are high dimensional and extremely sparse, and contain many relevant features. SVMs are particularly well suited to text categorization, because they are not sensitive to relevant features and sparse data and have advantages in dealing with high dimensional problems. However, text categorization involves large numbers of classes and training examples, so several research issues remain open when SVMs are applied to it, such as incremental learning, multi-label classification, and slow training and classification. This dissertation focuses on these drawbacks of SVMs in practical applications, including text categorization. The main work is as follows:
1. Multi-label classification algorithms for SVMs are studied. For training sets with more samples and fewer classes, a multi-label categorization algorithm based on the one-against-one (1-a-1) method is presented. The algorithm uses the 1-a-1 method to train fuzzy sub-classifiers. For a sample to be classified, the sub-classifiers are used to obtain a membership matrix, and the row sums of the membership matrix are then used to determine the classes of the sample. For training sets with fewer samples and more classes, a multi-label categorization algorithm based on the one-against-rest (1-a-r) method is presented. The algorithm uses the 1-a-r method to train fuzzy sub-classifiers. For a sample to be classified, the sub-classifiers are used to obtain a membership vector, which is then used to determine the classes of the sample. For training sets with more samples and more classes, a multi-label categorization algorithm based on hyper-spheres is presented. For every class, a hyper-sphere that contains most samples of the class is trained. For a sample to be classified, the distances from it to the centre of every hyper-sphere are used to determine the classes of the sample. Experimental results indicate that the algorithms achieve better performance on multi-label classification.
2. Incremental learning algorithms for SVMs are studied. A weighted class incremental learning algorithm is presented, which improves the CIL algorithm by adding class weights to the training samples. Experimental results indicate that, compared with the CIL algorithm, the method increases the precision of classes with fewer samples without decreasing classification speed. In addition, a new class incremental learning algorithm based on hyper-sphere SVMs is presented. The hyper-spheres of the new classes are trained, and the existing hyper-spheres whose classes appear in the new incremental samples are retrained. Class incremental learning is thus carried out with a small training set and a small memory footprint, while the earlier training results are preserved. The algorithm is suitable for both single-label and multi-label training sets, and it is easy to improve and extend. Experimental results indicate that the algorithm performs well in training speed, classification speed and precision.
3. A fast classification algorithm for SVMs is studied. Several existing methods for reducing the support vector set are analyzed. A method for reducing the support vector set is then presented, which improves the FCSVM algorithm. The method uses a bisection (dichotomy) search to select a subset of support vectors. After a transformation on the full set of support vectors, the subset is used in classification. Experimental results indicate that, compared with the FCSVM algorithm, the method reduces the number of support vectors to the greatest extent and increases the classification speed of SVMs without decreasing accuracy.
4. Fast training algorithms for fuzzy SVMs are studied. For training sets with a large number of samples, a working set selection method using the maximal violating pair is proposed for training fuzzy SVMs. In addition, a working set selection method using second order information is proposed for training fuzzy SVMs. Experimental results indicate that both methods achieve fast training of fuzzy SVMs. Of the two, the latter is far better than the former, especially when the number of training samples is large.
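
The hyper-sphere scheme described in items 1 and 2 can be illustrated with a minimal sketch. It is not the dissertation's algorithm: the abstract does not say how each hyper-sphere is trained, so the sketch assumes a simplified stand-in in which the centre is the class mean and the radius is a quantile of the distances to that centre. The class name HyperSphereClassifier, the method fit_class and the parameter coverage are hypothetical names chosen for this illustration. A sample receives every label whose sphere contains it; if no sphere contains it, the sketch falls back to the class with the smallest distance relative to its radius, which is also an assumption.

```python
import numpy as np


class HyperSphereClassifier:
    """One hyper-sphere per class; a sample gets every label whose sphere contains it."""

    def __init__(self, coverage=0.9):
        # coverage: fraction of a class's samples the sphere should enclose
        # (hypothetical parameter; stands in for a real hyper-sphere SVM trade-off).
        self.coverage = coverage
        self.spheres = {}  # label -> (centre, radius)

    def fit_class(self, label, samples):
        """Train, or retrain, the sphere of one class without touching the others."""
        samples = np.asarray(samples, dtype=float)
        centre = samples.mean(axis=0)  # assumption: centroid, not an optimized centre
        dists = np.linalg.norm(samples - centre, axis=1)
        radius = float(np.quantile(dists, self.coverage))
        self.spheres[label] = (centre, radius)

    def predict(self, x):
        """Return the sorted list of labels assigned to sample x."""
        x = np.asarray(x, dtype=float)
        # Distance to each centre divided by the radius: <= 1 means inside the sphere.
        rel = {
            label: float(np.linalg.norm(x - centre)) / max(radius, 1e-12)
            for label, (centre, radius) in self.spheres.items()
        }
        inside = sorted(label for label, d in rel.items() if d <= 1.0)
        if inside:
            return inside
        # Assumed fallback: no sphere contains x, so pick the relatively closest class.
        return [min(rel, key=rel.get)]


# Two overlapping classes: a point in the overlap receives both labels.
clf = HyperSphereClassifier(coverage=0.9)
clf.fit_class("A", [[0, 0], [1, 0], [-1, 0], [0, 1], [0, -1]])
clf.fit_class("B", [[1.5, 0], [2.5, 0], [0.5, 0], [1.5, 1], [1.5, -1]])
print(clf.predict([0.75, 0.0]))

# Class increment: only the sphere of the new class "C" is trained.
clf.fit_class("C", [[10, 0], [11, 0], [9, 0], [10, 1], [10, -1]])
print(clf.predict([10.0, 0.0]))
```

Because each class owns an independent sphere, adding a class or retraining a class that appears in new samples is a single fit_class call that leaves all other spheres as they were. This mirrors the small training set and small memory footprint claimed for the class incremental algorithm.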
Keywords/Search Tags: Support Vector Machines, Text Categorization, Multi-Label Classification, Incremental Learning, Hyper Sphere