Research Of Feature Selection And Weighting Algorithm In Text Classification System Based On SVM

Posted on:2012-12-20

Degree:Master

Type:Thesis

Country:China

Candidate:J L Duan

Full Text:PDF

GTID:2178330332490726

Subject:Computer application technology

Abstract/Summary:

In recent years, with the rapid development of Internet technologies, society has moved from an era of information scarcity to an information-rich digital age. Faced with such a large volume of electronic information, how to deliver useful information to users in a short time has become a research hotspot. Text classification, which helps users locate the information they need quickly and accurately, is a key technology in data mining. As the basis for information filtering and search engines, it also has broad development prospects and can bring substantial economic benefits to society.

The task of text classification is to automatically assign text objects to predefined categories according to their content. It comprises six modules: text preprocessing, feature selection, feature weighting, text representation, classifier training and testing, and performance evaluation. Feature selection and feature weighting are critical to text classification. The main contributions of this thesis are as follows:

(1) The purpose of feature selection is to select, from the original high-dimensional feature space, the features that contribute most to classification and to use them to represent the text. When measuring the correlation between feature items and categories, classic feature selection algorithms consider only the discriminative ability of each feature item and ignore the correlation among feature items. As a result, synonymous feature items are selected together, and classification accuracy is reduced. To address this shortcoming, this thesis proposes a new feature selection algorithm, the combined feature selection algorithm. First, the algorithm selects the feature items that contribute most to classification using the weight of evidence; second, it removes redundant feature items using the mutual information method.

(2) The classical TF-IDF weighting method considers only term frequency and inverse document frequency, and ignores the influence of the category distribution and location distribution of feature items on classification. This thesis therefore proposes an improved TF-IDF algorithm that extends classical TF-IDF weighting with the category distribution and location distribution of feature items.

(3) Compared with other machine learning algorithms, the SVM method has strong generalization ability and good convergence, which makes it well suited to text classification. Finally, this thesis designs and implements an SVM-based Chinese text classification system, which lays a good foundation for the study of automatic text classification and performance evaluation. Experiments on this platform verify that the combined feature selection algorithm and the improved weight calculation method improve text classification performance.
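
The two-stage feature selection described in (1) can be illustrated with a minimal sketch. The abstract does not give the thesis's formulas, so everything concrete here is an assumption: the scoring uses a common textbook form of weight of evidence for text, the redundancy check uses pointwise mutual information between term co-occurrences, and the names `weight_of_evidence`, `pointwise_mi`, `select_features`, `k`, and `mi_threshold` are hypothetical, as is the toy corpus.

```python
import math
from collections import Counter

EPS = 1e-6  # smoothing so that odds never hit 0 or infinity (an assumption)

def clamp(p):
    """Keep a probability strictly inside (0, 1)."""
    return min(max(p, EPS), 1 - EPS)

def weight_of_evidence(docs, labels):
    """Score each term by P(t) * sum_c P(c) * |log odds-ratio of c given t|.

    docs is a list of term sets, labels the matching list of categories.
    """
    n = len(docs)
    class_count = Counter(labels)
    term_count = Counter(t for d in docs for t in d)
    term_class = Counter((t, c) for d, c in zip(docs, labels) for t in d)
    scores = {}
    for t, n_t in term_count.items():
        total = 0.0
        for c, n_c in class_count.items():
            p_c = clamp(n_c / n)
            p_c_t = clamp(term_class[(t, c)] / n_t)
            odds_ratio = (p_c_t * (1 - p_c)) / (p_c * (1 - p_c_t))
            total += p_c * abs(math.log(odds_ratio))
        scores[t] = (n_t / n) * total
    return scores

def pointwise_mi(docs, a, b):
    """Pointwise mutual information of two terms co-occurring in a document."""
    n = len(docs)
    n_a = sum(1 for d in docs if a in d)
    n_b = sum(1 for d in docs if b in d)
    n_ab = sum(1 for d in docs if a in d and b in d)
    if n_ab == 0:
        return float("-inf")  # never co-occur, so not redundant
    return math.log((n_ab / n) / ((n_a / n) * (n_b / n)))

def select_features(docs, labels, k, mi_threshold):
    """Stage 1: keep the k best terms by weight of evidence.
    Stage 2: drop a term if it co-occurs too strongly with a kept term."""
    scores = weight_of_evidence(docs, labels)
    candidates = sorted(scores, key=lambda t: (-scores[t], t))[:k]
    kept = []
    for t in candidates:
        if all(pointwise_mi(docs, t, s) < mi_threshold for s in kept):
            kept.append(t)
    return kept

# Toy corpus (hypothetical): "football"/"soccer" and "stock"/"market"
# always appear together, so one term of each pair is redundant.
docs = [
    {"football", "soccer", "goal"},
    {"football", "soccer", "match"},
    {"stock", "market", "goal"},
    {"stock", "market", "bank"},
]
labels = ["sport", "sport", "finance", "finance"]
selected = select_features(docs, labels, k=4, mi_threshold=0.5)
print(selected)
```

In this toy corpus the four highest-scoring terms form two always-co-occurring pairs, and the second stage keeps only one term from each pair. The threshold and the choice of pointwise rather than averaged mutual information are illustrative choices, not details confirmed by the abstract.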

Keywords/Search Tags:

text classification, combination of feature selection algorithm, improved TF-IDF algorithm, category distribution, support vector machine

Related items

1	Research On Text Classification Based-on Support Vector Machine
2	Research On Text Emotion Classification Based On Improved Feature Selection Method
3	Chinese Text Classification Based On Svm Algorithm Realization
4	Research On Improved K Neighbor Support Vector Machine Algorithm Faced Text Classification
5	A Study Of Subject Web Classification Algorithm Based On Machine Learning
6	Research On Text Classification Algorithm Based On Support Vector Machine And Neural Network
7	Research On Text Classification System Based On Support Vector Machine
8	Chinese Text Classification Algorithm
9	The Design And Application Of SSVM's Text Classification Based On Feature Selection Optimization
10	Research On Algorithm Of Support Vector Machine Text Classification Based On Improved Density Clustering