
Research Of Feature Selection And Weighting Algorithm In Text Classification System Based On SVM

Posted on: 2012-12-20
Degree: Master
Type: Thesis
Country: China
Candidate: J L Duan
Full Text: PDF
GTID: 2178330332490726
Subject: Computer application technology
Abstract/Summary:
In recent years, the rapid development of Internet technologies has moved us from an era of information scarcity into an information-rich digital age. With so much electronic information available, delivering useful information to users quickly has become a research hotspot. Text classification, which helps users locate the information they need quickly and accurately, is a key technology in data mining. As the basis for information filtering and search engines, it also has broad development prospects and can bring substantial economic benefits to society.

The task of text classification is to automatically assign text objects to predefined categories according to their content. It comprises six modules: text preprocessing, feature selection, feature weighting, text representation, classifier training and testing, and performance evaluation. Feature selection and feature weighting are critical to text categorization. The main contributions of this thesis are as follows:

(1) The purpose of feature selection is to select, from the original high-dimensional feature space, the features that contribute most to classification and to use them to represent the text. When measuring the correlation between feature items and categories, classic feature selection algorithms consider only each feature item's ability to discriminate between categories and ignore the correlation among the feature items themselves. As a result, synonymous feature items are selected together, and classification accuracy suffers. To address this shortcoming, this thesis presents a new feature selection algorithm, the combination of feature selection algorithm. The algorithm first uses the weight of evidence to select the feature items that contribute most to classification, and then uses the mutual information method to delete redundant feature items.

(2) The classical TF-IDF weighting method considers only term frequency and inverse document frequency, and ignores the effect that the category distribution and location distribution of feature items have on classification. This thesis therefore presents an improved TF-IDF algorithm that extends classical TF-IDF by incorporating the category distribution and location distribution of feature items.

(3) Compared with other machine learning algorithms, the SVM method has strong generalization ability and good convergence, which makes it well suited to text classification. Finally, this thesis designs and implements an SVM-based Chinese text classification system, which lays a good foundation for the study of automatic text classification and performance evaluation. Experiments on this platform verify that the combination of feature selection algorithm and the improved weight calculation method improve text classification performance.
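
The two-stage selection described in item (1) can be illustrated with a minimal Python sketch. The abstract does not give the thesis's exact formulas, so everything here is an assumption: the weight of evidence is taken in its commonly used text form, WE(t) = P(t) * sum_c P(c) * |log(odds(c|t) / odds(c))|; redundancy is measured as the mutual information between the binary presence indicators of two terms; and the function names, the clipping constant eps, the greedy pruning order, and the threshold value are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def weight_of_evidence(docs, labels, eps=1e-6):
    """Score each term by weight of evidence. docs is a list of term sets."""
    n = len(docs)
    class_count = Counter(labels)
    term_count = Counter()
    term_class_count = Counter()
    for terms, label in zip(docs, labels):
        for t in terms:
            term_count[t] += 1
            term_class_count[(t, label)] += 1

    def odds(p):
        # Clip away from 0 and 1 so the log stays finite (assumed smoothing).
        p = min(max(p, eps), 1.0 - eps)
        return p / (1.0 - p)

    scores = {}
    for t, n_t in term_count.items():
        total = 0.0
        for c, n_c in class_count.items():
            p_c = n_c / n
            p_c_given_t = term_class_count[(t, c)] / n_t
            total += p_c * abs(math.log(odds(p_c_given_t) / odds(p_c)))
        scores[t] = (n_t / n) * total
    return scores


def term_mutual_information(docs, a, b):
    """Mutual information (nats) between the presence of term a and term b."""
    n = len(docs)
    joint = Counter((a in d, b in d) for d in docs)
    p_a = {x: sum(v for (i, _), v in joint.items() if i == x) / n
           for x in (True, False)}
    p_b = {y: sum(v for (_, j), v in joint.items() if j == y) / n
           for y in (True, False)}
    return sum((n_xy / n) * math.log((n_xy / n) / (p_a[x] * p_b[y]))
               for (x, y), n_xy in joint.items())


def select_features(docs, labels, k, redundancy_threshold):
    """Stage 1: rank terms by weight of evidence.
    Stage 2: greedily skip terms that share too much information
    with a term that has already been kept."""
    scores = weight_of_evidence(docs, labels)
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    kept = []
    for t in ranked:
        if len(kept) == k:
            break
        if all(term_mutual_information(docs, t, s) <= redundancy_threshold
               for s in kept):
            kept.append(t)
    return kept


# Toy corpus (invented for illustration): "football"/"soccer" and
# "stock"/"market" always co-occur, and "news" appears in both categories.
docs = [
    {"football", "soccer", "goal"},
    {"football", "soccer", "team"},
    {"goal", "team", "news"},
    {"stock", "market", "bank"},
    {"stock", "market", "loan"},
    {"bank", "loan", "news"},
]
labels = ["sports", "sports", "sports", "finance", "finance", "finance"]

selected = select_features(docs, labels, k=6, redundancy_threshold=0.3)
print(selected)
```

In this toy corpus the two always co-occurring pairs stand in for synonyms: only one term from each pair survives the second stage, and the category-neutral term "news" receives a weight of evidence of zero. Note that presence-based mutual information also flags strongly anti-correlated terms as redundant, so the threshold would need tuning on real data.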
Keywords/Search Tags: text classification, combination of feature selection algorithm, improved TF-IDF algorithm, category distribution, support vector machine