Research on TFIDF-Based Text Classification Algorithms

Posted on: 2007-10-29
Degree: Master
Type: Thesis
Country: China
Candidate: Y Wang
GTID: 2208360185471224
Subject: Computer software and theory
Abstract/Summary:
Text classification is an important branch of data mining. Its task is to automatically process text documents of unknown category and determine which categories they belong to within a predefined set of categories. As the volume of electronic text increases exponentially, applications such as information retrieval and information filtering become both more important and more difficult. Text categorization is an effective solution and has become a practical technique.

This thesis discusses the feature selection techniques and classification algorithms involved in text categorization and studies them thoroughly through experiments.

First, the thesis proposes a new feature selection method, TDF, which is based on TFIDF: it applies the traditional term weighting function TFIDF to feature selection. The new method is evaluated with the kNN and Naive Bayes classification algorithms. Experimental results show that TDF performs well in feature selection.

Second, category information plays an important role in classification. The thesis puts forward the classification algorithm TFIDFICF, which introduces a class frequency factor into the TFIDF algorithm. Experimental results show that the added category information improves categorization performance.

Third, the thesis proposes a novel algorithm, called iterative TFIDFICF, which combines unlabeled data with labeled data to train the TFIDFICF classifier. Experimental results show that this algorithm can...
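The abstract does not give the formulas behind TDF or TFIDFICF, so the following is only a minimal sketch of the general idea: a TFIDF term weight multiplied by a class frequency factor, with terms then ranked by their total weight for feature selection. The function names `tfidf_icf_weights` and `select_top_terms`, the reading of the class factor as an inverse class frequency, the smoothed logarithms, and the sum-based ranking are all assumptions made for illustration, not the thesis's actual definitions.

```python
import math
from collections import Counter, defaultdict


def tfidf_icf_weights(docs, labels):
    """Weight each term in each document by tf * idf * icf.

    docs   -- list of token lists
    labels -- class label of each document, parallel to docs
    The idf and icf forms below are assumed, not taken from the thesis.
    """
    n_docs = len(docs)
    n_classes = len(set(labels))

    doc_freq = Counter()                    # number of documents containing a term
    classes_with_term = defaultdict(set)    # set of classes in which a term occurs
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            doc_freq[term] += 1
            classes_with_term[term].add(label)

    weights = []
    for tokens in docs:
        term_freq = Counter(tokens)
        doc_weights = {}
        for term, count in term_freq.items():
            idf = math.log(1 + n_docs / doc_freq[term])
            # Assumed class factor: terms spread over many classes get less weight.
            icf = math.log(1 + n_classes / len(classes_with_term[term]))
            doc_weights[term] = count * idf * icf
        weights.append(doc_weights)
    return weights


def select_top_terms(weights, k):
    """Rank terms by total weight over all documents and keep the best k (assumed criterion)."""
    totals = Counter()
    for doc_weights in weights:
        totals.update(doc_weights)
    return [term for term, _ in totals.most_common(k)]


if __name__ == "__main__":
    docs = [
        ["ball", "game", "the"],
        ["game", "team", "the"],
        ["stock", "market", "the"],
    ]
    labels = ["sports", "sports", "finance"]
    weights = tfidf_icf_weights(docs, labels)
    print(select_top_terms(weights, 3))
```

In this toy corpus the word "the" occurs in every document and in both classes, so both the document factor and the class factor shrink its weight, while "game" is confined to a single class and keeps a higher weight. A classifier such as kNN or Naive Bayes could then be trained on the selected terms only.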
Keywords/Search Tags: Text categorization, Feature selection, Class frequency, TFIDF algorithm, Cooperative training