Research And Improvement On Feature Selection And Classification Algorithms For Text Classification Based On KNN

Posted on: 2015-01-16
Degree: Master
Type: Thesis
Country: China
Candidate: J J Huang
GTID: 2268330428460110
Subject: Computer software and theory
Abstract/Summary:
With the rapid development of information technology, the volume of text information is increasing exponentially. As an important technology for managing large amounts of information, text classification can effectively bring order to otherwise chaotic information and makes it convenient for users to retrieve the information they need accurately. Consequently, text classification has high application value in information retrieval, mail classification and filtering, topic tracking, and similar fields, and it has become a hot research topic in data mining.

Focusing on improving the performance of the KNN classifier, this dissertation introduces the definition of text categorization, the text preprocessing procedure, the definition and algorithms of feature selection, a comparison of traditional and supervised term weighting, text classification algorithms, and performance measurement. It then studies in depth and improves the methods of feature selection, term weighting, and classification.

(1) This dissertation puts forward an improved feature selection method based on ant colony optimization. By studying and designing the fitness function, the probabilistic transition rule, and the pheromone update rule, the improved method can exclude correlated and redundant features, reduce the space and time cost of computation, and raise accuracy, which in turn improves classification performance.

(2) This dissertation also proposes an improved supervised term weighting method, TF-RFIDF. Building on the theory of the supervised term weighting scheme TF-RF, it combines relevance frequency with inverse document frequency. The method takes advantage of the sample distribution and the prior information about categories, and thus improves classification performance.

(3) This dissertation proposes an improved KNN classification algorithm based on association rules. The Apriori algorithm is used to extract, for each category of training samples, the frequent feature sets and their associated texts, so as to determine an appropriate number of neighbors k for a text of unknown class. The category of the text is then determined according to the categories of its neighbors. The improved algorithm determines the value of k better and reduces time complexity.

Finally, the experimental results show that the three improved algorithms increase classification accuracy for text classification, which demonstrates their effectiveness.
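The abstract does not give the TF-RFIDF formula, so the following Python sketch is only an illustration under stated assumptions. It takes the usual TF-RF relevance frequency, rf = log2(2 + a / max(1, c)), where a and c are the numbers of positive-category and negative-category training documents containing the term, and assumes that TF-RFIDF is the plain product tf × rf × idf with idf = ln(N / df). The classifier is an ordinary cosine-similarity KNN with a fixed k, not the Apriori-based selection of k described in item (3). The function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def rf_idf_factors(docs, labels, positive):
    """Per-term rf * idf factors for one target (positive) category.

    Assumption: TF-RFIDF = tf * rf * idf; the thesis may combine them differently.
    """
    n_docs = len(docs)
    pos_df, neg_df = Counter(), Counter()
    for tokens, label in zip(docs, labels):
        # Document frequency counts each term once per document.
        (pos_df if label == positive else neg_df).update(set(tokens))

    factors = {}
    for term in set(pos_df) | set(neg_df):
        a, c = pos_df[term], neg_df[term]
        rf = math.log2(2 + a / max(1, c))   # relevance frequency (TF-RF)
        idf = math.log(n_docs / (a + c))    # inverse document frequency
        factors[term] = rf * idf
    return factors


def weight(tokens, factors):
    """Sparse TF-RFIDF vector; terms unseen in training are dropped."""
    tf = Counter(tokens)
    return {t: tf[t] * factors[t] for t in tf if t in factors}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def knn_predict(query, train_vecs, train_labels, k=3):
    """Majority vote among the k most similar training vectors."""
    ranked = sorted(zip(train_vecs, train_labels),
                    key=lambda pair: cosine(query, pair[0]),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy corpus (hypothetical data, for illustration only).
docs = [
    ["ball", "goal", "team"],
    ["team", "goal", "match"],
    ["stock", "market", "team"],
    ["market", "bank", "stock"],
]
labels = ["sport", "sport", "finance", "finance"]

factors = rf_idf_factors(docs, labels, positive="sport")
train_vecs = [weight(tokens, factors) for tokens in docs]
print(knn_predict(weight(["goal", "match"], factors), train_vecs, labels, k=3))
```

In this sketch the rf factor raises the weight of terms concentrated in the positive category, while idf lowers the weight of terms that occur in most documents. A multi-class setup would need one set of factors per category, for example in a one-vs-rest arrangement.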
Keywords/Search Tags: Text Classification, Ant Colony Optimization, TF-RFIDF, KNN Algorithm