Design and Implementation of a Data Classification System Based on an Improved KNN Algorithm

Posted on: 2011-10-12  Degree: Master  Type: Thesis
Country: China  Candidate: E B Du  Full Text: PDF
GTID: 2198330338984195  Subject: Communication and Information System
Abstract/Summary:
As the volume of information on the Internet grows, so does the amount of digital text. Managing this text effectively has become one of the major problems in information technology. Data classification improves the speed and accuracy of text retrieval and therefore plays a vital role in managing digital text.
Widely used data classification methods include the Vector Space Model (VSM), K-Nearest Neighbors (KNN), Neural Networks (NNet), Support Vector Machines (SVMs), and Bayesian classifiers. KNN is a simple yet effective algorithm and is widely used.
The thesis first reviews the history and current applications of data classification, then turns to the KNN algorithm and its practical use. Traditional KNN relies on unsupervised term weighting, which ignores class labels and so limits the accuracy of the distance calculation. The thesis improves on the traditional weighting by adopting Chi-Square (χ²) and Information Gain, two approaches that exploit the labels of the training data and thereby improve the accuracy of KNN. The thesis also addresses the heavy computation that KNN requires: it presents a method for generating representatives of the training documents, which reduces the number of comparisons the algorithm makes and so improves the efficiency of the classifier.
Finally, the thesis uses Reuters-21578 for both training and testing, and compares accuracy under unsupervised term weighting (Boolean weighting and TF-IDF) and supervised term weighting (χ² and Information Gain). The results show that supervised term weighting improves the precision of the KNN algorithm.
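The idea of supervised term weighting can be illustrated with a small sketch. This is not the thesis's implementation, which the abstract does not describe in detail. It assumes the standard χ² statistic computed from a term/class contingency table, takes the maximum χ² over classes as the global weight of a term (one common choice), multiplies it by the raw term frequency, and classifies by cosine similarity with a majority vote. All function names and the toy documents are hypothetical.

```python
import math
from collections import Counter


def chi_square(term, label, docs, labels):
    """Chi-square statistic between a term and a class (assumed standard form)."""
    a = b = c = d = 0
    for tokens, y in zip(docs, labels):
        has_term = term in tokens
        if has_term and y == label:
            a += 1          # in class, contains term
        elif has_term:
            b += 1          # not in class, contains term
        elif y == label:
            c += 1          # in class, lacks term
        else:
            d += 1          # not in class, lacks term
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0


def term_weights(docs, labels):
    """Global weight per term: maximum chi-square over all classes (an assumption)."""
    vocab = {t for tokens in docs for t in tokens}
    classes = set(labels)
    return {t: max(chi_square(t, c, docs, labels) for c in classes) for t in vocab}


def vectorize(tokens, weights):
    """Sparse vector: term frequency times the supervised weight."""
    tf = Counter(tokens)
    return {t: n * weights[t] for t, n in tf.items() if t in weights}


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def knn_predict(tokens, train_docs, train_labels, weights, k=3):
    """Majority vote among the k training documents most similar to the query."""
    query = vectorize(tokens, weights)
    scored = sorted(
        ((cosine(query, vectorize(d, weights)), y)
         for d, y in zip(train_docs, train_labels)),
        reverse=True,
    )
    votes = Counter(y for _, y in scored[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy corpus, not taken from Reuters-21578.
train_docs = [
    ["oil", "price", "rise"],
    ["oil", "export", "price"],
    ["wheat", "harvest", "price"],
    ["wheat", "crop", "export"],
]
train_labels = ["crude", "crude", "grain", "grain"]

weights = term_weights(train_docs, train_labels)
prediction = knn_predict(["oil", "export"], train_docs, train_labels, weights)
```

In this toy corpus, "export" appears equally often in both classes and receives a weight of zero, while "oil" and "wheat" each appear in only one class and receive the largest weights. An unsupervised scheme such as TF-IDF cannot make this distinction, because it never sees the labels.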
In addition, experiments show that with the proposed method of generating representatives of the training documents, the classifier has less data to process and therefore runs more efficiently.
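The abstract does not say how the representatives are generated. As a hedged illustration of why representatives reduce computation, the sketch below swaps in a simple stand-in technique: one centroid per class, with the query compared against the centroids instead of every training document. The function names and toy vectors are hypothetical, and the thesis's actual method may differ.

```python
import math
from collections import defaultdict


def centroid(vectors):
    """Average a list of sparse term-weight vectors."""
    total = defaultdict(float)
    for vec in vectors:
        for term, w in vec.items():
            total[term] += w
    return {term: w / len(vectors) for term, w in total.items()}


def build_representatives(vectors, labels):
    """One representative (class centroid) per label; a stand-in technique."""
    by_class = defaultdict(list)
    for vec, y in zip(vectors, labels):
        by_class[y].append(vec)
    return [(centroid(vs), y) for y, vs in by_class.items()]


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def classify(query, representatives):
    """Return the label of the most similar representative."""
    return max(representatives, key=lambda r: cosine(query, r[0]))[1]


# Hypothetical weighted training vectors.
train_vectors = [
    {"oil": 2.0, "price": 1.0},
    {"oil": 1.0, "export": 1.0},
    {"wheat": 2.0, "price": 1.0},
    {"wheat": 1.0, "crop": 1.0},
]
train_labels = ["crude", "crude", "grain", "grain"]

representatives = build_representatives(train_vectors, train_labels)
label = classify({"wheat": 1.0}, representatives)
```

Here four training vectors collapse into two representatives, so each query needs two similarity computations instead of four. On a corpus the size of Reuters-21578 the saving is proportionally larger, at the cost of losing some detail about the shape of each class.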
Keywords/Search Tags:Data classification, KNN, term weighting