Design And Implementation Of A Text Classification System Based On KNN Algorithm

Posted on: 2012-11-09; Degree: Master; Type: Thesis
Country: China; Candidate: F Zhang; Full Text: PDF
GTID: 2248330392958240; Subject: Software engineering
Abstract/Summary:
Since 1995, Web technology has been developing rapidly, and the number of Internet Web pages and service sites has grown exponentially. In 2004, the number of publicly indexable Web (PIW) pages on the Internet was on the order of 10^10, with about 8 million new pages added every day. At the same time, the number of Web servers doubles roughly every 23 weeks. The Web has become an open, dynamic, global information service center and an important means of obtaining information. How to extract the information people are interested in from this large volume of Web content is an important subject in modern information research.

To address the particular demands of Web text mining, a text categorization system is designed and implemented. Its main purpose is to test and measure the performance and accuracy of Web text categorization algorithms. The system consists of two main modules, training and classification. The training module comprises: (1) a Chinese text preprocessing module, which uses the ICTCLAS toolkit developed by the Institute of Computing Technology of the Chinese Academy of Sciences (CAS) to perform word segmentation and preprocessing; (2) a feature selection module, which implements the document frequency (DF), chi-squared, information gain (IG), and mutual information methods; (3) a weight calculation module, which implements TF weighting and TF multiplied by the feature evaluation function value, and builds the vector space model (VSM); and (4) a classifier module, which implements the statistics-based K-nearest-neighbor (KNN) algorithm. The classification module also evaluates the classification results and feeds the evaluation back to the training module, so that the training process improves continuously.

To evaluate the classification accuracy of the KNN classification system, we use Sohu news articles from the Internet for training and classification testing. The corpus covers six categories (education, sports, environment, entertainment, technology, and economy), for a total of 780 texts. We also test an improved KNN algorithm and analyze and compare the accuracy of the algorithms. The experimental data can serve as a reference for information retrieval, information filtering, digital libraries, and Web classification.
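
The pipeline described above (TF weighting, a vector space model, and a KNN classifier) can be illustrated with a minimal sketch. It is not the thesis implementation: the function names, the use of cosine similarity, the plain majority vote, and the toy English corpus are all assumptions made for illustration. The input is assumed to be already segmented into tokens (the step the abstract assigns to ICTCLAS), and feature selection is omitted.

```python
from collections import Counter
import math


def tf_vector(tokens):
    """Map a list of already-segmented tokens to a sparse term-frequency vector."""
    return Counter(tokens)


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts.

    Cosine similarity is an assumed choice; the abstract does not name a metric.
    """
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def knn_classify(query_tokens, training_set, k=3):
    """Return the majority label among the k most similar training documents.

    training_set is a list of (tokens, label) pairs.
    """
    query = tf_vector(query_tokens)
    scored = [(cosine(query, tf_vector(tokens)), label)
              for tokens, label in training_set]
    # Sort by similarity only, highest first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


# Toy corpus invented for illustration; it is not the Sohu news data.
training_set = [
    (["match", "team", "goal", "coach"], "sports"),
    (["team", "league", "goal", "score"], "sports"),
    (["school", "exam", "student", "teacher"], "education"),
    (["student", "university", "exam", "course"], "education"),
]

predicted = knn_classify(["goal", "team", "score"], training_set, k=3)
print(predicted)
```

In this sketch every training document is compared with the query at classification time, which is the defining trait of KNN: there is no separate model-fitting step beyond building the vectors. A fuller version would restrict the vocabulary to the terms kept by feature selection and could weight each neighbor's vote by its similarity.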
Keywords/Search Tags: Data mining, Text classification, KNN, Text mining