Design And Realization Of Text Categorization System

Posted on: 2009-11-08
Degree: Master
Type: Thesis
Country: China
Candidate: Y B Gao
GTID: 2178360242989187
Subject: Software engineering
Abstract/Summary:
With the spread of computers and the continued development of the Internet, electronic documents have accumulated into huge volumes of data. How to manage this data so that users can search it quickly has become a core problem in knowledge discovery in databases (KDD). Text categorization technology offers a series of solutions to this problem. It is an important intelligent information processing technology, with high value in information filtering, information retrieval, text databases, digital libraries, and other areas. This thesis introduces text categorization technology and its related algorithms, and designs and implements a text categorization system. The system is divided into six modules: (1) a text preprocessor, which segments the words in a document and filters out stop words; (2) a frequency statistics module, which counts the frequencies of feature words as required by the various classification and feature selection algorithms; (3) a feature selection module, which implements four feature selection algorithms: information gain (IG), mutual information (MI), cross-entropy (CE), and the χ² statistic; (4) a weight calculation module, which implements the TF and TF-IDF algorithms; (5) a classification module, which implements the naive Bayes (NB) and K-nearest-neighbor (KNN) text classification algorithms; and (6) a classifier evaluation module, which evaluates classification performance in terms of recall, precision, and the F1 value. Three experiments were carried out with this system: the effect of the value of k on classification in the KNN algorithm, the effect of different feature selection algorithms on classification under the KNN algorithm, and a comparison of the NB and KNN classification algorithms. Some conclusions are drawn from the experiments.
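
As a rough illustration of how several of the modules named in the abstract fit together, the following minimal Python sketch computes the χ² statistic for a term and category, builds TF-IDF vectors, classifies with KNN, and reports precision, recall, and F1. It is not the thesis's implementation. All function names are hypothetical, documents are assumed to be already tokenized, and the IDF variant log(N/df) and the use of cosine similarity as the KNN similarity measure are assumptions that the abstract does not specify.

```python
import math
from collections import Counter

def chi_square(docs, labels, term, category):
    """Chi-square statistic from the 2x2 term/category contingency table."""
    a = b = c = d = 0
    for doc, label in zip(docs, labels):
        has_term, in_cat = term in doc, label == category
        if has_term and in_cat:
            a += 1      # term present, document in category
        elif has_term:
            b += 1      # term present, document outside category
        elif in_cat:
            c += 1      # term absent, document in category
        else:
            d += 1      # term absent, document outside category
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0

def fit_idf(docs):
    """IDF per term, assuming the variant log(N / document frequency)."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    return {t: math.log(n / count) for t, count in df.items()}

def tf_idf_vector(doc, idf):
    """Sparse TF-IDF vector; terms unseen during fitting are ignored."""
    tf = Counter(doc)
    return {t: f * idf[t] for t, f in tf.items() if t in idf}

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def knn_predict(query, train_vectors, train_labels, k=3):
    """Majority vote among the k training vectors most similar to query."""
    ranked = sorted(zip(train_vectors, train_labels),
                    key=lambda pair: cosine(query, pair[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

def precision_recall_f1(true_labels, predicted, category):
    """Per-category precision, recall, and F1."""
    pairs = list(zip(true_labels, predicted))
    tp = sum(1 for t, p in pairs if t == category and p == category)
    fp = sum(1 for t, p in pairs if t != category and p == category)
    fn = sum(1 for t, p in pairs if t == category and p != category)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy corpus, invented purely for illustration.
train_docs = [["stock", "market", "price"], ["market", "trade", "bank"],
              ["match", "goal", "team"], ["team", "player", "goal"]]
train_labels = ["finance", "finance", "sports", "sports"]

idf = fit_idf(train_docs)
train_vectors = [tf_idf_vector(doc, idf) for doc in train_docs]
query = tf_idf_vector(["goal", "team", "win"], idf)
print(knn_predict(query, train_vectors, train_labels, k=3))
print(chi_square(train_docs, train_labels, "goal", "sports"))
```

In a fuller pipeline, the χ² scores would typically be used to keep only the highest-scoring terms before the vectors are built, and the value of k would be varied as in the first experiment described above.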
Keywords/Search Tags: text categorization, vector space model, feature selection, weight calculation, naive Bayesian, K nearest neighbor algorithm