
The Research And Implementation Of Text Classification System Based On Classified Text Library

Posted on: 2012-10-22; Degree: Master; Type: Thesis
Country: China; Candidate: W T He; Full Text: PDF
GTID: 2178330332489236; Subject: Computer Science and Technology
Abstract/Summary:
Document representation is the foundation of a text classification system, and the vector space model (VSM) is currently the most widely used model for representing text. Its basic idea is that, after Chinese word segmentation, each feature (term) extracted from a document is treated as one dimension of a coordinate system, so that a document is represented as a vector in the feature space.
Information gain is a very effective feature selection method. It measures the importance of a term by how much information the term contributes to the classification system: the more information a feature contributes, the more important it is.
Bayesian classification is a statistical classification method grounded in probability theory, and it is used here for text classification. In many cases the naive Bayes (Naïve Bayes, NB) classifier performs comparably to decision tree and neural network classifiers. It can also be applied to large databases, and it is simple, accurate, and fast.
The vector space model converts each document into a vector in a high-dimensional space, where each component is the weight of one term. This representation is simple and effective. The information gain of a feature reflects how much information the term brings into the classification system. Although several formulas are involved, the underlying principle is easy to understand.
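The information gain criterion described above can be illustrated with a minimal sketch. It assumes documents are already segmented into token lists and that a term is scored by its presence or absence in a document; the function names (`entropy`, `information_gain`) and the toy data are hypothetical and are not taken from the thesis.

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy (in bits) of a distribution given as raw counts."""
    counts = list(counts)
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(docs, labels, term):
    """IG(term) = H(C) - [P(t) H(C | t) + P(not t) H(C | not t)].

    docs   : non-empty list of token lists (already segmented)
    labels : list of category names, parallel to docs
    """
    with_term, without_term = Counter(), Counter()
    for tokens, label in zip(docs, labels):
        if term in tokens:
            with_term[label] += 1
        else:
            without_term[label] += 1
    n = len(docs)
    n_with = sum(with_term.values())
    n_without = n - n_with
    h_class = entropy(Counter(labels).values())
    h_cond = ((n_with / n) * entropy(with_term.values())
              + (n_without / n) * entropy(without_term.values()))
    return h_class - h_cond

# Toy corpus (illustrative only).
docs = [["goal", "match"], ["match", "team"],
        ["stock", "market"], ["market", "bank"]]
labels = ["sports", "sports", "finance", "finance"]
scores = {t: information_gain(docs, labels, t) for t in ("match", "goal")}
```

In this toy corpus "match" separates the two categories perfectly and therefore receives the maximum gain of 1 bit, while "goal" appears in only one sports document and scores lower. Keeping the highest-scoring terms yields the feature items.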
Feature items selected with this method represent the documents well, and the Bayesian text classification algorithm is simple, efficient, and fast; it improves both the speed and the accuracy of classification, which makes it a very important algorithm for text classification.
Therefore, in the training process we represent each document with the vector space model and then compute the information gain of every term to select the feature items, which completes training. In the classification process, we first represent the document to be classified with the vector space model, and then apply the Bayesian classification model, using the statistics provided by the training module, to assign the document a specific category label.
This thesis describes the selection of training texts, preprocessing, feature selection, model construction, the computation of the probability that a document belongs to each category, and several other important aspects of text classification, as well as the storage of training data and the data structures the system uses to store it. Together these demonstrate that Bayesian text classification is a very efficient classification algorithm and that the selected data structures are effective.
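The training and classification steps described above can be sketched as follows. This is an illustrative outline only: it assumes a multinomial naive Bayes model with add-one (Laplace) smoothing, which the abstract does not specify, and the names `train` and `classify` and the sample data are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train(docs, labels, features):
    """Training module: gather the statistics the classifier needs.

    Returns per-category document counts and per-category term counts,
    restricted to the selected feature items.
    """
    class_docs = Counter(labels)
    term_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        term_counts[label].update(t for t in tokens if t in features)
    return class_docs, term_counts

def classify(tokens, class_docs, term_counts, features):
    """Return the category with the highest log posterior score."""
    n_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label, n in class_docs.items():
        total = sum(term_counts[label].values())
        score = math.log(n / n_docs)  # log prior P(c)
        for t in tokens:
            if t in features:
                # Add-one smoothing (an assumption) avoids log(0).
                p = (term_counts[label][t] + 1) / (total + len(features))
                score += math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy corpus and feature items (illustrative only).
docs = [["goal", "match"], ["match", "team"],
        ["stock", "market"], ["market", "bank"]]
labels = ["sports", "sports", "finance", "finance"]
features = {"match", "goal", "market", "stock"}

class_docs, term_counts = train(docs, labels, features)
predicted = classify(["match", "goal"], class_docs, term_counts, features)
```

Storing the statistics as nested counters keyed by category is one simple choice of data structure; the structures actually used in the thesis are not described in the abstract.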
Keywords/Search Tags: Vector Space Model, Information Gain, Bayesian Classifier, Document Classification