
Research And Implementation Of Text Classification Technology Based On Bayesian Theory

Posted on: 2010-06-04    Degree: Master    Type: Thesis
Country: China    Candidate: Y Liu    Full Text: PDF
GTID: 2178360272497025    Subject: Software engineering
Abstract/Summary:
The rapid development of the Internet lets people obtain more information than ever before. This is convenient, but the rapid growth of information also makes it harder to find useful information: locating what one really needs in such a mass of data has become difficult. One solution is to classify the information, which narrows the scope of a search and improves its efficiency. However, information is now generated far faster than people can organize it, so traditional manual text classification is increasingly impractical. Under this urgent demand, automatic text classification has become a focus of research.

Automatic text classification first learns from documents that people have already classified, analyzes the features shared by the texts within each category, and summarizes the distinctions between categories. When a new text is to be classified, the classifier compares its features with those of every category and assigns it to the category whose features are closest. Automatic text classification helps people collect and select the information they need effectively; it can also discover new concepts in the growing mass of information and automatically analyze the relationships between them, making automated information processing possible.

Text classification combines knowledge from machine learning and information retrieval and is closely related to text information extraction and text mining, so it involves many technologies; this thesis studies these technologies one by one in detail.

Text classification represents a text by extracting the valuable information in it, which greatly reduces the amount of data to be processed and avoids interference from noise. The first task is therefore to segment the text and represent it as a set of word strings. Text pre-processing accomplishes this: an analyzer first segments the text into words, the result is compared against a pre-built stopword list to remove empty words and words that carry little category information, and finally repeated words are merged so that the text is represented as a set of word strings.

After pre-processing, the word set of a text is still very large, so the features must be compressed to further extract the valuable information in the text. Feature compression includes feature selection and feature extraction. Feature selection usually applies an evaluation function to score every feature, sorts the features by score from high to low, and keeps a number of the highest-scoring features as the most valuable information in the text. Feature extraction instead transforms the original feature space into a low-dimensional one, mapping the classification information of the original space into the new low-dimensional space in order to reduce the number of features.
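To make the score-sort-select procedure of feature selection concrete, the following is a minimal sketch, not the thesis's own implementation. It assumes documents have already been pre-processed into sets of word strings and uses document frequency as the evaluation function; any of the evaluation functions studied below could be substituted for it.

```python
def document_frequency(token, documents):
    """Evaluation function: the number of documents containing the token."""
    return sum(1 for doc in documents if token in doc)

def select_features(documents, k):
    """Score every feature with the evaluation function, sort the scores
    from high to low, and keep the k highest-scoring features."""
    vocabulary = {token for doc in documents for token in doc}
    scores = {token: document_frequency(token, documents) for token in vocabulary}
    return sorted(vocabulary, key=scores.get, reverse=True)[:k]

# Each document is a pre-processed set of word strings (see above).
docs = [{"bayes", "classifier", "text"},
        {"vector", "space", "text"},
        {"bayes", "probability"}]
print(select_features(docs, 3))   # e.g. ['text', 'bayes', ...]
```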
This thesis studies the conventional feature compression methods one by one. The feature selection methods include Document Frequency, Information Gain, Expected Cross Entropy, Mutual Information, the Chi-square Statistic, and Weight of Evidence for Text; the feature extraction methods include Principal Component Analysis, Latent Semantic Indexing, and Fisher Linear Discriminant Analysis.

After feature compression the text is represented as a set of keywords; the next task is to transform this set into a form a computer can handle. In this thesis the text collection is represented with the vector space model, which expresses each text as a vector in Euclidean space. The common features of the texts in a collection are extracted, each feature corresponds to one dimension of the vector space, and the value of a component is the weight of that feature in the given text. A feature's weight reflects its ability to characterize the text and to distinguish it from other categories. This thesis introduces the conventional weighting methods: Boolean weighting, term frequency weighting, term frequency-inverse document frequency (TF-IDF) weighting, and entropy weighting.

After the texts of each category are represented in the vector space model, the next task is to choose a classification method and build a text classifier. This thesis studies the conventional classification methods in detail, including the Rocchio algorithm, K-Nearest Neighbor, Bayesian classification, Support Vector Machines, Artificial Neural Networks, Decision Trees, association-rule classification, and Rough Sets. With its clear structure, strong resistance to noise, and ability to handle incomplete data, Bayesian classification has become one of the most widely applied classification methods, so this thesis focuses on it.

Bayesian classification is a statistical method based on classical Bayesian probability theory; its strong abilities of model representation, learning, and inference let it achieve satisfactory classification results. Naive Bayes is the most representative Bayesian method: it needs only one scan of the data and has strong noise resistance and self-correction ability, so it classifies quickly and accurately. Naive Bayes must compute the conditional probability of a text under each category. Because the method is built on the assumption that features are conditionally independent given the category, this probability reduces to the product of the conditional probabilities of the text's features under each category, and those feature probabilities can be estimated from the training set. After the conditional probability of the text under every category has been computed, the text is assigned to the category with the maximal probability. Based on this idea, this thesis designs and implements a classification system for English texts.

In the training phase, the system uses term frequency to estimate the conditional probability of each feature under each category and records the results in a conditional probability table for each category.
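As an illustration of the training phase just described, here is a minimal sketch of term-frequency estimation of the conditional probabilities, with one probability table per category. The function and variable names are assumptions made for the example, and Laplace (add-one) smoothing is included only to avoid zero counts; the thesis does not state whether its system uses smoothing.

```python
from collections import Counter, defaultdict

def train_naive_bayes(labeled_docs):
    """labeled_docs: list of (tokens, category) pairs after pre-processing.
    Returns the prior P(c) and, via term frequency, one table of
    conditional probabilities P(feature | c) per category."""
    doc_counts = Counter(category for _, category in labeled_docs)
    term_counts = defaultdict(Counter)        # category -> feature -> count
    vocabulary = set()
    for tokens, category in labeled_docs:
        term_counts[category].update(tokens)
        vocabulary.update(tokens)

    priors, cond_prob = {}, {}
    for category, counts in term_counts.items():
        priors[category] = doc_counts[category] / len(labeled_docs)
        total = sum(counts.values())
        # Laplace smoothing is an assumption of this sketch, not of the thesis.
        cond_prob[category] = {t: (counts[t] + 1) / (total + len(vocabulary))
                               for t in vocabulary}
    return priors, cond_prob
```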
In the classification phase, when the conditional probability of a text under each category is computed, the system only needs to look up each feature's conditional probability in the corresponding category's table rather than recompute it, which greatly improves the system's efficiency. Finally, this thesis evaluates the classifier's performance using precision, recall, and the F1-measure; the experiments show that the system achieves high classification accuracy.
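Continuing the sketch above, the classification phase only looks up the stored probabilities, and the evaluation computes precision P, recall R, and F1 = 2PR / (P + R) per category. Log-probabilities are summed instead of multiplying raw probabilities, a standard numerical-stability choice that does not change which category is maximal; this is illustrative code, not the system implemented in the thesis.

```python
import math

def classify(tokens, priors, cond_prob):
    """Classification phase: look up each feature's stored conditional
    probability (no re-estimation), sum log-probabilities per category,
    and return the category with the maximal score."""
    best_label, best_score = None, float("-inf")
    for category in priors:
        score = math.log(priors[category])
        for token in tokens:
            if token in cond_prob[category]:   # unseen features are skipped here
                score += math.log(cond_prob[category][token])
        if score > best_score:
            best_label, best_score = category, score
    return best_label

def precision_recall_f1(predicted, actual, category):
    """Evaluate one category with precision, recall and F1 = 2PR / (P + R)."""
    tp = sum(p == category == a for p, a in zip(predicted, actual))
    fp = sum(p == category != a for p, a in zip(predicted, actual))
    fn = sum(a == category != p for p, a in zip(predicted, actual))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```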
Keywords/Search Tags: Text Classification, Naive Bayes, Vector Space Model, Feature Selection and Extraction, F1-Measure