
Text Classification Algorithm Based On Attributes Correlation

Posted on: 2012-06-07  Degree: Master  Type: Thesis
Country: China  Candidate: W Zhang  Full Text: PDF
GTID: 2178330332489971  Subject: Computer software and theory
Abstract/Summary:
Since mass information exists in the form of text, classifying large-scale text collections at high speed has become an urgent problem. Automatic text classification, which combines statistical methods with machine learning theory and assigns texts to pre-defined classes, arose to meet this need. It handles the classification of massive text information well and is widely used. Current research on text classification focuses on two aspects: text representation and classifier algorithms. The feature space derived from preprocessing is high-dimensional and sparse, which degrades both classification performance and efficiency. Common text classifiers include the Bayesian classifier, k-means, SVM, neural networks, and so on.

This paper first introduces the history, basic concepts, and process of text classification research; the basic ideas and theories of the mainstream learning algorithms for text classification; and the evaluation criteria and commonly used data sets. Secondly, machine learning methods often fail to fully consider the semantic information of text, ignoring the relationships among condition attributes and between condition attributes and the decision attribute. This paper therefore centers on the relationships between attributes. Based on an analysis of text classification research and the problems in current work, we focus on three issues.
The first is how to exploit the relationships between attributes to improve the precision of text classification; the second is how to improve text classification algorithms to upgrade classifier performance; and the third is to verify, through extensive experimental comparisons, the effectiveness of the algorithms proposed in this paper.

This paper carries out the following tasks. Firstly, a Weighted Naive Bayesian Ensemble (WNBE) classification algorithm based on the correlation degree of attributes is proposed to improve classification performance. Each attribute is assigned a weight according to its correlation degree with the decision attribute, and training data with weighted attributes are sampled to learn member classifiers. The algorithm is tested on 16 UCI data sets and compared with the Naive Bayes (NB) classifier, the NB net, and a member classifier (NB trained with AdaBoost); it is also tested on 4 text data sets and compared with the NB classifier. The results show that the ensemble classifier improves classification performance.

Secondly, NB is a probabilistic method resting on a conditional-independence assumption, which ignores the correlation between condition attributes and can hurt performance. To improve classification performance by exploiting this correlation, a Naive Bayes based on pairs of attributes (NBA) is proposed. The algorithm computes the probability of a pair of attributes jointly, thereby taking into account the contribution of the correlation between attributes, which mitigates the defect of the attribute-independence assumption to some extent.
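The abstract does not give the exact correlation measure or weighting scheme used by WNBE. A minimal sketch of the attribute-weighting idea, assuming mutual information between each attribute and the class as the correlation degree and using each weight as an exponent on the corresponding conditional likelihood, might look like the following (the class name `WeightedNB` and all details beyond the weighting idea are illustrative assumptions; the ensemble sampling of member classifiers is omitted):

```python
import math
from collections import Counter, defaultdict

def mutual_information(xs, ys):
    """I(X;Y) for two discrete sequences, used here as a correlation degree."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        pj = c / n
        mi += pj * math.log(pj / ((px[x] / n) * (py[y] / n)))
    return mi

class WeightedNB:
    """Naive Bayes with per-attribute weights w_i = I(A_i; C):
    score(c) = log P(c) + sum_i w_i * log P(a_i | c)."""
    def fit(self, X, y, alpha=1.0):
        n, d = len(X), len(X[0])
        self.classes = sorted(set(y))
        self.prior = {c: math.log(y.count(c) / n) for c in self.classes}
        # Correlation degree of each attribute with the decision attribute.
        self.w = [mutual_information([row[i] for row in X], y) for i in range(d)]
        # Laplace-smoothed conditional counts: cond[i][(value, class)].
        self.cond = [defaultdict(lambda: alpha) for _ in range(d)]
        self.vals = [set(row[i] for row in X) for i in range(d)]
        for row, c in zip(X, y):
            for i, v in enumerate(row):
                self.cond[i][(v, c)] += 1
        self.cls_count = Counter(y)
        self.alpha = alpha
        return self

    def predict(self, row):
        def score(c):
            s = self.prior[c]
            for i, v in enumerate(row):
                denom = self.cls_count[c] + self.alpha * len(self.vals[i])
                s += self.w[i] * math.log(self.cond[i][(v, c)] / denom)
            return s
        return max(self.classes, key=score)
```

Attributes strongly correlated with the class thus dominate the posterior score, while near-independent attributes contribute almost nothing, which is one plausible reading of "attribute weighting by correlation degree."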
Experimental results on 10 UCI benchmark data sets and 4 text data sets show that it outperforms NB.

Finally, C4.5 is a top-down, one-step greedy search algorithm, so it can only find a local optimum of a classification task. To increase the probability of finding the global optimum, a novel decision tree construction algorithm adopting a two-step lookahead idea is proposed. The algorithm considers the information gain obtained by selecting two attributes simultaneously, rather than only the gain of a single optimal attribute, so it is more likely to reach the global optimum of a classification task. Experimental results on 10 UCI benchmark data sets show that it outperforms C4.5.
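The thesis's exact two-step scoring function is not given in this abstract. One way to sketch the idea is to score a candidate split attribute by its own information gain plus the best achievable gain one level deeper in each branch, weighted by branch size (function names here are illustrative, not the thesis's):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, i):
    """One-step information gain of splitting on attribute i (C4.5-style)."""
    base, n, rem = entropy(labels), len(rows), 0.0
    for v in set(r[i] for r in rows):
        sub = [l for r, l in zip(rows, labels) if r[i] == v]
        rem += len(sub) / n * entropy(sub)
    return base - rem

def two_step_gain(rows, labels, i):
    """Gain of attribute i plus the best follow-up gain in each branch,
    weighted by branch size: a lookahead score over attribute pairs."""
    n, d = len(rows), len(rows[0])
    score = info_gain(rows, labels, i)
    for v in set(r[i] for r in rows):
        idx = [k for k, r in enumerate(rows) if r[i] == v]
        sub_rows = [rows[k] for k in idx]
        sub_lab = [labels[k] for k in idx]
        best = max((info_gain(sub_rows, sub_lab, j) for j in range(d) if j != i),
                   default=0.0)
        score += len(idx) / n * best
    return score

def pick_split(rows, labels):
    """Choose the split attribute by two-step lookahead instead of greedy gain."""
    return max(range(len(rows[0])), key=lambda i: two_step_gain(rows, labels, i))
```

On XOR-like data, where the class depends on two attributes jointly, each relevant attribute has zero one-step gain, so a greedy C4.5-style chooser prefers a weakly correlated distractor; the two-step score recovers the jointly informative pair, which illustrates why lookahead can escape the local optimum mentioned above.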
Keywords/Search Tags: BNC, correlation degree, attribute weighting, decision tree, information gain, local optimum