Research On Centroid-based Document Classification Algorithms

Posted on: 2014-02-21
Degree: Master
Type: Thesis
Country: China
Candidate: X T Zhou
Full Text: PDF
GTID: 2248330395497504
Subject: Computer application technology
Abstract/Summary:
With the rapid development and widespread adoption of Internet technology, human society has moved from an age of information scarcity to an age of extreme information abundance. Internet users increasingly need to obtain the text information they require in a timely manner and to organize it systematically. How to classify, manage, and organize massive amounts of text information automatically, efficiently, and systematically has therefore become an important research topic in machine learning and information retrieval. Automatic text classification is significant in this context because it is a technology that assigns texts to predefined categories and improves the speed of information retrieval.

Among the numerous text categorization algorithms, the centroid-based algorithm is widely used and studied because of its simple implementation, efficient computation, and good performance. However, it also has shortcomings. Researchers have found that the centroid-based algorithm is limited by an inductive bias caused by its own assumptions, which harms its classification performance. To correct this inductive bias and improve classification performance, researchers have proposed a number of improved methods. However, which main factors influence the classification performance of the centroid-based algorithm, and how they affect it, is still not very clear.

To address these two questions, this paper analyzes the inductive bias of the centroid-based algorithm further. On this basis, to correct the inductive bias, this paper proposes "an improved centroid-based algorithm based on kernel methods under the empirical risk minimization principle" and evaluates its performance. The work is divided into three parts as follows:

1. To identify the main factors influencing the performance of the centroid-based algorithm, this paper builds on previous studies and analyzes its inductive bias in detail from the perspectives of preference bias and restriction bias. It finds that the inductive bias is mainly caused by two factors: the method for calculating the class centroid vector in the training stage, and the method for calculating the similarity between the class centroid and the text in the testing stage. First, using the arithmetic average to calculate the centroid of each category in the training stage results in a preference bias. This step produces a poorly representative class centroid and thus affects classification performance. Second, using the cosine similarity measure to calculate the similarity between the class centroid and the text in the testing stage results in a restriction bias. As a result, the algorithm cannot solve linearly non-separable problems, which also affects classification performance.

2. To correct the preference bias and restriction bias of the centroid-based algorithm, and thereby improve its classification performance, this paper puts forward "an improved centroid-based algorithm based on kernel methods under the empirical risk minimization principle". First, to correct the preference bias of the basic centroid-based algorithm, this paper adopts the empirical risk minimization model to guide the adjustment of the class centroid vectors in pursuit of more representative class centroids, and proposes "a modified centroid-based algorithm under the empirical risk minimization principle". Then, on this basis, to handle data sets that are not linearly separable, this paper adopts kernel methods to correct the restriction bias of the basic centroid-based algorithm, and finally proposes "an improved centroid-based algorithm based on kernel methods under the empirical risk minimization principle".

3. Finally, this paper evaluates the classification performance of the proposed "improved centroid-based algorithm based on kernel methods under the empirical risk minimization principle" on the selected "tmdata" data sets. The experimental evaluation uses the macro-averaged F1 value and the micro-averaged F1 value as evaluation measures, selects the basic centroid-based algorithm and the support vector machine method as comparison algorithms, and adopts three-fold cross-validation. The results show that, on the selected "tmdata" data sets, the classification performance of the improved centroid-based algorithm proposed in this paper is better than that of the basic centroid-based algorithm and comparable to that of the support vector machine method. This shows that the improved method is a viable way to raise the classification performance of the centroid-based algorithm.
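
For readers unfamiliar with the baseline, the following is a minimal Python sketch of the basic centroid-based algorithm described in the abstract (arithmetic-average centroids in training, cosine similarity in testing), together with the macro-averaged and micro-averaged F1 measures used in the evaluation. It is not the improved algorithm proposed in the thesis. The function names, the representation of a document as a dictionary of term weights, and the tie-breaking rule are all assumptions made for illustration; term weighting is assumed to have been applied beforehand.

```python
import math
from collections import defaultdict


def train_centroids(docs, labels):
    """Training stage: each class centroid is the arithmetic average
    of the term-weight vectors of that class's training documents."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for doc, label in zip(docs, labels):
        counts[label] += 1
        for term, weight in doc.items():
            sums[label][term] += weight
    return {
        label: {term: total / counts[label] for term, total in terms.items()}
        for label, terms in sums.items()
    }


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def predict(centroids, doc):
    """Testing stage: assign the class whose centroid is most similar.
    Assumption: ties go to the alphabetically first class label."""
    return max(sorted(centroids), key=lambda c: cosine(centroids[c], doc))


def f1_scores(y_true, y_pred):
    """Return (macro-averaged F1, micro-averaged F1)."""
    classes = sorted(set(y_true) | set(y_pred))
    tp = dict.fromkeys(classes, 0)
    fp = dict.fromkeys(classes, 0)
    fn = dict.fromkeys(classes, 0)
    for truth, guess in zip(y_true, y_pred):
        if truth == guess:
            tp[truth] += 1
        else:
            fp[guess] += 1
            fn[truth] += 1

    def f1(t, p, n):
        denom = 2 * t + p + n
        return 2 * t / denom if denom else 0.0

    # Macro: average the per-class F1 values, so every class counts equally.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in classes) / len(classes)
    # Micro: pool the counts over all classes, so every document counts equally.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro
```

A typical use would be to call `train_centroids` on the training fold, `predict` on each held-out document, and `f1_scores` on the resulting label lists, repeating this for each fold of the cross-validation. Because `predict` compares a document with one averaged vector per class through cosine similarity, the decision boundary between any two classes is linear in the length-normalized document vectors, which is the restriction bias the abstract attributes to the testing stage.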
Keywords/Search Tags: Document Classification, Centroid-based Algorithm, Inductive Bias, Empirical Risk Minimization, Kernel Method