An Algorithm To Hierarchical Text Classification Based On Feature Selection

Posted on: 2014-09-26
Degree: Master
Type: Thesis
Country: China
Candidate: L Shi
Full Text: PDF
GTID: 2268330425966098
Subject: Computer software and theory
Abstract/Summary:
With the spread of the Internet, a large amount of web data must be processed every day, and most of it exists as text. The basic approach to classifying these texts is first to learn the features of each category from the pages of a training sample set, then to compare each page to be classified against the category feature sets by feature similarity, and finally to assign the page to the corresponding category. Traditional text classification, however, usually uses a flat classification method, which treats all classes as being at the same level and ignores the hierarchical relationships among them, and this leads to a large amount of redundancy. How to classify text using the hierarchical structure of the categories is therefore a topic of practical significance. In addition, how to reduce the influence of unbalanced data sets on classification and improve classification accuracy is currently a main research direction in text classification.

This thesis first studies the background and relevant theory of text classification and feature selection, and analyzes and summarizes the current state of text classification. On that basis, it studies hierarchical text categorization in depth, focusing on the factors that affect its performance and effectiveness from two aspects. First, from the perspective of feature selection, it analyzes feature selection methods, introduces the concepts of hierarchical relevancy and hierarchical redundancy, and proposes the rrHTC algorithm, which removes redundant features and reduces their influence on the precision of text categorization. Second, from the perspective of improving the classification algorithm, it analyzes the SVM-KNN algorithm and improves it by introducing the concept of sample center distance, yielding the c-SVM-KNN algorithm. Finally, the two algorithms are verified on the 20NG dataset and on a dataset crawled from NetEase. The results show that selecting features with the rrHTC algorithm and classifying text with the c-SVM-KNN algorithm effectively improves classification accuracy.
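
The basic flat method described at the start of the abstract (learn category features from training pages, compare a new page by feature similarity, assign the closest category) can be illustrated with a minimal sketch. This is not the thesis's rrHTC or c-SVM-KNN algorithm, whose details the abstract does not give. The whitespace tokenizer, the raw term-frequency profiles, the cosine similarity measure, the toy training data, and all function names below are assumptions chosen for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption; real systems preprocess more)."""
    return text.lower().split()


def train_category_features(samples):
    """Build one term-frequency profile per category from (text, label) pairs."""
    profiles = defaultdict(Counter)
    for text, label in samples:
        profiles[label].update(tokenize(text))
    return dict(profiles)


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(text, profiles):
    """Assign the category whose profile is most similar to the text."""
    vec = Counter(tokenize(text))
    return max(profiles, key=lambda label: cosine(vec, profiles[label]))


# Hypothetical toy training set; all categories sit at one level (flat).
training = [
    ("the team won the football match", "sports"),
    ("the striker scored a late goal", "sports"),
    ("the new processor doubles memory bandwidth", "tech"),
    ("the compiler update improves software performance", "tech"),
]
profiles = train_category_features(training)
print(classify("a goal in the final match", profiles))
```

Because every category is compared at the same level, this sketch shows the flat setting that the abstract criticizes. A hierarchical variant would instead route a page through the category tree, choosing among the children of one node at a time.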
Keywords/Search Tags: Text classification, rrHTC, Sample center distance, Imbalanced data, SVM-KNN