An Algorithm For Hierarchical Text Classification Based On Feature Selection

Posted on:2014-09-26

Degree:Master

Type:Thesis

Country:China

Candidate:L Shi

GTID:2268330425966098

Subject:Computer software and theory

Abstract/Summary:

With the spread of the Internet, a large amount of data on the Web must be processed every day, and most of it exists as text. The basic method for classifying such text is first to train on the pages of a training sample set and collect the features of each category, then to compare each page to be classified against the category feature set by feature similarity, and finally to assign the page to the corresponding category. Traditional text classification, however, usually uses flat classification, which treats all classes as being at the same level and does not consider the hierarchical relationships among them, and this leads to a large amount of redundancy. How to classify text using the hierarchical structure of the categories is therefore a topic of practical significance. In addition, how to reduce the influence of imbalanced data sets on classification and improve classification accuracy is also a main direction of current text classification research.

This thesis first studies the background and relevant theory of text classification and feature selection, and analyzes and summarizes the current state of text classification. On that basis, it studies hierarchical text classification in depth, focusing on the factors that affect its performance and effectiveness from two aspects. First, from the perspective of feature selection, it analyzes feature selection methods, introduces the concepts of hierarchical relevancy and hierarchical redundancy, and proposes the rrHTC algorithm, which removes redundant features and reduces their influence on the precision of text categorization. Second, from the perspective of improving the classification algorithm, it analyzes the SVM-KNN algorithm and improves it by introducing the concept of sample center distance, proposing the c-SVM-KNN algorithm. Finally, the two algorithms are verified on the 20NG dataset and a dataset crawled from NetEase. The results show that selecting features with the rrHTC algorithm and classifying text with the c-SVM-KNN algorithm effectively improves classification accuracy.
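
The basic flat method described at the start of the abstract (collect category features from training pages, compare a new page by feature similarity, assign it to the best category) can be illustrated with a minimal sketch. This is not the thesis's rrHTC or c-SVM-KNN algorithm, whose details the abstract does not give. The whitespace tokenizer, the use of summed term counts as category features, cosine similarity as the similarity measure, and the toy training pages are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption; real systems need more)."""
    return text.lower().split()


def train_category_features(training_set):
    """Sum term counts per category to build one feature vector per category."""
    features = defaultdict(Counter)
    for text, category in training_set:
        features[category].update(tokenize(text))
    return dict(features)


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b.get(term, 0) for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(text, category_features):
    """Assign the text to the category whose feature vector is most similar."""
    vector = Counter(tokenize(text))
    return max(
        category_features,
        key=lambda c: cosine_similarity(vector, category_features[c]),
    )


# Toy, made-up training pages (not from 20NG or the NetEase data).
TRAINING_SET = [
    ("the team won the football match", "sports"),
    ("the striker scored a late goal", "sports"),
    ("the new phone has a fast processor", "technology"),
    ("the laptop ships with a new processor", "technology"),
]

CATEGORY_FEATURES = train_category_features(TRAINING_SET)
```

Calling `classify("a goal in the football match", CATEGORY_FEATURES)` compares the page against both category vectors and picks the closer one. Because every category is scored at the same level, this sketch also shows the flat setting that the abstract contrasts with hierarchical classification.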

Keywords/Search Tags:

Text classification, rrHTC, sample center distance, imbalanced data, SVM-KNN

Related items

1	Research On Improved Naive Bayes Classification Model For Imbalanced E-commerce Review Text
2	Research On Text Classification With Few Samples Based On GAN
3	Statistical Automatic Text Classification Methods In Digital Libraries
4	Research And Implementation Of Sensitive Text Classification Algorithm Based On Artificial Immune System
5	A Method Dealing With Sample Imbalances In Text Classification
6	Knn Text Classification Algorithm Based On The Semantics Of The Center
7	The Research Of Text Classification Based On Distance Metric Learning
8	The Design And Implementation Of Call Center Text Classification System
9	Neural Networks For Small Sample Data Classification Integrated With Decentralized Technology
10	The Study Of Complex Data Processing Method Based On Classification