
The Research Of Feature Selection Method Based On Diversity Measure On Text Data

Posted on: 2018-01-12    Degree: Master    Type: Thesis
Country: China    Candidate: S L Chao    Full Text: PDF
GTID: 2348330542459879    Subject: Software engineering
Abstract/Summary:
Text categorization is one of the key technologies for managing and organizing text data effectively. It helps people locate information quickly and alleviates the problem of information clutter, and it has great application value as a hot topic in the field of data mining. Because of the high dimensionality of text data, the efficiency and accuracy of text classification are greatly reduced, which makes feature selection a vital part of text classification. Feature selection removes features that carry little information or are irrelevant to the category, yielding a discriminative subset of features. It effectively eliminates redundant features and noise and reduces the dimensionality of the data, thereby improving the accuracy and speed of classification. In this paper, we first introduce the text categorization process and its related technologies, and then study the feature selection process and related technologies in depth. The main research work of this paper is as follows.

Traditional feature selection algorithms generally consider only the relevance and redundancy between features and rarely consider their diversity, so the redundancy of the resulting feature subset cannot be completely eliminated. Based on feature diversity, this paper proposes a feature selection algorithm that jointly considers the relevance, redundancy, and diversity between features. By accounting for both diversity and redundancy, the proposed algorithm keeps the redundancy among selected features very small while guaranteeing the relevance between features and categories, so that the selected feature subset has stronger classification ability. The information distance from information theory is used to evaluate the diversity between features, and a balance coefficient is introduced to trade off redundancy against diversity. The algorithm is compared with the JMI, IG, and mRR algorithms, and the experimental results are analyzed.

This paper also proposes IDMCFS, a feature selection algorithm based on clustering features by information distance, which combines supervised and unsupervised learning and fully considers the diversity between features. First, the K-medoids clustering algorithm is used to cluster the original feature set, so that highly redundant features are grouped together and features in different clusters have large diversity. The distance measure used in the clustering is the information distance derived from information theory. Because the information distance between features does not change across clustering iterations, it only needs to be computed once, which greatly reduces the computational complexity of the algorithm. After clustering, the feature with the largest mutual information with the class is chosen from each cluster to form a candidate feature subset, and the mRMR rule is then used to select m features from this subset to ensure the correlation between the selected features and the category. The proposed algorithm is evaluated experimentally against mRMR, CMIM, and ReliefF, and the results are analyzed.
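The abstract does not give the exact scoring rule, so the following is a minimal sketch of how a relevance-redundancy-diversity criterion of this kind could be implemented. It assumes discretised bag-of-words features, mutual information as the relevance and redundancy measure, the variation-of-information form D(X, Y) = H(X, Y) - I(X; Y) as the information distance, and a hypothetical balance coefficient beta; the names select_features, beta, and k are illustrative, not the thesis's own.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def entropy(x):
    """Empirical Shannon entropy of a discrete variable (in nats)."""
    _, counts = np.unique(x, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def joint_entropy(x, y):
    """Empirical joint entropy H(X, Y) of two discrete variables."""
    xy = np.stack([x, y], axis=1)
    _, counts = np.unique(xy, axis=0, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def information_distance(x, y):
    """Variation-of-information distance D(X, Y) = H(X, Y) - I(X; Y),
    used here as a stand-in for the information distance in the thesis."""
    return joint_entropy(x, y) - mutual_info_score(x, y)

def select_features(X, y, k, beta=0.5):
    """Greedy forward selection balancing relevance, redundancy and diversity.

    X    : (n_samples, n_features) array of discretised term features
    y    : (n_samples,) array of class labels
    k    : number of features to select
    beta : hypothetical balance coefficient trading redundancy off against diversity
    """
    n_features = X.shape[1]
    # Relevance of each feature to the class, measured by mutual information.
    relevance = np.array([mutual_info_score(X[:, j], y) for j in range(n_features)])
    selected = [int(np.argmax(relevance))]  # start from the most relevant feature
    while len(selected) < k:
        best_j, best_score = None, -np.inf
        for j in range(n_features):
            if j in selected:
                continue
            # Average redundancy and diversity against already-selected features.
            red = np.mean([mutual_info_score(X[:, j], X[:, s]) for s in selected])
            div = np.mean([information_distance(X[:, j], X[:, s]) for s in selected])
            score = relevance[j] - red + beta * div
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
    return selected
```

The same structure as mRMR is kept on purpose: with beta = 0 the criterion reduces to a plain relevance-minus-redundancy rule, and larger beta pushes the search toward features that are far apart in information distance.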
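For the clustering-based algorithm, the sketch below illustrates an IDMCFS-style pipeline under the same assumptions: k-medoids clustering over a precomputed pairwise information-distance matrix (here via scikit-learn-extra's KMedoids) groups redundant features, one representative per cluster is kept by mutual information with the class, and an mRMR-style greedy step reduces the representatives to m features. The function name idmcfs, the parameters, and the use of scikit-learn-extra are assumptions for illustration; the abstract does not specify the thesis's concrete settings.

```python
import numpy as np
from sklearn.metrics import mutual_info_score
from sklearn_extra.cluster import KMedoids  # k-medoids from scikit-learn-extra

def idmcfs(X, y, n_clusters, m, info_distance):
    """Sketch of an IDMCFS-style pipeline (m <= n_clusters is assumed).

    X             : (n_samples, n_features) array of discretised term features
    y             : (n_samples,) array of class labels
    n_clusters    : number of feature clusters
    m             : number of features to return
    info_distance : callable(x, y) -> float, e.g. a variation-of-information
                    distance D(X, Y) = H(X, Y) - I(X; Y)
    """
    n_features = X.shape[1]

    # Pairwise information distances are fixed across clustering iterations,
    # so compute the full distance matrix once up front.
    D = np.zeros((n_features, n_features))
    for i in range(n_features):
        for j in range(i + 1, n_features):
            D[i, j] = D[j, i] = info_distance(X[:, i], X[:, j])

    # Step 1: k-medoids on the precomputed distances groups redundant features.
    labels = KMedoids(n_clusters=n_clusters, metric="precomputed",
                      random_state=0).fit_predict(D)

    # Step 2: from each cluster keep the feature with the largest MI with the class.
    relevance = np.array([mutual_info_score(X[:, j], y) for j in range(n_features)])
    reps = [max(np.flatnonzero(labels == c), key=lambda j: relevance[j])
            for c in range(n_clusters)]

    # Step 3: mRMR-style greedy refinement of the representatives down to m features.
    selected = [max(reps, key=lambda j: relevance[j])]
    while len(selected) < m:
        candidates = [j for j in reps if j not in selected]
        selected.append(max(
            candidates,
            key=lambda j: relevance[j] - np.mean(
                [mutual_info_score(X[:, j], X[:, s]) for s in selected])))
    return selected
```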
Keywords/Search Tags: Text categorization, Feature Selection, Information Theory, Information Distance, Diversity