
Dimension Reduction Technology Research Based On Text Features

Posted on: 2019-07-29
Degree: Master
Type: Thesis
Country: China
Candidate: Y Wu
GTID: 2428330572495085
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of Internet technology, massive amounts of text data are produced on the network. Analyzing and processing this data to mine useful information has become a difficult problem, and data mining technology has emerged in response. The rapid growth in the number of feature dimensions poses a major challenge to data mining tasks. Dimensionality reduction is an effective way to preprocess high-dimensional data, and feature selection is one of its most widely used techniques. However, traditional single feature selection algorithms suffer from redundancy, noisy information and low clustering accuracy. To address these problems, this thesis studies the following two aspects.

First, to overcome the deficiencies of a single feature selection algorithm and improve clustering results, a two-stage text feature selection algorithm based on differential evolution is proposed. In the filtering stage, the algorithm uses variance and average median to calculate feature relevance scores, merges the top-scoring features, filters out irrelevant features, and obtains a highly relevant feature subset, achieving an initial dimensionality reduction. In the wrapper stage, an improved differential evolution algorithm extracts the optimal feature subset and reduces the dimension further. The improved differential evolution algorithm constructs its fitness function from the document frequency and the cumulative feature word frequency, and introduces local optimal features and multiple difference vector strategies into the mutation operation, which accelerates convergence and improves the global search ability of the algorithm. Simulation results on different datasets indicate that the proposed algorithm effectively reduces the dimension of the text feature space for clustering and clearly improves precision, recall and F1.

Second, to remove the noise features in the feature subset and improve the inter-class discrimination of the feature selection algorithm, a three-stage text feature selection algorithm is proposed. In the first stage, the algorithm uses an improved average absolute difference method to filter irrelevant features out of the original feature space. In the second stage, redundant features are removed from the relevant feature space using the absolute cosine method combined with the feature length. In the third stage, principal component analysis is used to convert the high-dimensional, relevant, non-redundant feature space into a noise-free low-dimensional feature space while retaining the valuable text information, yielding the optimal feature subset. Simulation experiments show that, compared with other algorithms, the proposed algorithm achieves better accuracy, recall and F1, effectively removes noise features, and gives the selected feature subsets good inter-class discrimination.
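The first two stages of the three-stage algorithm described above can be illustrated with a minimal Python sketch. It is not the thesis implementation: the abstract does not specify the "improved" average absolute difference or how feature length enters the cosine step, so the sketch assumes the plain mean absolute difference as the relevance score and the plain absolute cosine between feature columns as the redundancy measure. The function names and both thresholds are hypothetical. The third stage (principal component analysis) is omitted.

```python
from math import sqrt


def mean_absolute_difference(column):
    """Mean absolute deviation of one feature (term) column from its mean."""
    mean = sum(column) / len(column)
    return sum(abs(x - mean) for x in column) / len(column)


def abs_cosine(u, v):
    """Absolute cosine similarity between two feature columns."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return abs(dot) / norm if norm else 0.0


def select_features(matrix, mad_threshold, cos_threshold):
    """Return sorted indices of the feature columns that survive both stages.

    matrix is a list of document rows, each a list of term weights.
    Both thresholds are illustrative assumptions, not values from the thesis.
    """
    columns = [list(col) for col in zip(*matrix)]

    # Stage 1 (relevance): keep features whose score exceeds the threshold.
    scores = {j: mean_absolute_difference(col) for j, col in enumerate(columns)}
    relevant = [j for j in scores if scores[j] > mad_threshold]

    # Stage 2 (redundancy): visit features from highest to lowest score and
    # drop any feature that is nearly collinear with one already kept.
    kept = []
    for j in sorted(relevant, key=lambda idx: -scores[idx]):
        if all(abs_cosine(columns[j], columns[k]) < cos_threshold for k in kept):
            kept.append(j)
    return sorted(kept)


# Toy document-term matrix: column 1 is twice column 0 (redundant),
# and column 2 is constant (irrelevant).
docs = [
    [2, 4, 1, 0],
    [0, 0, 1, 3],
    [4, 8, 1, 0],
    [0, 0, 1, 1],
]
print(select_features(docs, mad_threshold=0.5, cos_threshold=0.9))
```

In this toy matrix the constant column is filtered in the first stage and the lower-scoring member of the collinear pair is dropped in the second. A PCA projection of the surviving columns would then correspond to the third stage.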
Keywords/Search Tags: Data mining, Dimension reduction, Feature selection, Principal component analysis, Text clustering