Dimension Reduction Technology Research Based On Text Features

Posted on:2019-07-29

Degree:Master

Type:Thesis

Country:China

Candidate:Y Wu

GTID:2428330572495085

Subject:Computer Science and Technology

Abstract/Summary:

With the rapid development of Internet technology, massive amounts of text data are produced on the network. Analyzing and processing this data to mine useful information has become a difficult problem, and data mining technology emerged in response. The rapid growth in the dimensionality of data features poses a major challenge for data mining tasks. Dimensionality reduction is an effective way to preprocess high-dimensional data, and feature selection is one of its most widely used techniques. However, traditional single feature selection algorithms suffer from redundancy, noisy information, and low clustering accuracy. To address these problems, this thesis studies the following two aspects.

First, to overcome the deficiencies of single feature selection algorithms and improve clustering quality, a two-stage text feature selection algorithm based on differential evolution is proposed. In the filter stage, the algorithm uses variance and the mean-median measure to compute feature relevance scores, merges the top-scoring features, filters out irrelevant features, and obtains a highly relevant feature subset, achieving an initial dimensionality reduction. In the wrapper stage, an improved differential evolution algorithm extracts the optimal feature subset and reduces the dimension further. The improved differential evolution algorithm builds its fitness function from document frequency and cumulative feature term frequency, and introduces locally optimal features and a multiple-difference-vector strategy into the mutation operation, which accelerates convergence and improves the global search ability of the algorithm. Simulation results on different datasets indicate that the proposed algorithm effectively reduces the dimension of the text feature space in clustering and clearly improves precision, recall, and F1.

Second, to remove noisy features from the feature subset and improve the inter-class discrimination of the feature selection algorithm, a three-stage text feature selection algorithm is proposed. In the first stage, the algorithm uses an improved mean absolute difference method to filter irrelevant features out of the original feature space. In the second stage, redundant features are removed from the relevant feature space using the absolute cosine method combined with feature length. In the third stage, principal component analysis converts the high-dimensional, relevant, non-redundant feature space into a noise-free low-dimensional feature space while retaining the valuable text information, yielding the optimal feature subset. Simulation experiments show that, compared with other algorithms, the proposed algorithm achieves better accuracy, recall, and F1, effectively removes noisy features, and gives the selected feature subsets good inter-class discrimination.
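The first two stages of the three-stage algorithm can be illustrated with a minimal sketch. It is not the thesis's implementation: the abstract does not specify the "improved" mean absolute difference or how feature length enters the absolute cosine test, so the code below assumes the plain mean absolute difference and plain absolute cosine similarity. The function names, the thresholds, and the toy document-term matrix are hypothetical, and the third stage (principal component analysis) is left out.

```python
import math


def mean_absolute_difference(column):
    """Mean absolute deviation of one feature column from its mean."""
    mean = sum(column) / len(column)
    return sum(abs(v - mean) for v in column) / len(column)


def abs_cosine(u, v):
    """Absolute cosine similarity between two feature columns."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return abs(dot) / (norm_u * norm_v)


def select_features(matrix, mad_threshold, cos_threshold):
    """Return the column indices kept after both filter stages.

    matrix: list of documents, each a list of term weights.
    Both thresholds are assumed tuning parameters, not values from the thesis.
    """
    columns = [list(col) for col in zip(*matrix)]

    # Stage 1: drop features whose score falls below the relevance threshold.
    scores = [mean_absolute_difference(c) for c in columns]
    relevant = [j for j, s in enumerate(scores) if s >= mad_threshold]

    # Stage 2: visit features from highest to lowest score and keep one only
    # if it is not nearly collinear with a feature that was already kept.
    relevant.sort(key=lambda j: scores[j], reverse=True)
    kept = []
    for j in relevant:
        if all(abs_cosine(columns[j], columns[k]) < cos_threshold for k in kept):
            kept.append(j)
    return sorted(kept)


# Toy document-term matrix: 4 documents, 4 terms (hypothetical data).
toy_matrix = [
    [1.0, 2.0, 0.5, 0.0],
    [0.0, 0.0, 0.5, 3.0],
    [2.0, 4.0, 0.5, 0.0],
    [0.0, 0.0, 0.5, 1.0],
]
print(select_features(toy_matrix, mad_threshold=0.1, cos_threshold=0.9))  # -> [1, 3]
```

In the toy data, term 2 is constant across documents and is dropped as irrelevant, while term 0 is an exact multiple of term 1 and is dropped as redundant. A principal component analysis step would then be applied to the remaining columns.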

Keywords/Search Tags:

Data mining, Dimension reduction, Feature selection, Principal component analysis, Text clustering

Related items

1	Immune Clonal Selection Based Dimension Reduction And Applications
2	A Dimension Reduction Method For Large-scale Text Categorization
3	A Dimension Reduction Method For Large-scale Text Categorization
4	The Application Of Clustering Analysis Based On Principal Component Analysis And Rough Set In Financial Index Data
5	Application Of PCA Dimensionality Reduction Method Based On Latent Variables In Text Classification Problems
6	Unsupervised Clustering Algorithm Based On Dimension Reduction
7	Clustering Algorithm Research Based On The Bilinear Probabilistic Principal Component Analysis
8	Research On SOFM Text Clustering Algorithm
9	Study On Several Issues Of Text Clustering
10	Secure And Efficient Dimension-reducing Ranked Query Method For Encrypted Cloud Data