
Research on Key Technologies for Dimensionality Reduction of High-Dimensional Data

Posted on: 2018-07-07    Degree: Master    Type: Thesis
Country: China    Candidate: C J Li    Full Text: PDF
GTID: 2348330512983173    Subject: Software engineering
Abstract/Summary:
Along with the rapid development of information technology, information is represented in increasingly diverse forms and is ever easier to obtain, while the data objects of interest grow increasingly complex. Industry therefore has an urgent demand for data analysis and processing technology, especially for high-dimensional data, and dimensionality reduction for high-dimensional data has attracted much attention in computer science and related fields. In the era of parallel and distributed computing, how to analyze and process massive high-dimensional data in a distributed environment is a challenging problem that current research still needs to resolve, and one with important research value and practical significance.

Processing high-dimensional data directly raises many problems, such as the curse of dimensionality and the "algorithm failure" phenomenon. An effective way around them is to first reduce the dimensionality of the data. Principal component analysis (PCA) is a classical linear dimensionality reduction technique; it is simple, free of linearity error, and imposes no parameter restrictions. It suits linear data well, but its memory consumption is large and its computational complexity is high, and when PCA is applied to high-dimensional sparse big data, computing the covariance matrix between features becomes very difficult. In view of these problems, the main work of this thesis is as follows:

1) An improved PCA algorithm, E-PCA, is proposed. When PCA is used to reduce the dimensionality of high-dimensional sparse big data, the feature dimension is too high for a computer to read all features into memory and process them at once. Block processing mitigates this, but the time it consumes is too long to meet practical application requirements. To address this, information entropy is introduced and an entropy-based dimensionality reduction algorithm for high-dimensional sparse big data (E-PCA) is proposed: features are first selected according to their information entropy values, which greatly reduces the number of features, and features are then extracted through a matrix transformation, achieving a double dimensionality reduction (a minimal illustrative sketch of this two-stage idea follows this abstract). The experimental results show that E-PCA outperforms Block PCA in four respects: memory occupancy, running time, dimensionality after reduction, and classification accuracy.

2) A distributed processing framework for dimensionality reduction of high-dimensional data based on MapReduce is proposed. After analyzing the principle of Hadoop's distributed implementation, MapReduce-based distributed dimensionality reduction frameworks for both PCA and E-PCA are proposed. A Hadoop cluster platform is set up, the implementation code is written, and the distributed PCA and E-PCA algorithms are run on the platform (a sketch of the MapReduce idea is also given below). Tests on real high-dimensional sparse big data further confirm that E-PCA performs better than PCA.
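The abstract gives no pseudocode for E-PCA; the following is only a minimal sketch of the two-stage idea described above, written in Python with NumPy. The function names, the histogram-based entropy estimate, and the parameters k (number of features kept by entropy selection) and d (target dimensionality) are illustrative assumptions, not the author's implementation.

    import numpy as np

    def feature_entropy(col, bins=10):
        # Shannon entropy of one feature, estimated from a histogram of its values (assumed binning)
        counts, _ = np.histogram(col, bins=bins)
        p = counts / counts.sum()
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    def e_pca(X, k, d):
        # Stage 1: keep the k features with the highest information entropy
        entropies = np.apply_along_axis(feature_entropy, 0, X)
        keep = np.argsort(entropies)[-k:]
        X_sel = X[:, keep]
        # Stage 2: classical PCA (matrix transformation) on the reduced feature set
        X_centered = X_sel - X_sel.mean(axis=0)
        cov = np.cov(X_centered, rowvar=False)
        eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
        top = eigvecs[:, -d:][:, ::-1]              # d leading principal directions
        return X_centered @ top

    # Example: select 200 of 1000 features by entropy, then project to 50 components
    X = np.random.rand(500, 1000)
    print(e_pca(X, k=200, d=50).shape)              # (500, 50)

The point of the first stage is that the covariance matrix in the second stage is computed over only k selected features instead of the full dimensionality, which is what makes the subsequent PCA step tractable on high-dimensional sparse data.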
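The abstract names MapReduce on Hadoop but gives no implementation detail; below is a minimal Hadoop-Streaming-style sketch of how the per-feature entropy statistics used by E-PCA could be gathered in parallel. The input record format (whitespace-separated "index:value" tokens for the non-zero entries of one sample) and the fixed binning of values assumed to lie in [0, 1) are illustrative assumptions only, not the thesis's actual framework.

    # mapper.py -- emits one (feature_index, binned_value) pair per non-zero entry of a sample
    import sys

    BINS = 10
    for line in sys.stdin:
        # each input line is one sample: "index:value" tokens for its non-zero features (assumed format)
        for token in line.split():
            idx, value = token.split(":")
            bin_id = min(int(float(value) * BINS), BINS - 1)   # crude binning, values assumed in [0, 1)
            print(f"{idx}\t{bin_id}")

    # reducer.py -- builds a per-feature histogram of binned values and prints the feature's entropy
    import sys
    import math
    from collections import defaultdict

    counts = defaultdict(lambda: defaultdict(int))
    for line in sys.stdin:
        idx, bin_id = line.strip().split("\t")
        counts[idx][bin_id] += 1

    for idx, bins in counts.items():
        total = sum(bins.values())
        entropy = -sum((c / total) * math.log2(c / total) for c in bins.values())
        print(f"{idx}\t{entropy:.4f}")               # feature index and its estimated information entropy

Each mapper handles a split of the samples independently, and the shuffle groups records by feature index, so the entropy of every feature can be computed without any single node holding the full data matrix; the features selected by entropy would then feed the PCA stage of E-PCA.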
Keywords/Search Tags:high-dimensional sparse big data, dimensionality reduction, principal component analysis, distributed processing