
Research on Key Technologies for Dimensionality Reduction of High-Dimensional Data

Posted on: 2018-07-07    Degree: Master    Type: Thesis
Country: China    Candidate: C J Li    Full Text: PDF
GTID: 2348330512983173    Subject: Software engineering
Abstract/Summary:
Along with the rapid development of information technology, information is represented in increasingly diverse forms and is ever easier to obtain, while the data objects of interest grow increasingly complex. Industry therefore has an urgent demand for data analysis and processing technology, especially for high-dimensional data, and dimensionality reduction for high-dimensional data has attracted much attention in computer science and related fields. In the era of parallel and distributed computing, how to analyze and process massive high-dimensional data in a distributed environment is a challenging problem that current research still needs to resolve, and one with important research value and practical significance.

Processing high-dimensional data directly raises many problems, such as the curse of dimensionality and the "algorithm failure" phenomenon. An effective way around them is to first reduce the dimensionality of the data. Principal component analysis (PCA) is a classical linear dimensionality reduction technique; it is simple, free of linearity error, and imposes no parameter restrictions. It suits linear data well, but its memory consumption is large and its computational complexity is high, and when PCA is applied to high-dimensional sparse big data, computing the covariance matrix between features becomes very difficult. In view of these problems, the main work of this thesis is as follows:

1) An improved PCA algorithm, E-PCA, is proposed. When PCA is used to reduce the dimensionality of high-dimensional sparse big data, the feature dimension is too high for a computer to read all features into memory and process them at once. Block processing mitigates this, but the time it consumes is too long to meet practical application requirements. To address this, information entropy is introduced and an entropy-based dimensionality reduction algorithm for high-dimensional sparse big data (E-PCA) is proposed: features are first selected according to their information entropy values, which greatly reduces the number of features, and features are then extracted through a matrix transformation, achieving a double dimensionality reduction (a minimal illustrative sketch of this two-stage idea follows this abstract). The experimental results show that E-PCA outperforms Block PCA in four respects: memory occupancy, running time, dimensionality after reduction, and classification accuracy.

2) A distributed processing framework for dimensionality reduction of high-dimensional data based on MapReduce is proposed. After analyzing the principle of Hadoop's distributed implementation, MapReduce-based distributed dimensionality reduction frameworks for both PCA and E-PCA are proposed. A Hadoop cluster platform is set up, the implementation code is written, and the distributed PCA and E-PCA algorithms are run on the platform (a sketch of the MapReduce idea is also given below). Tests on real high-dimensional sparse big data further confirm that E-PCA performs better than PCA.
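The abstract gives no pseudocode for E-PCA; the following is only a minimal sketch of the two-stage idea described above, written in Python with NumPy. The function names, the histogram-based entropy estimate, and the parameters k (number of features kept by entropy selection) and d (target dimensionality) are illustrative assumptions, not the author's implementation.

    import numpy as np

    def feature_entropy(col, bins=10):
        # Shannon entropy of one feature, estimated from a histogram of its values (assumed binning)
        counts, _ = np.histogram(col, bins=bins)
        p = counts / counts.sum()
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    def e_pca(X, k, d):
        # Stage 1: keep the k features with the highest information entropy
        entropies = np.apply_along_axis(feature_entropy, 0, X)
        keep = np.argsort(entropies)[-k:]
        X_sel = X[:, keep]
        # Stage 2: classical PCA (matrix transformation) on the reduced feature set
        X_centered = X_sel - X_sel.mean(axis=0)
        cov = np.cov(X_centered, rowvar=False)
        eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
        top = eigvecs[:, -d:][:, ::-1]              # d leading principal directions
        return X_centered @ top

    # Example: select 200 of 1000 features by entropy, then project to 50 components
    X = np.random.rand(500, 1000)
    print(e_pca(X, k=200, d=50).shape)              # (500, 50)

The point of the first stage is that the covariance matrix in the second stage is computed over only k selected features instead of the full dimensionality, which is what makes the subsequent PCA step tractable on high-dimensional sparse data.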
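The abstract names MapReduce on Hadoop but gives no implementation detail; below is a minimal Hadoop-Streaming-style sketch of how the per-feature entropy statistics used by E-PCA could be gathered in parallel. The input record format (whitespace-separated "index:value" tokens for the non-zero entries of one sample) and the fixed binning of values assumed to lie in [0, 1) are illustrative assumptions only, not the thesis's actual framework.

    # mapper.py -- emits one (feature_index, binned_value) pair per non-zero entry of a sample
    import sys

    BINS = 10
    for line in sys.stdin:
        # each input line is one sample: "index:value" tokens for its non-zero features (assumed format)
        for token in line.split():
            idx, value = token.split(":")
            bin_id = min(int(float(value) * BINS), BINS - 1)   # crude binning, values assumed in [0, 1)
            print(f"{idx}\t{bin_id}")

    # reducer.py -- builds a per-feature histogram of binned values and prints the feature's entropy
    import sys
    import math
    from collections import defaultdict

    counts = defaultdict(lambda: defaultdict(int))
    for line in sys.stdin:
        idx, bin_id = line.strip().split("\t")
        counts[idx][bin_id] += 1

    for idx, bins in counts.items():
        total = sum(bins.values())
        entropy = -sum((c / total) * math.log2(c / total) for c in bins.values())
        print(f"{idx}\t{entropy:.4f}")               # feature index and its estimated information entropy

Each mapper handles a split of the samples independently, and the shuffle groups records by feature index, so the entropy of every feature can be computed without any single node holding the full data matrix; the features selected by entropy would then feed the PCA stage of E-PCA.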
Keywords/Search Tags:high-dimensional sparse big data, dimensionality reduction, principal component analysis, distributed processing