Research On Analysis Model And Distributed Parallel Clustering Method Of High-dimensional Big Data

Posted on:2019-10-19

Degree:Master

Type:Thesis

Country:China

Candidate:F F Zhou

Full Text:PDF

GTID:2428330548485708

Subject:Computer system architecture

Abstract/Summary:

With the rapid development of information technology, data have in recent years not only accumulated in large quantities but also grown rapidly, marking the arrival of the big data era. Big data is found in every field and has become an important economic asset. Effective data analysis and mining promote the efficient and sustainable development of countries, enterprises and society as a whole, and many countries have launched research programs on big data applications. As the angles of observation widen and understanding deepens, high-dimensional big data with tens of thousands of dimensions or more are continually produced in real environments. Faced with such data, classification, clustering and other analysis methods are often unsatisfactory, inefficient or even unusable, because of the curse of dimensionality brought by high dimension and the processing load caused by the large data volume.

This thesis analyzes the existing problems in high-dimensional big data analysis and summarizes domestic and international research on dimensionality reduction, clustering and classification of high-dimensional data, and on big data processing techniques. It points out that feature extraction is an advantageous way to reduce the dimension of the data and to decrease the workload of manual feature selection, and it also points out the disadvantages of using deep neural networks as the learning model for feature extraction from high-dimensional data. For the classification of high-dimensional data, another deep neural network, the multilayer extreme learning machine, is used as the basic model to construct a multilabel classifier, and a classification experiment on multiple power quality disturbances is carried out. The comparison shows that this approach achieves better classification results with high efficiency. In addition, although the k-means clustering algorithm is easy to use and has many other advantages, it is poorly suited to high-dimensional data; to address this, the unsupervised extreme learning machine is used to reduce the dimension of the data before clustering. Compared with experiments that use no dimensionality reduction or other dimensionality reduction algorithms, the clusters obtained by this method are more consistent with the actual patterns in the data, and the clustering is more efficient.

Based on random matrix theory, a feature extraction method for high-dimensional data called FEMPL is proposed, which is suitable for the analysis of ultra-high-dimensional data. Random matrices and the Marčenko-Pastur law are briefly described. From the discrepancy between the limiting spectral distributions of the eigenvalues of random and non-random matrices, the idea of using this discrepancy for feature extraction is derived. The thesis gives the method for representing data as a matrix and the specific features that make up FEMPL, and describes the steps of FEMPL feature extraction. The validity of FEMPL is verified in two cases: the classification of multiple power quality disturbance signals and the embedded analysis of users' electric load data. The cases also show that FEMPL places very flexible requirements on how the data are organized.

Because there is no coupling among data samples in the feature extraction process, FEMPL is easy to parallelize. To alleviate the computational load of high-dimensional big data, a basic model of data analysis using the parallel FEMPL method in a distributed environment is given. Taking k-means clustering as an example, a distributed parallel cluster analysis process is provided that uses the MapReduce computation model to combine FEMPL with k-means.
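
The spectral discrepancy mentioned above can be illustrated with a minimal sketch. It is not the FEMPL procedure itself, whose feature definitions are not given in this abstract; it only shows the standard Marčenko-Pastur fact it builds on: for a p-by-n matrix X with independent zero-mean entries of variance sigma^2 and ratio c = p/n <= 1, the eigenvalues of (1/n) X X^T concentrate on the interval [sigma^2 (1 - sqrt(c))^2, sigma^2 (1 + sqrt(c))^2], while a matrix carrying structure tends to produce eigenvalues outside it. The function names, the 10% tolerance and the rank-one test signal are all assumptions made for illustration.

```python
import numpy as np


def mp_edges(ratio, sigma2=1.0):
    """Support edges of the Marchenko-Pastur law for ratio c = p/n <= 1."""
    lower = sigma2 * (1.0 - np.sqrt(ratio)) ** 2
    upper = sigma2 * (1.0 + np.sqrt(ratio)) ** 2
    return lower, upper


def covariance_eigenvalues(x):
    """Eigenvalues of the sample covariance (1/n) X X^T of a p-by-n matrix."""
    _, n = x.shape
    return np.linalg.eigvalsh(x @ x.T / n)


def count_outliers(x, tol=0.1):
    """Count eigenvalues above the upper MP edge (unit variance assumed).

    `tol` is a hypothetical safety margin for finite-size fluctuations.
    """
    p, n = x.shape
    _, upper = mp_edges(p / n)
    eigenvalues = covariance_eigenvalues(x)
    return int(np.sum(eigenvalues > upper * (1.0 + tol)))


# Illustrative data: pure noise versus noise plus a rank-one component.
rng = np.random.default_rng(0)
p, n = 100, 1000
noise = rng.standard_normal((p, n))

direction = np.ones((p, 1)) / np.sqrt(p)   # unit-norm spatial pattern
amplitude = rng.standard_normal((1, n))    # random time course
structured = noise + 3.0 * direction @ amplitude

print("MP edges:", mp_edges(p / n))
print("outliers in noise:", count_outliers(noise))
print("outliers in structured data:", count_outliers(structured))
```

In this sketch the outlier count acts as a crude, one-number feature of a sample matrix. Because each matrix is processed independently of the others, such a computation could be run as a per-sample map step, which matches the abstract's remark that the feature extraction has no coupling among samples.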

Keywords/Search Tags:

high-dimensional big data, feature extraction, extreme learning machine, random matrix, distributed parallelization

Related items

1	Application Research On Feature Extraction And Classification Of EEG Signal With The Method Of ELM
2	Research And Application Of High Dimensional Data Manifold Structure
3	Research On ELM Image Classification Combining HOG And Random Forest
4	On The Theory And Applications Of Feature Learning From High Dimensional Data
5	Efficient Data Representation Combining With ELM And NMF
6	Research On Classification Methods Based On Extreme Learning Machine
7	The Research On SAR Image Target Recognition Technology Based On Feature Fusion And Extreme Learning Machine
8	Parallelization Research And Implementation Of Obtain And Use On Key Elements In 3D Reconstruction
9	High Dimensional Multispectral Data Classification By Machine Learning
10	Research On DOA Estimation Of Large Arrays Based On Machine Learning And Random Matrix Theory