
Key Technology Research On Mixed Store And Two Level Index Of High-dimensional Big Data In Hadoop

Posted on: 2017-01-15
Degree: Master
Type: Thesis
Country: China
Candidate: Y Liu
Full Text: PDF
GTID: 2308330482499743
Subject: Computer software and theory
Abstract/Summary:
With the rapid development of big data and the Internet, every sector of the economy has seen exponential data growth. Sources such as online transactions and user reviews produce high-dimensional data that are not only huge in volume but also complex in structure. High-dimensional big data exhibit the typical characteristics of both high-dimensional data and big data. Traditional high-dimensional storage and indexing techniques cannot keep pace with the growth in data volume, while traditional big-data storage and indexing techniques struggle with excessive dimensionality, the so-called "curse of dimensionality".

Based on these characteristics, we propose storage and indexing techniques suited to high-dimensional big data, together with query algorithms over the resulting index structure. All components are implemented on the Hadoop platform.

First, the US-ELM-FC clustering algorithm groups strongly correlated dimensions into clusters, and a key dimension is selected from each cluster to represent it. This reduces both dimensionality and the redundancy/correlation between dimensions while preserving the characteristics of the data. The data are then partitioned vertically into key and non-key dimensions: key dimensions are stored in HBase, and non-key dimensions are stored in HDFS. The non-key part is further partitioned horizontally according to the HDFS block size and stored in compressed form. The resulting storage structure for high-dimensional big data is called HB-File.

Second, we construct an index structure based on the clustering result of the US-ELM-FC algorithm: the data space is divided by a variable grid, and grid cells are merged into data subspaces.
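The key-dimension selection and vertical partitioning described above can be sketched as follows. This is a minimal stand-in, not the thesis's actual US-ELM-FC algorithm: it uses a hypothetical greedy grouping by absolute Pearson correlation (the `threshold` parameter and the functions `select_key_dimensions` and `split_vertically` are illustrative names, not from the source).

```python
import numpy as np

def select_key_dimensions(X, threshold=0.8):
    """Group strongly correlated dimensions and pick one representative
    ("key") dimension per group.  Simplified stand-in for the thesis's
    US-ELM-FC clustering step: greedy grouping by |correlation|."""
    n_dims = X.shape[1]
    corr = np.abs(np.corrcoef(X, rowvar=False))  # d x d correlation matrix
    unassigned = set(range(n_dims))
    clusters = []
    while unassigned:
        seed = min(unassigned)
        # all still-unassigned dimensions strongly correlated with the seed
        group = [d for d in unassigned if corr[seed, d] >= threshold]
        clusters.append(group)
        unassigned -= set(group)
    # key dimension of each cluster: highest mean correlation to its peers
    keys = [max(g, key=lambda d: corr[d, g].mean()) for g in clusters]
    return keys, clusters

def split_vertically(X, keys):
    """Vertical partition: key dimensions (destined for HBase)
    versus non-key dimensions (destined for HDFS)."""
    non_keys = [d for d in range(X.shape[1]) if d not in keys]
    return X[:, keys], X[:, non_keys]
```

In the actual system, the key-dimension block would become HBase rows and the non-key block would be compressed and split by HDFS block size; here the split is only returned in memory.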
After division and merging, the data elements are distributed relatively evenly across the subspaces. An M-Tree index is then built within each subspace, forming the local indexes, and a global index is built from the positional relationships between subspaces. Together, the local and global indexes form a variable-grid-based distributed two-level index structure called VGHI.

Third, we propose query algorithms over this index structure. The core idea common to all of them is to use the positional information in the global index to determine which subspaces are relevant to a query, forward the query to each relevant node, execute it locally on each node, and merge the partial answers into the final result.

Finally, extensive experiments on datasets of varying sizes and dimensionalities show that the storage model HB-File and the index structure VGHI are well suited to storing and indexing high-dimensional big data, respectively, and that both are efficient and scalable.
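The two-level query routing can be illustrated with a toy sketch. This is an assumption-laden simplification of VGHI: subspaces are summarized by bounding boxes in a global table, and the per-subspace M-Tree is replaced by a linear scan (the class name `VGHILike` and its interface are hypothetical, and everything runs in one process rather than across Hadoop nodes).

```python
import numpy as np

class VGHILike:
    """Toy two-level index in the spirit of VGHI: a global table of subspace
    bounding boxes routes a query to the few subspaces it can touch; a local
    search inside each routed subspace produces the answer.  The local
    M-Tree is replaced by a linear scan for brevity."""

    def __init__(self, subspaces):
        # subspaces: list of (points, node_id); the "global index" keeps
        # each subspace's minimum bounding rectangle plus its node id
        self.subspaces = subspaces
        self.global_index = [(pts.min(axis=0), pts.max(axis=0), node)
                             for pts, node in subspaces]

    def range_query(self, center, radius):
        center = np.asarray(center, dtype=float)
        results = []
        for (lo, hi, node), (pts, _) in zip(self.global_index, self.subspaces):
            # global step: skip subspaces whose bounding box lies
            # farther from the query center than the radius
            gap = np.maximum(lo - center, 0) + np.maximum(center - hi, 0)
            if np.linalg.norm(gap) > radius:
                continue
            # local step: exact distance test inside the routed subspace
            # (would be an M-Tree search, and a remote call, in the real system)
            dists = np.linalg.norm(pts - center, axis=1)
            results.extend(pts[dists <= radius].tolist())
        return results
```

The design point the sketch captures is that the global step is cheap (one bounding-box test per subspace) and prunes most of the data, so the expensive local search only runs where the query can actually have matches.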
Keywords/Search Tags: high-dimensional data, big data, variable grid, HDFS, ELM