
Key Technology Research On Mixed Store And Two Level Index Of High-dimensional Big Data In Hadoop

Posted on: 2017-01-15
Degree: Master
Type: Thesis
Country: China
Candidate: Y Liu
Full Text: PDF
GTID: 2308330482499743
Subject: Computer software and theory
Abstract/Summary:
With the rapid development of big data and the Internet, every sector of the economy has seen exponential data growth. Sources such as online transactions and user reviews produce high-dimensional data that are not only huge in volume but also complex in structure. High-dimensional big data exhibit the typical characteristics of both high-dimensional data and big data. Traditional high-dimensional storage and indexing techniques cannot keep pace with the growth in data volume, while traditional big-data storage and indexing techniques struggle with excessive dimensionality, the so-called "curse of dimensionality".

Based on these characteristics, we propose storage and indexing techniques suited to high-dimensional big data, together with query algorithms over the resulting index structure. All components are implemented on the Hadoop platform.

First, the US-ELM-FC clustering algorithm groups strongly correlated dimensions into clusters, and a key dimension is selected from each cluster to represent it. This reduces both dimensionality and the redundancy/correlation between dimensions while preserving the characteristics of the data. The data are then partitioned vertically into key and non-key dimensions: key dimensions are stored in HBase, and non-key dimensions are stored in HDFS. The non-key part is further partitioned horizontally according to the HDFS block size and stored in compressed form. The resulting storage structure for high-dimensional big data is called HB-File.

Second, we construct an index structure based on the clustering result of the US-ELM-FC algorithm: the data space is divided by a variable grid, and grid cells are merged into data subspaces.
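The key-dimension selection and vertical partitioning described above can be sketched as follows. This is a minimal stand-in, not the thesis's actual US-ELM-FC algorithm: it uses a hypothetical greedy grouping by absolute Pearson correlation (the `threshold` parameter and the functions `select_key_dimensions` and `split_vertically` are illustrative names, not from the source).

```python
import numpy as np

def select_key_dimensions(X, threshold=0.8):
    """Group strongly correlated dimensions and pick one representative
    ("key") dimension per group.  Simplified stand-in for the thesis's
    US-ELM-FC clustering step: greedy grouping by |correlation|."""
    n_dims = X.shape[1]
    corr = np.abs(np.corrcoef(X, rowvar=False))  # d x d correlation matrix
    unassigned = set(range(n_dims))
    clusters = []
    while unassigned:
        seed = min(unassigned)
        # all still-unassigned dimensions strongly correlated with the seed
        group = [d for d in unassigned if corr[seed, d] >= threshold]
        clusters.append(group)
        unassigned -= set(group)
    # key dimension of each cluster: highest mean correlation to its peers
    keys = [max(g, key=lambda d: corr[d, g].mean()) for g in clusters]
    return keys, clusters

def split_vertically(X, keys):
    """Vertical partition: key dimensions (destined for HBase)
    versus non-key dimensions (destined for HDFS)."""
    non_keys = [d for d in range(X.shape[1]) if d not in keys]
    return X[:, keys], X[:, non_keys]
```

In the actual system, the key-dimension block would become HBase rows and the non-key block would be compressed and split by HDFS block size; here the split is only returned in memory.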
After division and merging, the data elements are distributed relatively evenly across the subspaces. An M-Tree index is then built within each subspace, forming the local indexes, and a global index is built from the positional relationships between subspaces. Together, the local and global indexes form a variable-grid-based distributed two-level index structure called VGHI.

Third, we propose query algorithms over this index structure. The core idea common to all of them is to use the positional information in the global index to determine which subspaces are relevant to a query, forward the query to each relevant node, execute it locally on each node, and merge the partial answers into the final result.

Finally, extensive experiments on datasets of varying sizes and dimensionalities show that the storage model HB-File and the index structure VGHI are well suited to storing and indexing high-dimensional big data, respectively, and that both are efficient and scalable.
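The two-level query routing can be illustrated with a toy sketch. This is an assumption-laden simplification of VGHI: subspaces are summarized by bounding boxes in a global table, and the per-subspace M-Tree is replaced by a linear scan (the class name `VGHILike` and its interface are hypothetical, and everything runs in one process rather than across Hadoop nodes).

```python
import numpy as np

class VGHILike:
    """Toy two-level index in the spirit of VGHI: a global table of subspace
    bounding boxes routes a query to the few subspaces it can touch; a local
    search inside each routed subspace produces the answer.  The local
    M-Tree is replaced by a linear scan for brevity."""

    def __init__(self, subspaces):
        # subspaces: list of (points, node_id); the "global index" keeps
        # each subspace's minimum bounding rectangle plus its node id
        self.subspaces = subspaces
        self.global_index = [(pts.min(axis=0), pts.max(axis=0), node)
                             for pts, node in subspaces]

    def range_query(self, center, radius):
        center = np.asarray(center, dtype=float)
        results = []
        for (lo, hi, node), (pts, _) in zip(self.global_index, self.subspaces):
            # global step: skip subspaces whose bounding box lies
            # farther from the query center than the radius
            gap = np.maximum(lo - center, 0) + np.maximum(center - hi, 0)
            if np.linalg.norm(gap) > radius:
                continue
            # local step: exact distance test inside the routed subspace
            # (would be an M-Tree search, and a remote call, in the real system)
            dists = np.linalg.norm(pts - center, axis=1)
            results.extend(pts[dists <= radius].tolist())
        return results
```

The design point the sketch captures is that the global step is cheap (one bounding-box test per subspace) and prunes most of the data, so the expensive local search only runs where the query can actually have matches.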
Keywords/Search Tags: high-dimensional data, big data, variable grid, HDFS, ELM