
Research On Data Index Application In The MapReduce Framework

Posted on: 2016-11-02    Degree: Master    Type: Thesis
Country: China    Candidate: Q N Liu    Full Text: PDF
GTID: 2428330482981288    Subject: Computer application technology
Abstract/Summary:
With the development of cloud computing and networking, sensors and microprocessors are now used in every corner of the earth. These rich data sources have led to sustained, explosive growth in data volume, and the complexity of the data is also increasing. We are living in the era of big data. How to manage vast amounts of data and analyze them effectively is a central issue in academic research. The MapReduce programming model is a key technology for big data. It breaks operations on large-scale datasets (greater than 1 TB) into a number of parallel computations and processes the data in parallel across a large number of computing nodes. Based on the MapReduce framework, this paper studies data indexing technology as follows.

First, this paper analyzes the advantages of processing data and tasks in parallel on large-scale clusters under the MapReduce programming model. A method for optimizing data block partitioning and data storage based on the MapReduce framework is proposed. The data are distributed uniformly into data blocks according to the principles of relevance and distribution.

Second, this paper analyzes traditional indexing technology, high-dimensional indexing technology, and indexing technology based on the MapReduce framework. To shrink the search space, the approximate vector representation uses a simple vector to represent the corresponding high-dimensional vector compactly. The one-dimensional transformation converts a high-dimensional vector into a one-dimensional representation. BC-iDistance combines the two techniques and compresses a d-dimensional vector into a two-dimensional vector. In this paper, high-dimensional vector compression is based on BC-iDistance. A distributed index structure with two layers is designed. During a search, three-layer data filtering is realized using global indexes, local indexes, and the index values of two-dimensional bit codes. In this way, both the search range and the amount of computation on high-dimensional vectors are reduced.

Third, the application of massive data indexes is also studied in this paper. The problem of "resource overload" has emerged with the era of big data and brings new challenges to a variety of data query systems. The personalized recommendation system is a common application of artificial intelligence. A parallel query method for personalized recommendation is proposed based on the designed double-layer index. The analysis and clustering of massive Web resources can be completed offline on the basis of the data-partition strategy, which improves the efficiency of the application.

Finally, we verify the validity of the proposed method by experiments. The experiments show that the data partitioning strategy and the MapReduce-based high-dimensional data index are effective and practical for improving the query efficiency of high-dimensional data.
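
The compression step described above can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not give the BC-iDistance formulas, so the sketch assumes the commonly described iDistance key (partition index times a constant `c`, plus the distance to that partition's reference point) and assumes a bit code with one bit per dimension that records whether the vector lies at or above the reference point in that dimension. The names `compress`, `partition_of`, `refs`, and `c` are hypothetical.

```python
import math


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def compress(vec, refs, c):
    """Map a d-dimensional vector to a two-component (key, bitcode) pair.

    key     : iDistance-style value i * c + dist(vec, refs[i]), where i is
              the index of the nearest reference point.
    bitcode : integer whose bit j is set when vec[j] >= refs[i][j]
              (an assumed encoding, not taken from the thesis).
    """
    # Assign the vector to the partition of its nearest reference point.
    i = min(range(len(refs)), key=lambda k: dist(vec, refs[k]))
    key = i * c + dist(vec, refs[i])

    bitcode = 0
    for j, (x, r) in enumerate(zip(vec, refs[i])):
        if x >= r:
            bitcode |= 1 << j
    return key, bitcode


def partition_of(key, c):
    """Recover the partition index from a key (valid when every
    within-partition distance is smaller than c)."""
    return int(key // c)
```

In this sketch `c` must be larger than any distance between a vector and its reference point, so that the key ranges of different partitions do not overlap. A search could then discard whole partitions by key range first and compare bit codes next, before computing any full high-dimensional distance, which is one plausible reading of the layered filtering the abstract describes.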
Keywords/Search Tags:MapReduce, data index, KNN query, high-dimensional vector, cloud computing