Research On Technologies Of Efficient Data Access Based On Hadoop

Posted on:2017-02-16

Degree:Master

Type:Thesis

Country:China

Candidate:Y Z Wang

Full Text:PDF

GTID:2308330503987183

Subject:Computer Science and Technology

Abstract/Summary:

With the explosive growth of data volumes, traditional relational database management systems can no longer meet the data processing requirements of the big data era, and tools that can store and process massive data are urgently needed. Hadoop emerged in this context. As a distributed system, Hadoop uses cluster resources to store and process massive data, and its high reliability, scalability, and fault tolerance have made it the industry standard for massive data processing. Hadoop was designed for large-scale aggregation tasks, which typically need to process all of the data, so it scans the entire dataset whenever it runs a job. Over time, however, Hadoop has come to be used for many other kinds of tasks, such as selection queries. These tasks need only part of the data, yet Hadoop still scans the whole dataset, which results in low data access efficiency. To address this problem, this thesis draws on the experience of traditional relational databases and introduces an indexing mechanism for Hadoop that avoids full data scans.

First, the thesis analyzes Hadoop’s key components, HDFS and MapReduce, and proposes two indexing schemes based on that analysis. The first scheme is a global index based on data blocks. It introduces the concept of distributed ordered tables and describes the implementation of the index in four aspects: index format, index storage, index creation, and index use. The second scheme is a distributed index on the cluster. The thesis first analyzes how the distributed index is organized, then compares global and local indexes, and finally presents the implementation of the scheme and analyzes the fault tolerance of the indexed Hadoop system.

Finally, the effectiveness of the two proposed indexing schemes is verified through a series of comparative experiments in which selection query tasks are run on both schemes and the results are compared. The experimental results confirm that using an index to improve the efficiency of Hadoop’s data access is feasible. The thesis closes by summarizing its content, analyzing the shortcomings of the current study, and proposing ideas for further work.
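
The idea of a block-level global index over an ordered table can be illustrated with a small in-memory sketch. This is not the thesis's implementation, whose index format and storage are not given in the abstract. It assumes records are globally sorted by key and split into blocks, and that the index keeps only the minimum and maximum key of each block. The names `BlockIndexEntry`, `build_block_index`, and `select_range` are hypothetical.

```python
from bisect import bisect_left
from dataclasses import dataclass


@dataclass
class BlockIndexEntry:
    """One index entry per data block (hypothetical layout)."""
    block_id: int
    min_key: int
    max_key: int


def build_block_index(blocks):
    """Build the index from blocks of (key, value) records.

    Assumption: records are sorted by key within and across blocks,
    so the first and last record of a block give its key range.
    """
    return [
        BlockIndexEntry(i, block[0][0], block[-1][0])
        for i, block in enumerate(blocks)
        if block  # skip empty blocks
    ]


def select_range(blocks, index, lo, hi):
    """Return records with lo <= key <= hi and the ids of blocks read."""
    max_keys = [entry.max_key for entry in index]
    # First block whose max_key >= lo; earlier blocks cannot match.
    start = bisect_left(max_keys, lo)
    hits, blocks_read = [], []
    for entry in index[start:]:
        if entry.min_key > hi:
            break  # later blocks hold only larger keys
        blocks_read.append(entry.block_id)
        hits.extend(r for r in blocks[entry.block_id] if lo <= r[0] <= hi)
    return hits, blocks_read
```

In this sketch a selection query reads only the blocks whose key range overlaps the query range, whereas a job without an index would read every block.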

Keywords/Search Tags:

massive data, Hadoop, index, data access

Related items

1	Research And Application Of Massive Data Processing Model Based On Hadoop
2	The Design And Implementation Of Massive Data Storage And Calculation Platform Based On Hadoop
3	Research On Unified Access Platform For Unstructured Data And Index Technology
4	The Management Of Massive Images Data Based On Hadoop
5	Research On Hadoop Based Telecom Operators Massive Data Processing Technology And Its Applications
6	Research And Application Of The Massive Web Data Analysis Based On Hadoop
7	Research Data Storage Index Mechanism Massive GML Space Ambient Cloud
8	Platform Development On Massive Data Collection And Processing Based On Hadoop
9	Research On Eigen-keneral-data Retrieving Algorithms Based On Hadoop
10	Research And Implementation Of Duplicate Data Clean-up Model Based On Hadoop