Research On Technologies Of Efficient Data Access Based On Hadoop

Posted on: 2017-02-16
Degree: Master
Type: Thesis
Country: China
Candidate: Y Z Wang
GTID: 2308330503987183
Subject: Computer Science and Technology
Abstract/Summary:
With the explosive growth of data volume, traditional relational database management systems can no longer meet the data processing requirements of the big data era. Tools that can cope with massive data storage and computation are urgently needed, and Hadoop emerged in this environment. As a distributed system, Hadoop uses cluster resources to store and process massive data, and its high reliability, scalability, and fault tolerance have made it the industry standard for massive data processing. Hadoop was designed for large-scale aggregation tasks, which usually need to process all of the data, so it scans the entire dataset whenever it processes a job. Over time, however, people have come to use Hadoop for many other kinds of tasks, such as selection queries. These tasks do not need to process all of the data, yet Hadoop still scans the whole dataset, which makes its data access inefficient. To address this problem, this thesis draws on the experience of traditional relational databases and introduces an indexing mechanism for Hadoop that avoids full data scans.
First, this thesis analyzes Hadoop's key components, HDFS and MapReduce. Based on this analysis, we propose two indexing schemes. The first scheme is a global index based on data blocks. It introduces the concept of distributed ordered tables, and we describe its realization in four aspects: the index format, the storage of the index, index creation, and the use of the index. The second scheme is a distributed index on the cluster. For this scheme, we first analyze the organization of the distributed index and compare global and local indexes, then describe the realization of the scheme and analyze the fault tolerance of the indexed Hadoop system.
Finally, we verify the effectiveness of the two proposed indexing schemes through a series of comparative experiments. We ran selection query tasks on both schemes and compared the results. The experimental results confirm that using an index to improve the efficiency of Hadoop's data access is feasible. We conclude by summarizing the content of this thesis, analyzing the shortcomings of the current study, and putting forward ideas for further work.
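As an illustration of why a block-based index can avoid a full scan, the following is a minimal single-process sketch in Python. It is not the thesis's implementation: the index layout (one `(min_key, max_key, block_id)` entry per block), the function names `build_block_index` and `select`, and the in-memory lists standing in for HDFS blocks are all assumptions made for this example. It assumes only that records are globally sorted by key across blocks, in the spirit of the ordered tables mentioned above.

```python
from bisect import bisect_left, bisect_right

def build_block_index(blocks):
    """Build one (min_key, max_key, block_id) entry per non-empty block.

    Assumes (hypothetically) that each block is a list of (key, value)
    records and that keys are sorted globally across blocks.
    """
    return [(blk[0][0], blk[-1][0], i) for i, blk in enumerate(blocks) if blk]

def select(blocks, index, key):
    """Return (matching records, number of blocks read) for an equality query."""
    min_keys = [entry[0] for entry in index]
    max_keys = [entry[1] for entry in index]
    lo = bisect_left(max_keys, key)    # first block whose max_key >= key
    hi = bisect_right(min_keys, key)   # one past the last block whose min_key <= key
    hits = []
    for _, _, block_id in index[lo:hi]:
        # Only candidate blocks are read; all other blocks are skipped.
        hits.extend(rec for rec in blocks[block_id] if rec[0] == key)
    return hits, max(0, hi - lo)

# Toy data: 100 sorted records split into 10 blocks of 10 records each.
records = [(k, f"row{k}") for k in range(100)]
blocks = [records[i:i + 10] for i in range(0, 100, 10)]
index = build_block_index(blocks)

rows, blocks_read = select(blocks, index, 42)
print(rows, blocks_read)
```

In this toy setup a selection query touches only the blocks whose key range can contain the requested key, instead of all ten blocks. A real deployment would additionally have to decide where the index is stored and how it is kept consistent with the data, which are the aspects the abstract lists for the first scheme.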
Keywords/Search Tags: massive data, Hadoop, index, data access