Research On Data Storage And Search Methods Of Structured Data Based On HDFS

Posted on: 2015-03-07
Degree: Master
Type: Thesis
Country: China
Candidate: M M Yang
Full Text: PDF
GTID: 2268330431955433
Subject: Electronic and communication engineering
Abstract/Summary:
The "4V" characteristics of big data (volume, variety, value, and velocity) have left traditional relational database clusters unable to hold very large volumes of structured data. As a result, databases built on distributed file systems have become a research hotspot. Such a database uses the Hadoop Distributed File System (HDFS) to store data and adopts a Massively Parallel Processing (MPP) architecture as its scheduling engine. HDFS is typically deployed on nodes that have independent infrastructure and are connected to one another over a network. One node is responsible for metadata storage and the others store file data; all communication between them travels over the network. Current relational databases based on HDFS have the following shortcomings. First, they cannot be deployed across different datacenters, so they offer no cross-datacenter query capability. Second, they place the data belonging to one table on a small number of storage nodes without any optimization, which sharply reduces the concurrency of table traversal. Third, to keep the load balanced, data must be migrated whenever the number of nodes changes; all nodes take part in the migration, so the lengthy migration degrades real-time queries against the HDFS-based database.

Supported by the 242 project "Research and verification on cloud storage key technologies of relational data", this thesis carries out exploratory research on data storage, search, and migration methods from the perspective of HDFS storage, and achieves deployment and querying of an HDFS-based relational database across different datacenters.

The main work of this thesis covers three aspects. First, the deployment across different datacenters of Impala, a database based on HDFS. This thesis deploys the Impala system on datacenters connected across a WAN or located in different regions. Second, research on a data storage and search method based on circular distributed hashing. This thesis applies a distributed hash table and the Chord ring to the distributed file system. A hash value is calculated for every data item and every storage node, and this value is used to map each data item to its storage node. Using the saved metadata, a binary search then finds the location of the data. Third, research on data migration based on circular distributed hashing. When a new data node is added to HDFS, its "neighbor" transfers part of its data to it. Similarly, when a node fails, its data is backed up to its "neighbor".

The main innovative contributions of this thesis are as follows. First, deploying the Impala system across datacenters in different regions, which extends the range of scenarios Impala can serve and supports big data applications across regions. Second, proposing a data storage and search method based on circular distributed hashing, which distributes data evenly, improves the concurrency of table traversal, and reduces query time. Third, proposing a data migration method based on circular distributed hashing, which shortens the migration process and guarantees the consistency and effectiveness of database use.

Finally, this thesis presents simulation experiments for the data storage, search, and migration methods, and demonstrates the effectiveness of the proposed methods in comparison with the original HDFS strategy.
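
The circular hash placement, binary-search lookup, and neighbor-only migration described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the use of SHA-1, the Chord-style successor rule (a key is stored on the first node at or after its hash position), the `HashRing` class, and the node and block names are all assumptions made for illustration.

```python
import bisect
import hashlib


def ring_hash(name: str) -> int:
    """Position of a node or data key on the ring (SHA-1 is an assumption)."""
    return int.from_bytes(hashlib.sha1(name.encode("utf-8")).digest(), "big")


class HashRing:
    """Hypothetical circular hash: a key belongs to the first node at or after it."""

    def __init__(self, nodes=()):
        self._points = []  # sorted ring positions of storage nodes (the "metadata")
        self._owner = {}   # ring position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        point = ring_hash(node)
        if point in self._owner:
            raise ValueError(f"ring position already taken: {node}")
        bisect.insort(self._points, point)  # keep positions sorted
        self._owner[point] = node

    def remove_node(self, node: str) -> None:
        point = ring_hash(node)
        self._points.remove(point)  # raises ValueError if the node is absent
        del self._owner[point]

    def locate(self, key: str) -> str:
        if not self._points:
            raise LookupError("ring has no storage nodes")
        # Binary search for the first node position >= hash(key); wrap past the end.
        i = bisect.bisect_left(self._points, ring_hash(key))
        return self._owner[self._points[i % len(self._points)]]


def moved_keys(before: dict, after: dict) -> list:
    """Keys whose storage node differs between two placements."""
    return [k for k in before if before[k] != after[k]]


ring = HashRing(["dn1", "dn2", "dn3", "dn4"])
keys = [f"table_a/block-{i}" for i in range(200)]
before = {k: ring.locate(k) for k in keys}

ring.add_node("dn5")  # scale out by one node
after = {k: ring.locate(k) for k in keys}
moved = moved_keys(before, after)
donors = {before[k] for k in moved}  # nodes that had to hand over data
print(f"{len(moved)} of {len(keys)} blocks move to dn5, donors: {sorted(donors)}")
```

Under these assumptions, every block that moves goes to the new node and comes from a single existing node (its successor on the ring), so the other nodes are not involved in the migration. Removing `dn5` again returns each block to that same successor, which mirrors the failure case in which a node's data ends up on its "neighbor".
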
Keywords/Search Tags: Distributed File System, Distributed Hash, Data Storage, Data Migration, Datacenters across Regions