
Key Technologies For Big Data Processing

Posted on: 2017-10-21  Degree: Doctor  Type: Dissertation
Country: China  Candidate: Z Y Wang  Full Text: PDF
GTID: 1318330533455181  Subject: Computer Science and Technology
Abstract/Summary:
Big data aids us in trend prediction and business decision making, while at the same time bringing enormous new challenges. In particular, a series of key problems must be resolved during Big data processing. Excessive redundant computation results in a huge waste of computing and storage resources. File access in Big data processing is non-uniform, yet the existing storage architecture does not suit this pattern well. When a multi-replica strategy is adopted to enhance the reliability of a Big data platform's metadata, both the metadata replication process and metadata update operations become inefficient. Last but not least, reducing the cost of providing disaster recovery for Big data remains a difficult problem. This thesis focuses on the key challenges in Big data processing listed above; its main contents and contributions are summarized as follows:

To detect redundant computations rapidly, this thesis presents a duplicate query detection mechanism based on pre-classification. After each query is classified according to its features, duplicate detection of a query statement is performed only against the corresponding subset of the historical data. With the objective of preventing redundant computation, the mechanism keeps detection time from growing too fast as the historical data expands.

For the non-uniform file access characteristic of Big data processing, this thesis proposes a solution based on a tiered storage architecture, which recognizes the hot data in the global dataset according to the current workload and utilizes a shared storage cluster to accelerate the processing of that hot data.

For the inefficient metadata replication caused by the multi-replica strategy, this thesis proposes a metadata replication method based on a separated replication strategy: the replication of in-memory metadata and of the on-disk operations log are handled independently, which avoids disk I/O during metadata replication. While ensuring the reliability of the metadata, the proposed method shortens metadata replication time.

For the issue of metadata consistency in a Big data platform, this thesis proposes an atomic commitment protocol called Batch-2PC. The protocol batches the execution and commitment of multiple metadata update operations to reduce the network delays they generate, and further shortens the completion time of metadata updates through conflict detection.

This thesis also designs and implements a disaster recovery system for key information, which provides an efficient disaster recovery solution for a Big data platform. The system adopts cloud storage to reduce the cost of disaster recovery and uses data deduplication to optimize remote data transmission and data recovery time.
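The duplicate query detection mechanism is only described above at a high level. The following is a minimal Python sketch of the pre-classification idea: queries are bucketed by coarse features so that duplicate detection scans one bucket of the query history instead of the whole history. The feature extraction, the `tbl_` table-naming convention, and the fingerprinting scheme are illustrative assumptions, not the dissertation's actual design.

    # Sketch: pre-classification narrows duplicate detection to one bucket of history.
    import hashlib
    from collections import defaultdict


    def query_features(sql: str) -> tuple:
        """Coarse features used only to pick a bucket (assumed: verb + referenced tables)."""
        tokens = sql.lower().split()
        tables = tuple(sorted(t for t in tokens if t.startswith("tbl_")))
        verb = tokens[0] if tokens else ""
        return (verb, tables)


    class DuplicateQueryDetector:
        def __init__(self):
            # bucket key -> fingerprints of previously executed queries
            self.history = defaultdict(set)

        def is_duplicate(self, sql: str) -> bool:
            bucket = query_features(sql)
            normalized = " ".join(sql.lower().split())
            fingerprint = hashlib.sha1(normalized.encode()).hexdigest()
            if fingerprint in self.history[bucket]:
                return True          # previous result can be reused, skip recomputation
            self.history[bucket].add(fingerprint)
            return False


    detector = DuplicateQueryDetector()
    print(detector.is_duplicate("SELECT count(*) FROM tbl_orders"))      # False, first occurrence
    print(detector.is_duplicate("select COUNT(*)   from tbl_orders"))    # True, normalized duplicate

Because each incoming query is compared only against its own bucket, detection cost grows with the bucket size rather than with the full history, which is the effect the abstract describes.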
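The batching idea behind Batch-2PC can likewise be illustrated with a small sketch: instead of paying one two-phase-commit network round per metadata update, non-conflicting updates are grouped and prepared/committed together, and a conflict (two updates to the same key) starts a new batch. The participant interface, the conflict rule, and all names below are assumptions for illustration only.

    # Sketch: batch several metadata updates into one two-phase-commit round.
    from dataclasses import dataclass


    @dataclass
    class MetadataUpdate:
        key: str      # e.g. a path in the namespace
        value: str    # new metadata value


    def split_into_batches(updates):
        """Conflict detection: no batch may touch the same key twice."""
        batches, current, seen = [], [], set()
        for u in updates:
            if u.key in seen:            # conflict -> close the current batch
                batches.append(current)
                current, seen = [], set()
            current.append(u)
            seen.add(u.key)
        if current:
            batches.append(current)
        return batches


    class InMemoryReplica:
        """Toy participant that always votes yes and applies a batch to a local dict."""
        def __init__(self):
            self.store = {}

        def prepare(self, batch):
            return True

        def commit(self, batch):
            for u in batch:
                self.store[u.key] = u.value

        def abort(self, batch):
            pass


    def two_phase_commit(batch, participants):
        """One 2PC round for a whole batch instead of one round per update."""
        if all(p.prepare(batch) for p in participants):   # phase 1: prepare the batch
            for p in participants:
                p.commit(batch)                           # phase 2: commit the batch
            return True
        for p in participants:
            p.abort(batch)
        return False


    replicas = [InMemoryReplica(), InMemoryReplica()]
    updates = [MetadataUpdate("/a", "1"), MetadataUpdate("/b", "2"), MetadataUpdate("/a", "3")]
    for batch in split_into_batches(updates):
        two_phase_commit(batch, replicas)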
Keywords/Search Tags: Big data, duplicate query detection, tiered storage, replication, consistency