Research on Several Key Techniques for Distributed Data Processing

Posted on: 2019-01-30
Degree: Doctor
Type: Dissertation
Country: China
Candidate: R K Wu
GTID: 1368330590970381
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of information technology, large-scale data is produced constantly, in an increasing variety of formats and types. How to process big data efficiently has become both a focus and a difficulty of current research. Building on data processing in distributed systems, this thesis proposes a series of software mechanisms and techniques for distributed data processing and storage. Specifically, the following research is conducted: (1) Based on big data processing research on "Sunway TaihuLight", we implement a distributed parallel computing framework, SunwayMR, which makes full use of the servers in a cluster to speed up data processing and analysis. (2) We propose a software construction technique for building a distributed parallel computing framework, to help users obtain autonomous software rapidly and effectively. (3) We propose RHKV, an RDMA- and HTM-friendly key-value storage library for distributed systems. The library makes full use of RDMA (Remote Direct Memory Access) and HTM (Hardware Transactional Memory) so that basic data operations, such as "put" and "get", can be performed for data-intensive computing. (4) We propose EDAWS as a general solution: a novel distributed framework with an efficient data analytics workspace towards discriminative service for critical infrastructures.

The details are as follows.

(1) We present a parallel computing framework called SunwayMR, which requires only a GCC/G++ environment. It provides distributed data partitioning, message communication, and task organization to support transparent application execution on parallel hardware. For ease of use, its open application programming interfaces (APIs) can be invoked by various applications with less handwritten code than OpenMPI/MPI requires. Results indicate that SunwayMR, running on 16 computational nodes, scales well with data size, the number of computing nodes, and the number of threads.

(2) To analyze the inner requirements of distributed parallel computing systems from the perspective of software construction, the architecture model should be analyzed. However, an ill-defined architecture model may lead to system disruption, and systematic variations are unpredictable during the design and development phases. To address these deficiencies, this thesis presents a self-adaptive architecture modeling approach (a process specification) that supports architecture design, behavior analysis, and self-adaptation together during the development of distributed parallel computing systems. We summarize the software construction process of our prototype system, SunwayMR, and present an empirical study to illustrate the method's usefulness and effectiveness.

(3) Exploiting the spare DRAM of machines in a distributed environment to build storage through a key-value abstraction has proved an attractive way to provide fast data access for data-intensive computing. Such storage systems can be regarded as a new generation of enterprise storage architecture, and they are an important response to storage capacity pressure, I/O performance bottlenecks, and storage cost crises. However, because of network round trips and request conflicts, remote data access over traditional commodity networking technology may incur high latency. This thesis proposes RHKV, a novel RDMA- and HTM-friendly key-value storage library. Specifically, an RHKV client transmits requests to G-Cuckoo, our improved Cuckoo hashing scheme, which constructs a Cuckoo graph as directed pseudoforests in the RHKV server. The server maintains a bucket-to-vertex mapping and determines whether a loop is possible before each insertion. With this Cuckoo graph, the endless insertion loop that generic Cuckoo hashing can run into can be detected. Moreover, RHKV uses HTM to ensure the atomicity of data operations. A comparative performance evaluation is conducted using YCSB workloads.

(4) With the explosion of specialized information in critical infrastructure systems, providing discriminative services has become a concern. When massive data analytics is conducted on a standalone server, performance degrades severely. We propose EDAWS as a general solution: a Novel Distributed Framework with Efficient Data Analytics Workspace towards Discriminative Service for Critical Infrastructures. The server-side platform supports native data capture, storage, indexing, and data mining in a systematic organization, and it can be accessed remotely and conveniently by mobile clients. To demonstrate the solution, a case study of a smart residence prototype towards discriminative services is discussed thoroughly. Extensive experimental studies are conducted on the prototype system over real-world datasets. Experimental results indicate that data processing on the computing nodes scales well with data size and the number of computing nodes, and that the prototype proceeds from original data to discriminative services intelligently.
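
The loop pre-check described for G-Cuckoo in item (3) can be illustrated with a small sketch. This is not the thesis's implementation: it assumes the textbook setting of two candidate buckets per key and one slot per bucket, treats each bucket as a vertex and each key as an edge, and uses a union-find structure (a substitute for the directed pseudoforest the abstract mentions) to decide before inserting whether the displacement chain could loop forever. The class and method names are hypothetical, and deletions are not handled.

```python
class CuckooGraph:
    """Illustrative cuckoo graph: buckets are vertices, keys are edges.

    Assumption: one slot per bucket, two candidate buckets per key.
    Under that assumption an insertion can succeed only while every
    connected component contains at most one cycle (a pseudoforest).
    """

    def __init__(self, num_buckets):
        self.parent = list(range(num_buckets))
        # has_cycle[r] is meaningful only when r is a component root.
        self.has_cycle = [False] * num_buckets

    def find(self, v):
        # Union-find lookup with path halving.
        while self.parent[v] != v:
            self.parent[v] = self.parent[self.parent[v]]
            v = self.parent[v]
        return v

    def can_insert(self, b1, b2):
        """Return True if a key hashing to buckets b1 and b2 still fits."""
        r1, r2 = self.find(b1), self.find(b2)
        if r1 == r2:
            # Same component: a second cycle would mean more keys than slots.
            return not self.has_cycle[r1]
        # Different components: at least one must still have a free slot.
        return not (self.has_cycle[r1] and self.has_cycle[r2])

    def insert(self, b1, b2):
        """Record the key as an edge; refuse it if it would loop forever."""
        if not self.can_insert(b1, b2):
            return False
        r1, r2 = self.find(b1), self.find(b2)
        if r1 == r2:
            self.has_cycle[r1] = True
        else:
            self.parent[r2] = r1
            self.has_cycle[r1] = self.has_cycle[r1] or self.has_cycle[r2]
        return True


if __name__ == "__main__":
    g = CuckooGraph(4)
    for buckets in [(0, 1), (1, 2), (2, 0), (0, 1)]:
        print(buckets, g.insert(*buckets))
```

In this sketch a rejected insertion is where a real store would fall back to rehashing or resizing instead of displacing keys indefinitely; how RHKV itself reacts is not stated in the abstract.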
Keywords/Search Tags: Data-Driven, Distributed System, Software Mechanism, Distributed Data Parallel Computing Framework, Software Construction Technique, Key-Value Data Storage, Big Data Service