Font Size: a A A

Research And Implementation Of Distributed And Parallel Algorithm For Large-Scale Functional Dependency Discovery

Posted on:2019-01-22Degree:MasterType:Thesis
Country:ChinaCandidate:Q WangFull Text:PDF
GTID:2428330545476736Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
Functional Dependency(FD)is the most common constraint in the relational database,which indicates the dependencies between attributes.FD is widely used in database analysis and design,query optimization,schema normalization,data integration,data cleaning and so on.However,the FD is usually unknown and almost impossible to be detected manually for specific data sets.Therefore,the research of the FD algorithm has been a hot topic in the database field.In recent years,with the rapid development of Internet and other information technology,all walks of life have entered the era of big data.In big data scenarios,the computation and memory ovearhead lead the classic stand-alone algorithms such as TANE,FUN,DFD,Dep-Miner,FastFDs,Fdep and HyFD to be very inefficient.In order to solve these problem,some researchers have proposed distributed FD discovery algorithms,such as MFDM,HFDD and FDPar_Discover.However,they have two drawbacks.First,they don't scale well with the number of attributes.Second,the load balancing problem is not considered.Therefore,it is urgent to study an efficient,scalable and distributed FD discovery algorithm for big data.In order to solve the shortcomings of the existing work,we proposes a distributed FD discovery algorithm called SmartFD,which aims to reduce the cost of computation and memory and improve the utilization of computing resources.The main contents and main contributions of this paper are as follows:(1)We propose a sorting algorithm for attributes,which can improve the utilization of computing resources in the distributed FD discovery statge.(2)We propose a cluster-aware distributed sampling algorithm and prune-generate algorithm to efficiently prune invalid FDs and generate candidate FDs.(3)We propose a distributed verification method based on lightweight index.This method uses lightweight index to reduce computation and memory overhead in the verification process.(4)We propose an adaptive handover strategy.The strategy improves the overall efficiency of the algorithm by adaptive switching between sampling and verification.(5)We implements the SmartFD algorithm based on Apache Spark,and compares the performance with the HyFD and HFDD algorithms on a number of different datasets The experimental results show that the SmartFD algorithm obtains the superlinear acceleration ratio on large scale datasets compared with the HyFD algorithm,and the acceleration ratio is 2.5?455.7 times compared with the HFDD algorithm.At the same time,SmartFD also shows near linear row scalability and node scalability.In addition,the growth trend of running time is almost the same as that of FD,hence SmartFD achieves better column scalability.
Keywords/Search Tags:Big Data, Functional Dependency Discovery, Distributed Sampling, Distributed Verification
PDF Full Text Request
Related items