Research And Implementation Of Distributed And Parallel Algorithm For Large-Scale Functional Dependency Discovery

Posted on:2019-01-22

Degree:Master

Type:Thesis

Country:China

Candidate:Q Wang

Full Text:PDF

GTID:2428330545476736

Subject:Computer Science and Technology

Abstract/Summary:

PDF Full Text Request

Functional Dependency(FD)is the most common constraint in the relational database,which indicates the dependencies between attributes.FD is widely used in database analysis and design,query optimization,schema normalization,data integration,data cleaning and so on.However,the FD is usually unknown and almost impossible to be detected manually for specific data sets.Therefore,the research of the FD algorithm has been a hot topic in the database field.In recent years,with the rapid development of Internet and other information technology,all walks of life have entered the era of big data.In big data scenarios,the computation and memory ovearhead lead the classic stand-alone algorithms such as TANE,FUN,DFD,Dep-Miner,FastFDs,Fdep and HyFD to be very inefficient.In order to solve these problem,some researchers have proposed distributed FD discovery algorithms,such as MFDM,HFDD and FDPar_Discover.However,they have two drawbacks.First,they don't scale well with the number of attributes.Second,the load balancing problem is not considered.Therefore,it is urgent to study an efficient,scalable and distributed FD discovery algorithm for big data.In order to solve the shortcomings of the existing work,we proposes a distributed FD discovery algorithm called SmartFD,which aims to reduce the cost of computation and memory and improve the utilization of computing resources.The main contents and main contributions of this paper are as follows:(1)We propose a sorting algorithm for attributes,which can improve the utilization of computing resources in the distributed FD discovery statge.(2)We propose a cluster-aware distributed sampling algorithm and prune-generate algorithm to efficiently prune invalid FDs and generate candidate FDs.(3)We propose a distributed verification method based on lightweight index.This method uses lightweight index to reduce computation and memory overhead in the verification process.(4)We propose an adaptive handover strategy.The strategy improves the overall efficiency of the algorithm by adaptive switching between sampling and verification.(5)We implements the SmartFD algorithm based on Apache Spark,and compares the performance with the HyFD and HFDD algorithms on a number of different datasets The experimental results show that the SmartFD algorithm obtains the superlinear acceleration ratio on large scale datasets compared with the HyFD algorithm,and the acceleration ratio is 2.5?455.7 times compared with the HFDD algorithm.At the same time,SmartFD also shows near linear row scalability and node scalability.In addition,the growth trend of running time is almost the same as that of FD,hence SmartFD achieves better column scalability.

Keywords/Search Tags:

Big Data, Functional Dependency Discovery, Distributed Sampling, Distributed Verification

PDF Full Text Request

Related items

1	Spark-based Distributed Functional Dependency Discovery Algorithm
2	Functional Dependencies Discovery Based On Sampling
3	Research On General Parallelization Strategy For Functional Dependency Discovery Algorithms
4	Research On Micro Inconsistencies Of Data
5	Research On The Semantic-based Distributed Data And Service Unified Discovery
6	Research On The Key Technologies Of Distributed Big Data Consistency Management
7	Distributed Knowledge Discovery for Diverse Dat
8	Research On Techniques Of A Functional Point Model-based Semi-formal Functional Verification
9	Distributed Reasoning Research Based On Named Data Network
10	Research On Distributed Knowledge Discovery System Based Mobile Agent