Font Size: a A A

Research On General Parallelization Strategy For Functional Dependency Discovery Algorithms

Posted on:2021-01-13Degree:MasterType:Thesis
Country:ChinaCandidate:P Z WuFull Text:PDF
GTID:2428330602999057Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
Functional dependencies(FDs)are important metadata that describe relationships among columns of datasets and can be used in a number of tasks,such as schema nor-malization,data cleansing.Various single-node or parallel distributed algorithms for FD discovery have been proposed.Previous single-node algorithms are efficient when datasets are small and stored locally,but lack the ability of parallelization.However,on the other hand,existing parallel distributed algorithms bring huge communication costs and thus perform not well enough.To solve this problem,in this dissertation,we do three parts of work as follows:First,we propose FD-Combine algorithm to solve "how-to-infer" problem.The most important costs of existing parallel algorithms are huge intermediate result com-munication costs.We try to use data parallelism to avoid intermediate result communi-cation,so that we face "how-to-infer" problem to achieve it.The "how-to-infer" prob-lem is how to infer the final FDs from part-FD sets of parts of records.We construct BL system and basic term expression to formalize the expressions and calculations of FD,non-FD,part-FD.In BL system,we propose a FD-Combine algorithm.When the small datasets meet certain conditions,FD-Combine can infer the final FDs from part-FD sets.Then we compare FD-Combine algorithm with other induction algorithms to get former comprehension,and we give implementations of FD-Combine and their time complexities.Second,we redesign and choose affine plane block design algorithm to slove "how-to-divide" problem."how-to-divide" problem is how to divide the original dataset into multiple smaller datasets,making sure that smaller datasets meet the FD-Combine con-ditions,and the datasets are as small as possible,the dividing algorithm is as efficient as possible.We change the problem to block design problem,and use data point to let each block design with construction can be dividing algorithm.Then,we design data rate criteria to measure the effectiveness of block designs.We finally choose affine plane block design algorithm for three reason:Its data rate is close to lower bound,it is easy to be implemented,and the number of its blocks is better in parallel computing.At last,we analyze its time complexity.Third,we propose a general parallel discovery strategy,consisting of FD-Combine algorithm and affine plane block design algorithm,to achieve data parallelism parallel computing.With our strategy,each single-node FD discovery algorithm can be directly parallelized without modification in distributed environments.We prove that in some conditions,the speedup of p threads will be((?),p).We use Spark to manage threads,evaluate this strategy with different algorithms and datasets.Evaluations show that,the speedups of column-efficient algorithms are usually nearly(?)or even exceed p/2.For example,in letter dataset,the speedups of FastFDs are 5.0 with 6 threads,10.5 with 20 threads.In our evaluation,the best multi-threaded algorithm HYFD also gets a significant improvement with our strategy.In lineitem dataset,with 54 threads,the speedup of original HYFD is 2.9 and it reaches 4.9 with our strategy.
Keywords/Search Tags:FD Discovery, Data Parallelism, FD-Combine, Block Design, General Distributed Strategy
PDF Full Text Request
Related items