Research On General Parallelization Strategy For Functional Dependency Discovery Algorithms

Posted on:2021-01-13

Degree:Master

Type:Thesis

Country:China

Candidate:P Z Wu

Full Text:PDF

GTID:2428330602999057

Subject:Computer software and theory

Abstract/Summary:

PDF Full Text Request

Functional dependencies(FDs)are important metadata that describe relationships among columns of datasets and can be used in a number of tasks,such as schema nor-malization,data cleansing.Various single-node or parallel distributed algorithms for FD discovery have been proposed.Previous single-node algorithms are efficient when datasets are small and stored locally,but lack the ability of parallelization.However,on the other hand,existing parallel distributed algorithms bring huge communication costs and thus perform not well enough.To solve this problem,in this dissertation,we do three parts of work as follows:First,we propose FD-Combine algorithm to solve "how-to-infer" problem.The most important costs of existing parallel algorithms are huge intermediate result com-munication costs.We try to use data parallelism to avoid intermediate result communi-cation,so that we face "how-to-infer" problem to achieve it.The "how-to-infer" prob-lem is how to infer the final FDs from part-FD sets of parts of records.We construct BL system and basic term expression to formalize the expressions and calculations of FD,non-FD,part-FD.In BL system,we propose a FD-Combine algorithm.When the small datasets meet certain conditions,FD-Combine can infer the final FDs from part-FD sets.Then we compare FD-Combine algorithm with other induction algorithms to get former comprehension,and we give implementations of FD-Combine and their time complexities.Second,we redesign and choose affine plane block design algorithm to slove "how-to-divide" problem."how-to-divide" problem is how to divide the original dataset into multiple smaller datasets,making sure that smaller datasets meet the FD-Combine con-ditions,and the datasets are as small as possible,the dividing algorithm is as efficient as possible.We change the problem to block design problem,and use data point to let each block design with construction can be dividing algorithm.Then,we design data rate criteria to measure the effectiveness of block designs.We finally choose affine plane block design algorithm for three reason:Its data rate is close to lower bound,it is easy to be implemented,and the number of its blocks is better in parallel computing.At last,we analyze its time complexity.Third,we propose a general parallel discovery strategy,consisting of FD-Combine algorithm and affine plane block design algorithm,to achieve data parallelism parallel computing.With our strategy,each single-node FD discovery algorithm can be directly parallelized without modification in distributed environments.We prove that in some conditions,the speedup of p threads will be((?),p).We use Spark to manage threads,evaluate this strategy with different algorithms and datasets.Evaluations show that,the speedups of column-efficient algorithms are usually nearly(?)or even exceed p/2.For example,in letter dataset,the speedups of FastFDs are 5.0 with 6 threads,10.5 with 20 threads.In our evaluation,the best multi-threaded algorithm HYFD also gets a significant improvement with our strategy.In lineitem dataset,with 54 threads,the speedup of original HYFD is 2.9 and it reaches 4.9 with our strategy.

Keywords/Search Tags:

FD Discovery, Data Parallelism, FD-Combine, Block Design, General Distributed Strategy

PDF Full Text Request

Related items

1	The Research And Implementation Of Key Techniques On Block Cipher ASIP
2	Research On Efficient Distributed Parallel Algorithm Of Deep Learning Framework Tensorflow
3	Optimization Of Distributed Training Strategies For Deep Learning Networks
4	Research On Continuous Fuzzing Technology With Large-Scale Distributed Parallel
5	Runtime Optimization For Large-Scale Neural-Network Data-Parallelism Training
6	Parallel Co-processor Architecture Research And Instrcution System Design For Block Ciphers
7	Research On The Semantic-based Distributed Data And Service Unified Discovery
8	Seeking parallelism in discovery programs
9	Designand Application Of Ge'ermu Gas-Steam Combine Circulation Powerplant
10	A unified framework for transparent parallelism and fault-tolerance in distributed systems