
Spark-based Distributed Functional Dependency Discovery Algorithm

Posted on: 2021-12-11 | Degree: Master | Type: Thesis
Country: China | Candidate: X Y Zhu | Full Text: PDF
GTID: 2518306512987829 | Subject: Software engineering
Abstract/Summary:
Functional dependency discovery is an important analysis technique for relational data and a key tool for data cleaning, quality assessment, and semantic analysis. It is widely used in distributed big data analysis. Correctly discovering the functional dependencies that hold in a data set is computationally expensive, and most existing discovery methods are centralized algorithms. In large enterprises, rapid growth in user data has led to wide use of distributed databases built on cloud computing platforms. Existing functional dependency discovery algorithms mainly target centralized data and do not apply to cloud data spread across different nodes. Gathering the distributed data onto a central node for unified processing cannot exploit the distributed cloud platform to speed up processing, while running a traditional centralized method separately on the data held by each node produces wrong results. Few distributed algorithms exist, and those that do suffer from problems such as high memory consumption and unbalanced load.

Several kinds of centralized functional dependency discovery algorithms, with different memory and processor consumption characteristics, have been proposed. Turning a centralized algorithm into a distributed one requires a distributed processing strategy designed carefully around the characteristics of that algorithm, so that accuracy is preserved and processing efficiency improves. Distributed versions have so far been proposed for only some of the centralized algorithms, and they still suffer from high memory consumption and unbalanced load. Moreover, existing distributed algorithms mainly run on the traditional MapReduce platform, whereas the Spark distributed computing platform uses in-memory computing to reduce the time spent storing intermediate data and to speed up processing. Therefore, building on existing centralized algorithms, this thesis proposes four distributed functional dependency discovery algorithms with different characteristics on the Spark platform. A series of distributed task processing strategies and optimization methods are proposed, and the effectiveness of the proposed algorithms is demonstrated by extensive experiments.

The main work is as follows:
(1) For the Spark platform, a distributed functional dependency discovery method based on search space traversal is proposed, with data migration and search strategies designed to balance the task load across nodes and to guarantee that the discovered functional dependencies are minimal. A distributed algorithm that does not redistribute the original data is also proposed, reducing both the amount of data migration and the running time.
(2) For the Spark platform, a distributed functional dependency discovery method based on consensus sets is proposed, with two task partitioning strategies and tuple deduplication strategies designed to guarantee parallel execution of tasks and to reduce the number of generated tuples.
(3) The proposed algorithms are tested and analyzed on multiple data sets, including synthetic and real data sets, to verify the correctness of the distributed algorithms and the effectiveness of the related optimization methods. In addition, all the proposed distributed functional dependency discovery algorithms are compared on several aspects, including time consumption, memory consumption, and data migration, and the characteristics and applicable scenarios of the different algorithms are described.
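The point that checking each node's data separately gives wrong results, and that some data migration is therefore needed, can be illustrated with a small sketch. It is written in plain Python rather than Spark, and the toy relation, the attribute names `zip` and `city`, and the helpers `holds` and `repartition_by` are hypothetical illustrations, not the thesis's algorithms.

```python
def holds(rows, lhs, rhs):
    """Return True if the functional dependency lhs -> rhs holds in rows."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        # Two rows that agree on lhs must also agree on rhs.
        if seen.setdefault(key, row[rhs]) != row[rhs]:
            return False
    return True


def repartition_by(rows, lhs, n):
    """Hash-partition rows on lhs so equal lhs values share a partition."""
    parts = [[] for _ in range(n)]
    for row in rows:
        key = tuple(row[a] for a in lhs)
        parts[hash(key) % n].append(row)
    return parts


# Hypothetical toy relation, stored on two nodes.
node_1 = [{"zip": "10001", "city": "New York"},
          {"zip": "94105", "city": "San Francisco"}]
node_2 = [{"zip": "10001", "city": "Newark"},
          {"zip": "60601", "city": "Chicago"}]

# Checking zip -> city on each node separately: it appears to hold everywhere.
local_results = [holds(part, ["zip"], "city") for part in (node_1, node_2)]

# Checking on the full relation: the two "10001" rows violate it.
global_result = holds(node_1 + node_2, ["zip"], "city")

# After migrating rows so that equal zip values are co-located,
# per-partition checks agree with the global answer.
shuffled = repartition_by(node_1 + node_2, ["zip"], 2)
shuffled_result = all(holds(part, ["zip"], "city") for part in shuffled)
```

Here the violating pair sits on different nodes, so neither local check can see it. Co-locating rows by the left-hand-side attributes repairs this for one candidate dependency, at the cost of moving data, which is the kind of trade-off between data migration and correctness that the strategies above have to manage.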
Keywords/Search Tags: data mining, functional dependency, Spark, distributed computing