
Functional Dependencies Discovery Based On Sampling

Posted on: 2020-11-02
Degree: Master
Type: Thesis
Country: China
Candidate: C X Gu
Full Text: PDF
GTID: 2428330578983459
Subject: Software engineering
Abstract/Summary:
In relational databases, functional dependency discovery is an important database analysis technique with wide application in knowledge discovery, database semantic analysis, data quality assessment, and database design. For traditional centralized data sets, functional dependency discovery has been studied thoroughly. With the arrival of the big data era, however, data volumes have grown geometrically and databases have grown rapidly in scale. Centralized data sets are constrained by factors such as physical equipment, and in some scenarios they can no longer meet application needs. Distributed database systems emerged in this context; they are more maintainable, more scalable, and more fault-tolerant than centralized databases. At the same time, distributed databases make data processing and management more complicated, and knowledge discovery methods designed for centralized databases do not carry over to them directly. Existing functional dependency discovery algorithms for distributed data sets produce correct results, but they still perform most verification centrally after migrating the data, which is inefficient. The main research content of this thesis is therefore parallel functional dependency discovery on distributed data sets.

The thesis pursues efficient functional dependency discovery in three ways:

(1) Sampling verification. Each candidate functional dependency is first verified on a sample data set held on the master node. If the candidate does not hold on the sample, it cannot hold on the complete data set, because the sample is drawn from the complete data set and any violation found in the sample is also a violation there. Such a candidate is rejected without global verification, which saves the communication, task assignment, and other overhead that global verification requires and thereby improves efficiency.

(2) Candidate generation. The Fk-1×Fk-1 method, originally used in frequent pattern mining, generates the candidate functional dependencies. It takes up less storage space than generation based on prefix tree records, which saves the time spent allocating and releasing storage and helps avoid running out of storage space.

(3) Distributed execution. A distributed functional dependency discovery algorithm is designed for the efficient distributed computing framework Spark, so that discovery runs on each node of the distributed data set and makes efficient use of each node's computing resources.

Finally, the experimental results show that the proposed framework is feasible and effective and can efficiently perform functional dependency discovery in distributed settings.
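
The sampling check and the candidate generation step described in the abstract can be illustrated with a minimal single-machine sketch. It is not the thesis's implementation: the function names `fd_holds`, `verify_with_sample`, and `merge_candidates` are hypothetical, rows are assumed to be plain dictionaries, the sample is assumed to be a subset of the full data, and the "global" check is a local stand-in for what would be a distributed Spark job.

```python
from itertools import combinations


def fd_holds(rows, lhs, rhs):
    """Return True if the functional dependency lhs -> rhs holds on rows."""
    seen = {}
    for row in rows:
        key = tuple(row[attr] for attr in lhs)
        value = row[rhs]
        # Two rows that agree on lhs but differ on rhs violate the dependency.
        if seen.setdefault(key, value) != value:
            return False
    return True


def verify_with_sample(rows, sample, lhs, rhs):
    """Check the sample first; a violation there rules out the full data set.

    Assumes `sample` is a subset of `rows`, otherwise the pruning is unsound.
    """
    if not fd_holds(sample, lhs, rhs):
        return False  # rejected without touching the full data
    # Local stand-in for the global (distributed) verification.
    return fd_holds(rows, lhs, rhs)


def merge_candidates(prev_level):
    """F(k-1) x F(k-1) join from frequent pattern mining.

    Merges two sorted (k-1)-attribute tuples that share their first k-2
    attributes into one sorted k-attribute candidate.
    """
    prev = sorted(prev_level)
    merged = []
    for a, b in combinations(prev, 2):
        if a[:-1] == b[:-1]:
            merged.append(a + (b[-1],))
    return merged


rows = [
    {"id": 1, "city": "Berlin", "zip": "10115", "country": "DE"},
    {"id": 2, "city": "Berlin", "zip": "10117", "country": "DE"},
    {"id": 3, "city": "Paris", "zip": "75001", "country": "FR"},
    {"id": 4, "city": "Paris", "zip": "75001", "country": "FR"},
]
sample = rows[:2]  # fixed slice for reproducibility; a real sample would be random

city_to_zip = verify_with_sample(rows, sample, ("city",), "zip")
city_to_country = verify_with_sample(rows, sample, ("city",), "country")
level_3 = merge_candidates([("A", "B"), ("A", "C"), ("B", "C")])
```

Here `city -> zip` is rejected on the two-row sample alone, so the full data is never scanned for it, while `city -> country` passes the sample and goes on to the full check. The join only needs the previous level's attribute sets in memory, which is the storage advantage over a prefix tree that the abstract refers to.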
Keywords/Search Tags: functional dependency, knowledge discovery, parallel computing