
Research On Functional Dependencies Mining Algorithm Based On Attribute Partition Information Gain

Posted on: 2020-11-11
Degree: Master
Type: Thesis
Country: China
Candidate: X Y Jiang
Full Text: PDF
GTID: 2428330590971748
Subject: Computer technology
Abstract/Summary:
As the Internet era advances, data has become an emerging means of production. The information systems of many industries now hold large amounts of data, especially relational data. These data often contain errors, which makes them difficult to use effectively, so effective strategies for correcting them are needed. Functional dependencies play an important role in repairing relational data. The functional dependency is an important concept in the relational model and can be used for pattern generalization, data cleansing, data repair, data integration and more. Functional dependency discovery on relational data has been studied for decades and various mining methods have been proposed, but some problems remain. For example, when mining functional dependencies in a database instance with a large number of attributes, algorithm speed is still not ideal: the time complexity of traditional discovery algorithms, such as DFD with its depth-first traversal, grows exponentially.

To address this problem, this thesis proposes the concept of attribute partition information gain and combines the original DFD functional dependency discovery algorithm with the focused sampling method of the HYFD algorithm. First, a list of information gains between attribute partitions is used to improve the random walk strategy that selects the next node in the original DUCC algorithm, so as to find the unique attribute combinations (MUC). The dataset is then sampled with the focused sampling method to obtain non-functional dependencies. Finally, the routes through single-attribute primary key nodes, non-single-attribute primary key nodes, and non-functional dependency nodes are pruned, and the starting route of the original DFD algorithm is selected with reference to the information gain list, so that the improved algorithm is theoretically superior to the original.

The thesis validates the algorithm on the public datasets under Metanome and develops an Excel plugin that can automatically detect and repair data. The experimental results show that the functional dependency mining algorithm based on attribute partition information gain is faster than the original DFD. When the dataset has many records and many attributes, the improved algorithm is more robust than the original. In addition, owing to focused sampling, the improved algorithm consumes less memory than the original DFD algorithm on larger datasets.
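
The abstract does not define attribute partition information gain, so the following Python sketch is only an illustration under stated assumptions, not the thesis's algorithm. It assumes the standard entropy-based gain H(A) - H(A | X) computed over attribute partitions, and it checks a functional dependency X -> A by testing whether adding A splits any group of the partition on X. All function names and the sample rows are hypothetical.

```python
from collections import defaultdict
from math import log2


def partition(rows, attrs):
    """Group row indices by their values on attrs (an attribute partition)."""
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        groups[tuple(row[a] for a in attrs)].append(i)
    return list(groups.values())


def holds(rows, lhs, rhs):
    """lhs -> rhs holds iff adding rhs does not split any group of lhs."""
    return len(partition(rows, lhs)) == len(partition(rows, list(lhs) + [rhs]))


def entropy(rows, attrs):
    """Shannon entropy (in bits) of the partition induced by attrs."""
    n = len(rows)
    return -sum(len(g) / n * log2(len(g) / n) for g in partition(rows, attrs))


def information_gain(rows, lhs, rhs):
    """Assumed definition: H(rhs) - H(rhs | lhs),
    where H(rhs | lhs) = H(lhs + rhs) - H(lhs)."""
    conditional = entropy(rows, list(lhs) + [rhs]) - entropy(rows, lhs)
    return entropy(rows, [rhs]) - conditional


# Hypothetical sample relation.
rows = [
    {"zip": "10001", "city": "New York", "name": "Ann"},
    {"zip": "10001", "city": "New York", "name": "Bob"},
    {"zip": "94105", "city": "San Francisco", "name": "Ann"},
    {"zip": "94105", "city": "San Francisco", "name": "Cid"},
]

# Rank candidate left-hand sides for the target attribute by gain,
# in the spirit of an information gain list guiding which node to visit first.
ranked = sorted(
    ["zip", "name"],
    key=lambda a: information_gain(rows, [a], "city"),
    reverse=True,
)
```

In this sample, `zip -> city` holds and `zip` carries the full gain for `city`, so it is ranked ahead of `name`. A higher gain only suggests a more promising candidate; the partition check in `holds` is what decides whether the dependency is actually valid on the data.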
Keywords/Search Tags:functional dependency, attribute partition, information gain, relational database