
Improve Repairing Data Process By Using Multiple Based-rules

Posted on: 2020-09-26
Degree: Master
Type: Thesis
Country: China
Candidate: Y Feng
Full Text: PDF
GTID: 2428330596998040
Subject: Software engineering
Abstract/Summary:
With the development of technology and computation theory and the arrival of the big data era, improving data quality is drawing increasing attention. Data quality is determined by data consistency and data accuracy. It directly affects the data processing and data analysis built on a database and, in turn, the conclusions drawn from the data. For example, when scientists work with data collected from websites, the accuracy of the data and of its association with users is fundamental to the analysis. If they obtain inaccurate data whose quality cannot be assured, they cannot reach credible conclusions, even with multiple tools. In the economy, data quality directly or indirectly influences an industry's income. Maintaining the data in a database regularly helps keep data quality high.

The research in this paper is based on a dataset from SSG, a company that handles products between producers and consumers. It considers the references of specific columns, their complex structure, and the features shared between data items, and it gives a solution for improving the quality of similar data in relational databases. The solution is inspired by traditional repair approaches that use rules to process data. Functional dependencies, a technique used in database design, are extended to conditional functional dependencies (CFDs), which serve as new rules that constrain the tuples. Combining the traditional rules with these dependencies improves the efficiency and speed of the repair process. The method is designed with real-world data in mind. In addition, based on machine learning models, this paper gives a solution for data with missing values, which previous research did not discuss, by presenting a method that takes the relational rules into account.

The main contributions of the paper are as follows:

1) To deal with the data repair problem, this paper proposes a method based on multiple rules. To make the search for rules faster, the method relies not only on regular expressions as repair rules but also on the relationships between different columns. It improves the TANE algorithm to obtain CFDs, and it combines regular expressions with CFDs to form the repair rules. Adopting the ideas of the RSR algorithm on the use of finite automata and the choice of operators, the method gives a better solution for repairing the data. Tests on the company data report results at different frequencies and the repair result at a given frequency. The results show that the method repairs the data efficiently and improves speed.

2) To deal with missing values, this paper proposes a method based on a neural network model that considers the related rules and the connections between tuples with missing values and tuples without them. Experiments comparing it with other models show that the method solves this problem efficiently.
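
To illustrate how a regular-expression rule and a CFD can work together as repair rules, here is a minimal Python sketch. It is not the method of the thesis: it does not use the improved TANE discovery or the finite automata of RSR, and it swaps in a simple majority vote to pick the repaired value. The rows, the column names (country, area_code, city), the rule AREA_CODE_RULE and both function names are hypothetical and chosen only for the example.

```python
import re
from collections import Counter, defaultdict

# Hypothetical rows; column names and values are illustrative only.
ROWS = [
    {"id": 1, "country": "CN", "area_code": "010", "city": "Beijing"},
    {"id": 2, "country": "CN", "area_code": "010", "city": "Beijing"},
    {"id": 3, "country": "CN", "area_code": "010", "city": "Shanghai"},     # breaks the CFD
    {"id": 4, "country": "CN", "area_code": "0x1", "city": "Shanghai"},     # breaks the regex
    {"id": 5, "country": "US", "area_code": "010", "city": "Springfield"},  # outside the condition
]

# Assumed format rule: an area code is exactly three digits.
AREA_CODE_RULE = re.compile(r"\d{3}")


def regex_violations(rows, column, pattern):
    """Return the ids of rows whose value in `column` does not fully match `pattern`."""
    return [r["id"] for r in rows if not pattern.fullmatch(r[column])]


def cfd_repairs(rows, condition, lhs, rhs):
    """Check the CFD "among rows matching `condition`, `lhs` determines `rhs`".

    Returns {row id: suggested value} for rows that disagree with the most
    common `rhs` value in their group. Majority vote is an assumption of this
    sketch; ties go to the value seen first.
    """
    groups = defaultdict(list)
    for r in rows:
        # The condition is the "conditional" part: the dependency is only
        # required to hold on rows that match it.
        if all(r[col] == val for col, val in condition.items()):
            groups[tuple(r[col] for col in lhs)].append(r)

    repairs = {}
    for group in groups.values():
        counts = Counter(r[rhs] for r in group)
        if len(counts) > 1:  # same lhs values, different rhs values
            majority, _ = counts.most_common(1)[0]
            for r in group:
                if r[rhs] != majority:
                    repairs[r["id"]] = majority
    return repairs


bad_format = regex_violations(ROWS, "area_code", AREA_CODE_RULE)
suggested = cfd_repairs(ROWS, {"country": "CN"}, ["area_code"], "city")
print(bad_format)
print(suggested)
```

In this sketch the regular expression catches a malformed value within a single column, while the CFD catches a value that is well formed but inconsistent with other tuples, which is why the two kinds of rule complement each other.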
Keywords/Search Tags:data quality, regular expression, conditional functional dependency, RSR algorithm, machine learning