
Research On Data Cleaning Based On User Feedback

Posted on: 2014-11-11 | Degree: Master | Type: Thesis
Country: China | Candidate: H Xie | Full Text: PDF
GTID: 2268330422950609 | Subject: Computer Science and Technology
Abstract/Summary:
In today’s information age, vast amounts of data are generated, and a large amount of poor-quality data exists in database management systems. Most data cleaning methods focus on providing fully automated solutions that use different heuristics to generate data repairs, but they cannot guarantee the correctness of those repairs. In this paper, we propose an active learning model based on user feedback and apply it to inconsistency repair and multi-source truth discovery. Repair techniques based on user feedback not only yield more accurate machine learning models from minimal training sets, but also significantly improve the correctness of data repair.

For inconsistency repair, we present the design of a data cleaning framework that combines data quality rules (CFDs, CINDs and MDs) with user feedback through an interactive process. First, to generate candidate repairs for each potentially dirty attribute, we propose an optimization model based on a genetic algorithm. We then create a Bayesian machine learning model with several committees to predict the correctness of each repair, and rank the repairs by uncertainty score to improve the learned model. While inspecting the suggestions, the user provides feedback that is used to decide whether the model is accurate. Finally, our experiments on real-world datasets show a significant improvement in data quality.

For truth discovery, we first present a basic voting algorithm and then propose a naive truth discovery framework based on user feedback. The framework combines the active machine learning model based on user feedback with the voting algorithm: it generates candidate true values with the voting algorithm and confirms them through the active learning model, which greatly improves the accuracy of truth discovery. We then propose a more complete truth discovery framework that considers the trustworthiness of the data sources. Our experiments show that UFTF successfully finds true values among multiple data sources.
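
As a rough illustration of two ideas mentioned in the abstract, the sketch below ranks candidate repairs by committee disagreement and picks true values by (optionally trust-weighted) voting. It is not the thesis's implementation: the function names, the use of vote entropy as the uncertainty score, and the numeric trust weights are all assumptions made for this example.

```python
from collections import Counter, defaultdict
from math import log


def vote_entropy(votes):
    """Disagreement among committee votes: zero when unanimous, larger when split."""
    counts = Counter(votes)
    total = len(votes)
    return sum(-(c / total) * log(c / total) for c in counts.values())


def rank_repairs_for_feedback(committee_votes):
    """Order candidate repairs so the most contested ones reach the user first.

    committee_votes maps a repair id to the list of votes (e.g. True/False
    for "repair is correct") cast by the committee members (assumed format).
    """
    return sorted(
        committee_votes,
        key=lambda repair: vote_entropy(committee_votes[repair]),
        reverse=True,
    )


def weighted_vote(claims, trust=None):
    """Pick, per object, the value with the highest total source weight.

    claims maps source -> {object: value}. With trust=None every source
    counts as 1.0, which reduces to plain majority voting. The trust
    dictionary is a hypothetical stand-in for source trustworthiness.
    """
    scores = defaultdict(lambda: defaultdict(float))
    for source, values in claims.items():
        weight = 1.0 if trust is None else trust.get(source, 1.0)
        for obj, value in values.items():
            scores[obj][value] += weight
    return {obj: max(vals, key=vals.get) for obj, vals in scores.items()}
```

In an interactive loop under these assumptions, the top-ranked repairs would be shown to the user, and the confirmed or rejected labels would be added to the training set before the committee is retrained.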
Keywords/Search Tags: data cleaning, user feedback, machine learning, truth discovery