
Research On Data Cleaning Using Web Information

Posted on: 2014-01-25   Degree: Master   Type: Thesis
Country: China   Candidate: Y K Li   Full Text: PDF
GTID: 2268330422451692   Subject: Computer technology
Abstract/Summary:
As time goes by, more and more data is produced and stored in information systems, in some cases reaching the TB or PB level. Factors such as outdated values, wrong values, duplication, and conflicts degrade the quality of this data. Existing data cleaning methods are based on known FDs/CFDs (functional dependencies and conditional functional dependencies), but in practice not all FDs/CFDs are known in advance. To solve these data quality problems and overcome this drawback of current approaches, a Web-based data cleaning method is proposed in this paper.

The vast amount of information on the Internet can be used to support data cleaning. When a tuple is checked against the quality requirements, text patterns can be obtained from the Internet based on the relational data to support the judgment. The information on the Internet can also be used to clean the data that has quality problems.

To build the Web-based data cleaning framework, definitions of data quality problems, data cleaning contents, and Web-based data cleaning techniques are given first. The data is then classified: the correct tuples are used to search for text patterns, and the data with quality problems is cleaned using those text patterns and the information on the Web.

The Web-based data cleaning framework consists of three parts. The first is data quality problem detection, the second is text pattern acquisition, and the last is data cleaning based on the text patterns. The detection part interacts with the other two. Text pattern acquisition comprises keyword generation and pattern generation. Data cleaning comprises feasible tuple selection, tuple cleaning, and other strategies.

Finally, extensive experiments were conducted to verify the efficiency and effectiveness of the algorithms proposed in this paper.
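The three parts described above can be pictured with a small toy sketch. It is only an illustration under strong assumptions, not the algorithm of the thesis: the hard-coded `SNIPPETS` list stands in for Web search results, a "text pattern" is taken to be simply the text found between the two values of a trusted tuple, and the names `learn_patterns`, `propose_value`, and `clean_tuple` are hypothetical.

```python
import re
from collections import Counter

# Hypothetical stand-in for Web search results; a real system would
# generate keywords from the tuple and query a search engine instead.
SNIPPETS = [
    "Paris is the capital of France.",
    "Berlin is the capital of Germany.",
    "Rome is the capital of Italy.",
    "Madrid is the capital of Spain and its largest city.",
]


def learn_patterns(correct_tuples, snippets):
    """Text pattern acquisition: collect the text that appears between the
    two values of a trusted (value, key) tuple, counting how often each occurs."""
    patterns = Counter()
    for value, key in correct_tuples:
        for s in snippets:
            i, j = s.find(value), s.find(key)
            if i != -1 and j != -1 and i + len(value) <= j:
                patterns[s[i + len(value):j]] += 1
    return patterns


def propose_value(key, patterns, snippets):
    """Match every learned pattern against the snippets and return the
    best-supported value for `key`, or None if nothing matches."""
    votes = Counter()
    for middle, weight in patterns.items():
        regex = re.compile(r"(\w+)" + re.escape(middle) + re.escape(key) + r"\b")
        for s in snippets:
            m = regex.search(s)
            if m:
                votes[m.group(1)] += weight
    return votes.most_common(1)[0][0] if votes else None


def clean_tuple(tup, patterns, snippets):
    """Detection plus cleaning: keep the tuple if the snippets agree with it
    (or say nothing about it); otherwise replace the value."""
    value, key = tup
    proposed = propose_value(key, patterns, snippets)
    if proposed is None or proposed == value:
        return tup
    return (proposed, key)


if __name__ == "__main__":
    trusted = [("Paris", "France"), ("Berlin", "Germany")]
    patterns = learn_patterns(trusted, SNIPPETS)
    print(clean_tuple(("Barcelona", "Spain"), patterns, SNIPPETS))
```

In this sketch the trusted tuples play the role of the correct tuples used to search for text patterns, and the suspect tuple is repaired by voting over pattern matches. Keyword generation, feasible tuple selection, and any handling of noisy or conflicting Web text are left out.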
Keywords/Search Tags: data cleaning, entity resolution, Web information, text pattern dependency