
The Research Of Data Cleaning In Web Information Integration

Posted on: 2008-08-29
Degree: Master
Type: Thesis
Country: China
Candidate: H Liu
Full Text: PDF
GTID: 2178360215974396
Subject: Computer application technology
Abstract/Summary:
The demand for data cleaning has a long history, and data cleaning technology remains an active research topic in data management and related fields. This thesis studies how to handle "dirty data" in web information integration, focusing on duplicate record detection and the associated algorithms, and presents a solution that eliminates dirty data and ensures the quality of the integrated data.

The dissertation first discusses definitions of data quality and related concepts, summarizes the theory and methods of data cleaning technology, and proposes evaluation criteria. Against the general steps of data cleaning, two cleaning frameworks are presented: one is domain-independent and based on metadata; the other is domain-dependent and based on domain knowledge. The dissertation also introduces cleaning techniques for incomplete data, abnormal data, and duplicate records, and finally gives definitions and examples of data cleaning, the general steps of the cleaning process, the basic workflow, and the applicable methods.

The dissertation then studies the key algorithms involved in each step of duplicate record cleaning, mainly including a field matching algorithm based on edit distance, the Pair-Wise algorithm for record matching, and the SNM (Sorted Neighborhood Method) algorithm for duplicate record detection. The basic theory and complexity of each algorithm are introduced, and an improved SNM algorithm is proposed. Rules for merging and deleting duplicate records are also introduced.

According to the characteristics of Web data in Web information integration, a Web-oriented data cleaning framework is presented. The framework uses XML to preprocess the data before cleaning: once the XML is mapped to the database, the data become well-structured, standardized elements, which improves the efficiency of data cleaning. The framework then applies the duplicate record cleaning algorithms studied above to the data obtained from Web information extraction in order to detect duplicate records, and the experimental results and their analysis are presented.

Finally, the dissertation presents a duplicate record detection method for Chinese text. Based on the characteristics of Chinese, the method segments Chinese words and matches them semantically, improving the efficiency of record matching.

Data cleaning has developed considerably in the data warehouse field, but researchers at home and abroad have not yet presented a general Web-oriented data cleaning framework. Because of the characteristics of Web data, Web-based data cleaning differs from cleaning over relational databases, and concepts such as XML keys and XML comparability have been proposed abroad. As Web information integration develops, Web-based data cleaning will receive more and more attention.
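As an illustration of the edit-distance field matching step mentioned above, the sketch below computes the classic Levenshtein edit distance by dynamic programming and declares two field values a match when the length-normalized distance stays under a threshold. The function names and the 0.25 threshold are illustrative assumptions, not taken from the thesis.

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming: O(len(a)*len(b))."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n]

def fields_match(a: str, b: str, threshold: float = 0.25) -> bool:
    """Fields match when the edit distance, normalized by the longer
    field's length, does not exceed the threshold (0.25 is illustrative)."""
    if not a and not b:
        return True
    return edit_distance(a, b) / max(len(a), len(b)) <= threshold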
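Likewise, a minimal sketch of the basic SNM idea: records are sorted by a key built from discriminating fields, then a fixed-size window slides over the sorted list so that only records inside the window are compared pairwise, giving roughly O(n * window) comparisons instead of the O(n^2) of exhaustive Pair-Wise matching. The key construction, window size, and sample data are illustrative assumptions (the thesis's improved SNM variant is not reproduced), and the usage example reuses the fields_match helper from the previous sketch.

from typing import Callable, Dict, List, Tuple

Record = Dict[str, str]

def snm_duplicates(records: List[Record],
                   sort_key: Callable[[Record], str],
                   is_match: Callable[[Record, Record], bool],
                   window: int = 10) -> List[Tuple[int, int]]:
    """Basic Sorted Neighborhood Method: sort once on a key, then compare
    each record only with its neighbors inside a sliding window."""
    order = sorted(range(len(records)), key=lambda i: sort_key(records[i]))
    pairs = []
    for pos, i in enumerate(order):
        for j in order[pos + 1 : pos + window]:
            if is_match(records[i], records[j]):
                pairs.append((i, j))  # indices of a suspected duplicate pair
    return pairs

# Illustrative usage: sort on a name prefix, match via fields_match above.
people = [{"name": "Liu Hua", "city": "Beijing"},
          {"name": "Liu Huah", "city": "Beijing"},
          {"name": "Wang Li", "city": "Xi'an"}]
dups = snm_duplicates(people,
                      sort_key=lambda r: r["name"][:4].lower(),
                      is_match=lambda a, b: fields_match(a["name"], b["name"]))
print(dups)  # expected: [(0, 1)]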
Keywords/Search Tags: information integration, web data, data cleaning, duplicate records