
The Research Of Data Cleaning In Web Information Integration

Posted on: 2008-08-29
Degree: Master
Type: Thesis
Country: China
Candidate: H Liu
Full Text: PDF
GTID: 2178360215974396
Subject: Computer application technology
Abstract/Summary:
The demand for data cleaning has a long history, and data cleaning technology remains an active research topic in data management and related fields. This thesis studies how to handle "dirty data" in web information integration, focusing on duplicate record detection and the associated algorithms, and presents a solution that eliminates dirty data and ensures the quality of the integrated data.

The dissertation first discusses definitions of data quality and related concepts, summarizes the theory and methods of data cleaning technology, and proposes evaluation criteria. Against the general steps of data cleaning, two cleaning frameworks are presented: one is domain-independent and based on metadata; the other is domain-dependent and based on domain knowledge. The dissertation also introduces cleaning techniques for incomplete data, abnormal data, and duplicate records, and finally gives definitions and examples of data cleaning, the general steps of the cleaning process, the basic workflow, and the applicable methods.

The dissertation then studies the key algorithms involved in each step of duplicate record cleaning, mainly including a field matching algorithm based on edit distance, the Pair-Wise algorithm for record matching, and the SNM (Sorted Neighborhood Method) algorithm for duplicate record detection. The basic theory and complexity of each algorithm are introduced, and an improved SNM algorithm is proposed. Rules for merging and deleting duplicate records are also introduced.

According to the characteristics of Web data in Web information integration, a Web-oriented data cleaning framework is presented. The framework uses XML to preprocess the data before cleaning: once the XML is mapped to the database, the data become well-structured, standardized elements, which improves the efficiency of data cleaning. The framework then applies the duplicate record cleaning algorithms studied above to the data obtained from Web information extraction in order to detect duplicate records, and the experimental results and their analysis are presented.

Finally, the dissertation presents a duplicate record detection method for Chinese text. Based on the characteristics of Chinese, the method segments Chinese words and matches them semantically, improving the efficiency of record matching.

Data cleaning has developed considerably in the data warehouse field, but researchers at home and abroad have not yet presented a general Web-oriented data cleaning framework. Because of the characteristics of Web data, Web-based data cleaning differs from cleaning over relational databases, and concepts such as XML keys and XML comparability have been proposed abroad. As Web information integration develops, Web-based data cleaning will receive more and more attention.
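As an illustration of the edit-distance field matching step mentioned above, the sketch below computes the classic Levenshtein edit distance by dynamic programming and declares two field values a match when the length-normalized distance stays under a threshold. The function names and the 0.25 threshold are illustrative assumptions, not taken from the thesis.

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming: O(len(a)*len(b))."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n]

def fields_match(a: str, b: str, threshold: float = 0.25) -> bool:
    """Fields match when the edit distance, normalized by the longer
    field's length, does not exceed the threshold (0.25 is illustrative)."""
    if not a and not b:
        return True
    return edit_distance(a, b) / max(len(a), len(b)) <= threshold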
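Likewise, a minimal sketch of the basic SNM idea: records are sorted by a key built from discriminating fields, then a fixed-size window slides over the sorted list so that only records inside the window are compared pairwise, giving roughly O(n * window) comparisons instead of the O(n^2) of exhaustive Pair-Wise matching. The key construction, window size, and sample data are illustrative assumptions (the thesis's improved SNM variant is not reproduced), and the usage example reuses the fields_match helper from the previous sketch.

from typing import Callable, Dict, List, Tuple

Record = Dict[str, str]

def snm_duplicates(records: List[Record],
                   sort_key: Callable[[Record], str],
                   is_match: Callable[[Record, Record], bool],
                   window: int = 10) -> List[Tuple[int, int]]:
    """Basic Sorted Neighborhood Method: sort once on a key, then compare
    each record only with its neighbors inside a sliding window."""
    order = sorted(range(len(records)), key=lambda i: sort_key(records[i]))
    pairs = []
    for pos, i in enumerate(order):
        for j in order[pos + 1 : pos + window]:
            if is_match(records[i], records[j]):
                pairs.append((i, j))  # indices of a suspected duplicate pair
    return pairs

# Illustrative usage: sort on a name prefix, match via fields_match above.
people = [{"name": "Liu Hua", "city": "Beijing"},
          {"name": "Liu Huah", "city": "Beijing"},
          {"name": "Wang Li", "city": "Xi'an"}]
dups = snm_duplicates(people,
                      sort_key=lambda r: r["name"][:4].lower(),
                      is_match=lambda a, b: fields_match(a["name"], b["name"]))
print(dups)  # expected: [(0, 1)]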
Keywords/Search Tags: information integration, web data, data cleaning, duplicate records