The Key Issue On Data Cleaning In Web Data Integration

Posted on: 2010-03-15 | Degree: Master | Type: Thesis
Country: China | Candidate: H J Zhang | Full Text: PDF
GTID: 2178360278473280 | Subject: Computer software and theory
Abstract/Summary:
With the rapid development of the Internet, the network has become an important means of information dissemination and exchange, and abundant data sources have appeared on the Web. To enable better information sharing, Web data integration has become a very active topic in data management and related fields. On the other hand, because Web data is semi-structured, autonomous, and rapidly updated, a great deal of "dirty data" enters data integration and seriously undermines the credibility and usability of the integrated data. How to clean this data during Web data integration is therefore a new challenge for researchers. Based on the above analysis, this paper studies the key problems of data cleaning in Web data integration.

First, the paper introduces the definitions, basic principles, operating procedure, and evaluation criteria of data cleaning, along with the shortcomings of current cleaning tools. It then surveys data cleaning technology, studying the methods and processes for cleaning incomplete data, abnormal data, and duplicate records.

Next, building on an analysis of existing duplicate detection algorithms, the paper presents a method for detecting approximately duplicate database records based on weight ranking. Each attribute of a record is assigned a weight according to the rank-based weighting method. Following this ranking, a key field (or some words of that field) is chosen to divide the large data set into many non-intersecting small data sets, and approximately duplicate records are detected and eliminated within each small set; the steps are then repeated with other key fields or other words of the field, as sketched below. Experiments show that the algorithm achieves good detection precision as well as better time efficiency.

Finally, according to the characteristics of Web data in Web data integration, a Web-oriented data cleaning framework is presented. The framework mainly exploits the characteristics of XML to preprocess data for cleaning by mapping XML to a database, which decomposes the data into elements, standardizes it, and improves the efficiency of data cleaning. The framework then applies the duplicate-record cleaning algorithm studied above to the data filtered from Web information extraction in order to detect duplicate records, and experimental results and analysis are reported.
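The following is a minimal sketch of the weight-ranked, multi-pass duplicate detection described above. The field names, weights, key fields, prefix length, and the 0.8 match threshold are illustrative assumptions, not values taken from the thesis, and the per-field similarity measure used here may differ from the thesis's.

from difflib import SequenceMatcher
from itertools import combinations

FIELD_WEIGHTS = {"name": 0.5, "address": 0.3, "phone": 0.2}  # assumed field ranking
MATCH_THRESHOLD = 0.8                                        # assumed similarity cutoff

def field_similarity(a, b):
    """Character-level similarity of two field values, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def record_similarity(r1, r2):
    """Weighted sum of per-field similarities; weights sum to 1."""
    return sum(w * field_similarity(r1[f], r2[f]) for f, w in FIELD_WEIGHTS.items())

def block_by_key(records, key_field, prefix_len=3):
    """Partition the data set into non-intersecting blocks by a key-field prefix."""
    blocks = {}
    for rec in records:
        blocks.setdefault(rec[key_field][:prefix_len].lower(), []).append(rec)
    return blocks.values()

def detect_duplicates(records, key_fields=("name", "address")):
    """Multi-pass detection: repeat the block-and-compare step per key field."""
    duplicates = set()
    for key in key_fields:
        for block in block_by_key(records, key):
            for r1, r2 in combinations(block, 2):
                if record_similarity(r1, r2) >= MATCH_THRESHOLD:
                    duplicates.add((r1["id"], r2["id"]))
    return duplicates

Partitioning on a key-field prefix confines the pairwise comparisons to small blocks, and repeating the pass with a second key field catches duplicates that the first partition happened to separate.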
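Likewise, a minimal sketch of the framework's XML-mapping preprocessing step, assuming the extracted Web data arrives as flat <record> elements; the element names and the table schema are hypothetical, not the thesis's actual mapping.

import sqlite3
import xml.etree.ElementTree as ET

def load_xml_records(xml_text, conn):
    """Map each <record> element to a relational row so the cleaning
    algorithms can run over uniform, element-per-column data."""
    conn.execute("CREATE TABLE IF NOT EXISTS records (id TEXT, name TEXT, address TEXT)")
    root = ET.fromstring(xml_text)
    for rec in root.iter("record"):
        row = tuple((rec.findtext(tag) or "").strip() for tag in ("id", "name", "address"))
        conn.execute("INSERT INTO records VALUES (?, ?, ?)", row)
    conn.commit()

conn = sqlite3.connect(":memory:")
load_xml_records("<data><record><id>1</id><name>Zhang</name>"
                 "<address>Beijing</address></record></data>", conn)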
Keywords/Search Tags:Data Cleaning, Data Integration, Duplicate Detection, XML