The Key Issue On Data Cleaning In Web Data Integration

Posted on: 2010-03-15 | Degree: Master | Type: Thesis
Country: China | Candidate: H J Zhang | Full Text: PDF
GTID: 2178360278473280 | Subject: Computer software and theory
Abstract/Summary:
With the rapid development of the Internet, the network has become an important means of information dissemination and exchange, and abundant data sources have appeared on the Web. To enable better information sharing, Web data integration has become a very active topic in data management and related fields. On the other hand, because Web data is semi-structured, autonomous, and rapidly updated, a great deal of "dirty data" enters data integration and seriously undermines the credibility and usability of the integrated data. How to clean this data during Web data integration is therefore a new challenge for researchers. Based on the above analysis, this paper studies the key problems of data cleaning in Web data integration.

First, the paper introduces the definitions, basic principles, operating procedure, and evaluation criteria of data cleaning, along with the shortcomings of current cleaning tools. It then surveys data cleaning technology, studying the methods and processes for cleaning incomplete data, abnormal data, and duplicate records.

Next, building on an analysis of existing duplicate detection algorithms, the paper presents a method for detecting approximately duplicate database records based on weight ranking. Each attribute of a record is assigned a weight according to the rank-based weighting method. Following this ranking, a key field (or some words of that field) is chosen to divide the large data set into many non-intersecting small data sets, and approximately duplicate records are detected and eliminated within each small set; the steps are then repeated with other key fields or other words of the field, as sketched below. Experiments show that the algorithm achieves good detection precision as well as better time efficiency.

Finally, according to the characteristics of Web data in Web data integration, a Web-oriented data cleaning framework is presented. The framework mainly exploits the characteristics of XML to preprocess data for cleaning by mapping XML to a database, which decomposes the data into elements, standardizes it, and improves the efficiency of data cleaning. The framework then applies the duplicate-record cleaning algorithm studied above to the data filtered from Web information extraction in order to detect duplicate records, and experimental results and analysis are reported.
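The following is a minimal sketch of the weight-ranked, multi-pass duplicate detection described above. The field names, weights, key fields, prefix length, and the 0.8 match threshold are illustrative assumptions, not values taken from the thesis, and the per-field similarity measure used here may differ from the thesis's.

from difflib import SequenceMatcher
from itertools import combinations

FIELD_WEIGHTS = {"name": 0.5, "address": 0.3, "phone": 0.2}  # assumed field ranking
MATCH_THRESHOLD = 0.8                                        # assumed similarity cutoff

def field_similarity(a, b):
    """Character-level similarity of two field values, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def record_similarity(r1, r2):
    """Weighted sum of per-field similarities; weights sum to 1."""
    return sum(w * field_similarity(r1[f], r2[f]) for f, w in FIELD_WEIGHTS.items())

def block_by_key(records, key_field, prefix_len=3):
    """Partition the data set into non-intersecting blocks by a key-field prefix."""
    blocks = {}
    for rec in records:
        blocks.setdefault(rec[key_field][:prefix_len].lower(), []).append(rec)
    return blocks.values()

def detect_duplicates(records, key_fields=("name", "address")):
    """Multi-pass detection: repeat the block-and-compare step per key field."""
    duplicates = set()
    for key in key_fields:
        for block in block_by_key(records, key):
            for r1, r2 in combinations(block, 2):
                if record_similarity(r1, r2) >= MATCH_THRESHOLD:
                    duplicates.add((r1["id"], r2["id"]))
    return duplicates

Partitioning on a key-field prefix confines the pairwise comparisons to small blocks, and repeating the pass with a second key field catches duplicates that the first partition happened to separate.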
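Likewise, a minimal sketch of the framework's XML-mapping preprocessing step, assuming the extracted Web data arrives as flat <record> elements; the element names and the table schema are hypothetical, not the thesis's actual mapping.

import sqlite3
import xml.etree.ElementTree as ET

def load_xml_records(xml_text, conn):
    """Map each <record> element to a relational row so the cleaning
    algorithms can run over uniform, element-per-column data."""
    conn.execute("CREATE TABLE IF NOT EXISTS records (id TEXT, name TEXT, address TEXT)")
    root = ET.fromstring(xml_text)
    for rec in root.iter("record"):
        row = tuple((rec.findtext(tag) or "").strip() for tag in ("id", "name", "address"))
        conn.execute("INSERT INTO records VALUES (?, ?, ?)", row)
    conn.commit()

conn = sqlite3.connect(":memory:")
load_xml_records("<data><record><id>1</id><name>Zhang</name>"
                 "<address>Beijing</address></record></data>", conn)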
Keywords/Search Tags:Data Cleaning, Data Integration, Duplicate Detection, XML