Research On The Detection And Cleaning Of XML Similar Duplicate Data

Posted on:2019-06-23

Degree:Master

Type:Thesis

Country:China

Candidate:X D Yang

Full Text:PDF

GTID:2438330566490182

Subject:Computer Science and Technology

Abstract/Summary:

In recent years, the rapid development of Internet information technology has brought great convenience to individuals, enterprises, government departments, and society as a whole. A large amount of electronic data has been created, and the role of data in various fields has become more significant. As a typical form of semi-structured data, XML has broad application prospects because of its extensibility, its self-describing nature, and other characteristics. It has become the standard for data exchange and transmission in information systems.

Because the representation of XML data is flexible, the same real-world entity receives different XML descriptions when multiple users or applications describe it, which produces a large amount of similar duplicate data in the XML domain. This duplication generates redundant information, reduces data availability, and wastes storage space. Cleaning similar duplicate data is a current research hotspot in XML data quality, and the core of that cleaning is the detection and removal of duplicates. Existing methods still leave room for improvement in the detection efficiency and cleaning accuracy of XML duplicate data.

This dissertation studies the duplication problem in XML data. The research focuses on the detection and removal of duplicate data, with the aim of improving the accuracy and efficiency of XML duplicate data cleaning. The main work is as follows.

First, for the removal of duplicate data, the traditional Sorted Neighborhood Method (SNM) is optimized and the ICSNM method is proposed. Simulation experiments show that ICSNM improves on the original SNM in both efficiency and evaluation indicators, making data cleaning more accurate and efficient.

Second, for the detection of duplicate XML data, a recognition method that constructs a Bayesian network over the XML data is designed. When deciding whether two XML objects are duplicates, the method considers not only the duplication probability of their child nodes but also the duplication probability of all their descendants. Experiments show that the Bayesian network-based method achieves higher detection accuracy than the original method and detects similar duplicate data in the data set more reliably.

Third, building on this work, X-SNM, a method for cleaning XML duplicate data, is designed and proposed. A comparison with the DogmatiX method shows that X-SNM has clear advantages over DogmatiX in precision, recall, and time efficiency.
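For readers unfamiliar with the baseline, the classic Sorted Neighborhood Method that the abstract refers to can be sketched roughly as follows. This is a minimal illustration of plain SNM on flat string records, not the ICSNM or X-SNM methods of the dissertation, whose details the abstract does not give. The function name `sorted_neighborhood`, the sort key, the window size, the threshold, and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions made for the example.

```python
from difflib import SequenceMatcher


def sorted_neighborhood(records, key, window=3, threshold=0.85):
    """Return sorted index pairs of records judged similar by classic SNM.

    records:   list of strings (hypothetical flat records, not XML trees)
    key:       function mapping a record to its sort key (an assumption)
    window:    number of consecutive sorted records compared with each other
    threshold: minimum similarity ratio for a pair to count as a duplicate
    """
    # Step 1: sort record indices by key so likely duplicates become neighbors.
    order = sorted(range(len(records)), key=lambda i: key(records[i]))

    pairs = set()
    # Step 2: slide a fixed-size window over the sorted order.
    for pos, i in enumerate(order):
        # Compare each record only with the next (window - 1) records.
        for j in order[pos + 1 : pos + window]:
            # Step 3: score the pair; SequenceMatcher stands in for whatever
            # field-level similarity a real system would use.
            score = SequenceMatcher(None, records[i], records[j]).ratio()
            if score >= threshold:
                pairs.add((min(i, j), max(i, j)))
    return sorted(pairs)


if __name__ == "__main__":
    sample = [
        "John Smith, Beijing",
        "Jon Smith, Beijing",
        "Alice Wang, Shanghai",
        "John Smith, Beijng",
    ]
    for a, b in sorted_neighborhood(sample, key=str.lower):
        print(sample[a], "<->", sample[b])
```

The window bounds the number of comparisons to roughly the number of records times the window size, instead of comparing every pair. The cost is that duplicates whose keys sort far apart are missed, which is the kind of weakness that optimized variants of SNM generally try to address.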

Keywords/Search Tags:

Data Cleaning, XML Data, Duplication Detection, SNM, Bayesian Networks

Related items

1	Domain-independent de-duplication in data warehouse cleaning
2	Uncertainty Bayesian Networks Based On Data Cleaning
3	Research On Data Cleaning Based On Science And Technology Innovation Big Data Public Platform
4	Research On Data Cleaning Technology With The Design And Implementation Of Data Cleaning Framework
5	Research On Multi-source Heterogeneous Large Data Cleaning Technology Based On Machine Learning
6	An Analysis On Data Cleaning Algorithm And Its Application On Web Logs Processing
7	Research On Some Problems In Learning Bayesian Network
8	Research On Data Organization For Data De-duplication System
9	Research On Data Cleaning And Model Evaluation Based On Data Mining
10	Research On Technologies Of Duplicate Record Data Cleaning In Big Data Environment