
Research On The Detection And Cleaning Of XML Similar Duplicate Data

Posted on: 2019-06-23
Degree: Master
Type: Thesis
Country: China
Candidate: X D Yang
Full Text: PDF
GTID: 2438330566490182
Subject: Computer Science and Technology
Abstract/Summary:
In recent years, the rapid development of Internet information technology has brought great convenience to individuals, enterprises, government departments, and society as a whole. A large amount of electronic data has been created, and the role of data in various fields has become more significant. As a typical form of semi-structured data, XML has broad application prospects because it is extensible and self-describing, and it has become the standard for data exchange and transmission in information systems.

Because XML allows flexible data representation, the same real-world entity receives different XML descriptions when multiple users or applications describe it, which produces a large amount of similar duplicate data. This redundancy reduces data availability and wastes storage space. Cleaning similar duplicate data is currently a focus of research on XML data quality, and the core of that cleaning is the detection and removal of duplicates. Existing methods still leave room to improve the detection efficiency and cleaning accuracy for XML duplicate data.

This dissertation studies the duplication problem in XML data. The research focuses on the detection and removal of duplicate data, with the aim of improving the accuracy and efficiency of XML duplicate data cleaning. The main contributions are as follows.

For the removal of duplicate data, the traditional Sorted Neighborhood Method (SNM) is optimized and the ICSNM method is proposed. Simulation experiments show that ICSNM improves on the original SNM in efficiency and in the evaluation indicators, making data cleaning more accurate and efficient.

For the detection of duplicate XML data, a recognition method is designed that constructs a Bayesian network for the XML data. When deciding whether two XML objects are duplicates, the method considers not only the duplicate probability of their child nodes but also the duplicate probability of all their descendants. Experiments show that the Bayesian network-based detection method achieves higher detection accuracy than the original method and better detects similar duplicate data in the data set.

Building on this work, an XML duplicate data cleaning method, X-SNM, is designed and proposed. A comparison with the DogmatiX method shows that X-SNM has clear advantages over DogmatiX in precision, recall, and time efficiency.
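For readers unfamiliar with the Sorted Neighborhood Method named above, the following is a minimal sketch of the classic SNM idea only: sort records by a key, then compare each record with its neighbours inside a sliding window. It is not the dissertation's ICSNM or X-SNM, whose details are not given in this abstract. The sample XML, the `record_key` and `sorted_neighborhood` names, the window size, the similarity threshold, and the use of `difflib` as the similarity measure are all illustrative assumptions.

```python
import xml.etree.ElementTree as ET
from difflib import SequenceMatcher

# Hypothetical sample data: records 0 and 2 describe the same book.
XML = """
<catalog>
  <book><title>Data Cleaning Basics</title><author>J. Smith</author></book>
  <book><title>XML in Practice</title><author>A. Jones</author></book>
  <book><title>Data Cleaning Basic</title><author>J Smith</author></book>
  <book><title>Bayesian Networks</title><author>L. Chen</author></book>
</catalog>
"""


def record_key(elem):
    """Build a sorting key from all text inside an element (an assumed key choice)."""
    parts = (t.strip() for t in elem.itertext())
    return " ".join(p for p in parts if p).lower()


def sorted_neighborhood(keys, window=2, threshold=0.85):
    """Classic SNM: sort by key, then compare records inside a sliding window.

    Returns a set of (i, j) index pairs, i < j, judged to be similar duplicates.
    """
    # Step 1: sort record indices by their key.
    order = sorted(range(len(keys)), key=lambda i: keys[i])
    pairs = set()
    # Step 2: compare each record with the (window - 1) records preceding it.
    for pos, i in enumerate(order):
        for j in order[max(0, pos - window + 1):pos]:
            similarity = SequenceMatcher(None, keys[i], keys[j]).ratio()
            if similarity >= threshold:
                pairs.add((min(i, j), max(i, j)))
    return pairs


books = ET.fromstring(XML).findall("book")
keys = [record_key(b) for b in books]
pairs = sorted_neighborhood(keys)
print(pairs)
```

The window limits comparisons to roughly `n * (window - 1)` pairs instead of all `n * (n - 1) / 2`. The trade-off is that duplicates whose keys sort far apart are missed, which is the kind of weakness that optimized SNM variants typically target.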
Keywords/Search Tags: Data Cleaning, XML Data, Duplication Detection, SNM, Bayesian Networks