Design And Implementation Of Duplicate Objects Detection In XML Document

Posted on: 2012-06-11
Degree: Master
Type: Thesis
Country: China
Candidate: W Wang
GTID: 2218330362456506
Subject: Computer software design theory
Abstract/Summary:
With the rapid development of the Internet and information technology, XML documents are increasingly widely used as a data storage medium, and the problem of detecting duplicate XML elements has received considerable attention. The structural diversity of XML documents makes it difficult to detect similar XML elements. To remove duplicate elements from XML documents effectively, the rules for recognizing duplicate elements were studied, and a duplicate XML element detection system was designed and implemented.
The criteria for duplicate elements, the identification of similar strings, and the similarity calculation for XML elements were studied. The study concluded that the key problems in detecting duplicate XML elements are how to deal effectively with structural diversity and how to capture the complex dependencies between parent elements and their sub-elements. The detection system consists of three modules: a document pre-processing module, a similar-string identification module, and an XML element similarity calculation module.
To build the detection system, a top-down detection method with multiple filters was studied. The definition of duplicate XML element objects was derived from an analysis of the XML data storage structure. Pre-processing the documents resolves the problem of XML structural diversity to some extent; a variety of filters effectively reduces the number of string comparisons and XML element similarity calculations; and top-down traversal of the XML elements handles the dependencies between parent elements and sub-elements.
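The pipeline described above (pre-processing, a filter that avoids needless string comparisons, and a top-down similarity calculation) can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the function names, the length-based filter, the use of difflib for string similarity, the averaging of scores, and the 0.8 threshold are all assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET
from difflib import SequenceMatcher


def normalize(text):
    """Pre-processing (assumed): collapse whitespace and lowercase."""
    return " ".join((text or "").split()).lower()


def length_filter(a, b, threshold):
    """Cheap filter (assumed). SequenceMatcher.ratio() equals
    2*M / (len(a) + len(b)) with M <= min(len(a), len(b)), so a pair
    that fails this bound can never reach the threshold."""
    if not a and not b:
        return True
    return 2 * min(len(a), len(b)) / (len(a) + len(b)) >= threshold


def string_similarity(a, b, threshold):
    """Run the expensive comparison only for pairs that pass the filter.
    Filtered pairs score 0.0, which is a simplification of this sketch."""
    if not length_filter(a, b, threshold):
        return 0.0
    return SequenceMatcher(None, a, b).ratio()


def element_similarity(a, b, threshold):
    """Top-down comparison: children are compared only when the parent
    tags match. Each child of `a` is paired with its best match in `b`,
    so the score is not symmetric (another simplification)."""
    if a.tag != b.tag:
        return 0.0
    scores = []
    text_a, text_b = normalize(a.text), normalize(b.text)
    if text_a or text_b:
        scores.append(string_similarity(text_a, text_b, threshold))
    for child_a in a:
        best = max(
            (element_similarity(child_a, child_b, threshold) for child_b in b),
            default=0.0,
        )
        scores.append(best)
    return sum(scores) / len(scores) if scores else 1.0


def find_duplicates(parent, threshold=0.8):
    """Return index pairs of children of `parent` judged to be duplicates."""
    children = list(parent)
    pairs = []
    for i in range(len(children)):
        for j in range(i + 1, len(children)):
            if element_similarity(children[i], children[j], threshold) >= threshold:
                pairs.append((i, j))
    return pairs


# Hypothetical sample data, not taken from the thesis.
root = ET.fromstring(
    "<library>"
    "<book><title>Data Cleaning</title><author>W. Wang</author></book>"
    "<book><title>Data  cleaning</title><author>W Wang</author></book>"
    "<book><title>Compilers</title><author>A. Aho</author></book>"
    "</library>"
)
pairs = find_duplicates(root)
print(pairs)
```

In this sample the first two book elements differ only in letter case, spacing, and one punctuation mark, so they are the pair a detector of this kind would be expected to report. A real system would likely use several filters and a similarity measure tuned to the data.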
To generate experimental data, the Dirty XML Generator (DXG) tool was designed and implemented. To test the system and analyze its accuracy and the effectiveness of the filters, two types of dirty data, structural errors and string errors, were introduced with DXG; each filter was analyzed separately, and the test results were analyzed. The results show that the filters are effective and efficient and that the output of the detection system is identical to the expected (pre-test) results.
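The abstract does not describe how DXG works internally. As a rough illustration of the two kinds of dirty data it mentions, the following Python sketch injects a string error (an adjacent-character swap) or a structural error (a dropped sub-element) into a copy of an element. All function names and error models here are hypothetical and are not taken from the DXG tool.

```python
import copy
import random
import xml.etree.ElementTree as ET


def string_error(text, rng):
    """Swap two adjacent characters to simulate a typo (assumed error model)."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]


def structural_error(element, rng):
    """Remove one randomly chosen child to simulate a missing sub-element."""
    children = list(element)
    if children:
        element.remove(rng.choice(children))


def make_dirty_copy(element, rng, kind):
    """Return a dirty duplicate of `element`; the original is left untouched."""
    dirty = copy.deepcopy(element)
    if kind == "string":
        # Corrupt the first node that carries non-blank text.
        for node in dirty.iter():
            if node.text and node.text.strip():
                node.text = string_error(node.text, rng)
                break
    elif kind == "structural":
        structural_error(dirty, rng)
    else:
        raise ValueError(f"unknown error kind: {kind}")
    return dirty


# Hypothetical sample record; a fixed seed keeps runs repeatable.
rng = random.Random(0)
book = ET.fromstring(
    "<book><title>Data Cleaning</title><author>W. Wang</author></book>"
)
dirty_string = make_dirty_copy(book, rng, "string")
dirty_struct = make_dirty_copy(book, rng, "structural")
print(ET.tostring(dirty_string, encoding="unicode"))
print(ET.tostring(dirty_struct, encoding="unicode"))
```

Because the generator knows which copies it derived from which originals, the known duplicate pairs can serve as the expected results against which a detector's output is compared.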
Keywords/Search Tags:duplicate detection system, extensible markup language, similar string, multiple filters, top-down traversal