Structure and content semantic similarity detection of extensible markup language documents using keys

Posted on:2011-01-07

Degree:Ph.D

Type:Dissertation

University:Missouri University of Science and Technology

Candidate:Viyanon, Waraporn

Full Text:PDF

GTID:1448390002970170

Subject:Computer Science

Abstract/Summary:

XML (eXtensible Markup Language) has become the fundamental standard for efficient data management and exchange. Because XML is so widely used to describe and exchange data on the web, XML-based comparison is a central issue in database management and information retrieval. Although many heterogeneous XML sources have similar content, they may be described using different tag names and structures.

This work proposes a series of algorithms for detecting structural and content changes among XML data. The first is an algorithm called XDoI (XML Data Integration Based on Content and Structure Similarity Using Keys) that clusters XML documents into subtrees using leaf-node parents as clustering points. This algorithm matches subtrees using the key concept and compares unmatched subtrees for similarities in both content and structure. Experimental results show that this approach finds considerably more accurate matches whether or not keys are present in the subtrees. A second algorithm proposed here is called XDI-CSSK (a system for detecting XML similarity in content and structure using a relational database); it eliminates unnecessary clustering points using instance statistics and a taxonomic analyzer. Because fewer subtrees need to be compared, the overall execution time drops dramatically. Semantic similarity plays a crucial role in precise computational similarity measures. A third algorithm, called XML-SIM (structure and content semantic similarity detection using keys), builds on previous work to detect XML semantic similarity based on structure and content. This algorithm is an improvement over XDI-CSSK and XDoI in that it determines content similarity based on semantic structural similarity. In an experimental evaluation, it outperformed previous approaches in terms of both execution time and false positive rates.

Information changes periodically; therefore, it is important to be able to detect changes among different versions of an XML document and to use that information to identify semantic similarities. Finally, this work introduces an approach that detects XML similarity, and thus joins XML document versions, using a change detection mechanism. In this approach, subtree keys remain important for avoiding unnecessary subtree comparisons across multiple versions of the same document. Real data sets from bibliographic domains demonstrate the effectiveness of all these algorithms.
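The key-first matching idea described in the abstract can be illustrated with a minimal Python sketch. This is not the dissertation's XDoI implementation: the sample documents, the `isbn` key tag, the 0.8 threshold, and the use of `difflib.SequenceMatcher` as the content similarity measure are all assumptions made for illustration. The sketch clusters each document into subtrees rooted at leaf-node parents, pairs subtrees that share a key value, and compares only the remaining subtrees by content.

```python
import xml.etree.ElementTree as ET
from difflib import SequenceMatcher

# Hypothetical sample documents: similar content, different tag names.
DOC_A = """<dblp>
  <article><isbn>111</isbn><title>XML Similarity Detection</title></article>
  <article><title>Change Detection in XML</title></article>
</dblp>"""

DOC_B = """<library>
  <book><isbn>111</isbn><name>Detecting XML Similarity</name></book>
  <book><name>Change detection in XML</name></book>
</library>"""


def leaf_parent_subtrees(root):
    """Cluster a document: keep each element that has at least one leaf child."""
    return [el for el in root.iter()
            if any(len(child) == 0 for child in el)]


def leaf_values(subtree):
    """Map each direct leaf child's tag to its stripped text."""
    return {c.tag: (c.text or "").strip() for c in subtree if len(c) == 0}


def content_similarity(a, b):
    """Average, over a's leaf values, of the best string match in b's leaf values.

    Stand-in measure (an assumption), not the one used in the dissertation.
    """
    va = list(leaf_values(a).values())
    vb = list(leaf_values(b).values())
    if not va or not vb:
        return 0.0
    best = [max(SequenceMatcher(None, x.lower(), y.lower()).ratio() for y in vb)
            for x in va]
    return sum(best) / len(best)


def match_subtrees(base, target, key_tag, threshold=0.8):
    """Match by key first, then compare only the leftover subtrees by content."""
    by_key = {}
    for t in target:
        key = leaf_values(t).get(key_tag)
        if key:
            by_key.setdefault(key, t)

    matches, unmatched = [], []
    for b in base:
        key = leaf_values(b).get(key_tag)
        if key and key in by_key:
            matches.append((b, by_key[key], "key"))
        else:
            unmatched.append(b)

    # Only target subtrees not already claimed by a key match are compared.
    used = {id(t) for _, t, _ in matches}
    candidates = [t for t in target if id(t) not in used]
    for b in unmatched:
        if not candidates:
            break
        best = max(candidates, key=lambda t: content_similarity(b, t))
        if content_similarity(b, best) >= threshold:
            matches.append((b, best, "content"))
            candidates.remove(best)
    return matches


base = leaf_parent_subtrees(ET.fromstring(DOC_A))
target = leaf_parent_subtrees(ET.fromstring(DOC_B))
matches = match_subtrees(base, target, key_tag="isbn")
for b, t, how in matches:
    print(how, leaf_values(b), "<->", leaf_values(t))
```

Matching on keys first means the more expensive content comparison runs only on subtrees without a key match, which is the motivation the abstract gives for using keys to avoid unnecessary subtree comparisons.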

Related items

1	Design And Implementation Of Electronic Document Sensitive Information Detection System Based On Content Similarity
2	Design And Implement Of Duplicate Document Detection Based On Similarity Estimation
3	Research Of Copy Detection Of Chinese Scientific Papers Base On Text Structure And Content
4	Application Of Document Similarity Detection In Enterprise Document Leakage Prevention
5	Chinese Document Content Similarity Detection Methods Research
6	Research And Application On Document Similarity Detection Based On Minwise Hashing
7	Research On Semantic Similarity Computation And Applications
8	Web Page Structure Similarity Algorithms And Applications
9	Document analysis: Table structure understanding and zone content classification
10	Automatic Update Of Ontology Concept Hierarchy With Structure-Content Similarity Measurement