Research Of XML Semantic Clustering Based On Weighted Edge Set Comparison Algorithm

Posted on:2011-03-06

Degree:Master

Type:Thesis

Country:China

Candidate:L Liu

Full Text:PDF

GTID:2178360305450706

Subject:Computer software and theory

Abstract/Summary:

XML (eXtensible Markup Language) is simple, extensible, highly interoperable and open, and it is becoming a technology-independent standard and transmission format for data exchange. Compared with HTML, XML offers greater flexibility: it can tag unstructured text as well as highly structured data (e.g. data in a database). With the rapid growth of XML data on the Web, helping users retrieve useful information quickly and efficiently from large volumes of XML data is becoming an urgent problem.

Document clustering is an effective means of helping people retrieve information, so XML document clustering has become an active research topic. The key to XML document clustering is the measurement of document similarity. Because XML documents are semi-structured text whose information is partly carried by the document structure, not every text similarity algorithm is suitable for XML document clustering.

The current methods for computing XML document similarity are the element comparison method, the edge set comparison algorithm and the tree edit distance method. The element comparison method is simple and fast, but it considers only the number of nodes and ignores the structural complexity of the XML document tree, so its clustering results are not very satisfactory. The tree edit distance method takes into account both the complex structure of the XML document tree and the similarity of nodes, and it achieves good clustering results, but at a higher time complexity. The performance of the edge set comparison method lies between that of the element comparison method and that of the tree edit distance method.
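The difference between element comparison and edge set comparison can be illustrated with a minimal Python sketch. It is not taken from the thesis: the helper names (`element_set`, `edge_set`, `jaccard`), the two sample documents, and the use of the Jaccard coefficient as the set similarity are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

def element_set(root):
    """Set of distinct tag names in the document (structure is ignored)."""
    return {node.tag for node in root.iter()}

def edge_set(root):
    """Set of distinct (parent tag, child tag) pairs in the document tree."""
    return {(parent.tag, child.tag) for parent in root.iter() for child in parent}

def jaccard(a, b):
    """|a & b| / |a | b|, taken as 1.0 when both sets are empty (assumed measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

# Two hypothetical documents with the same tags but different nesting.
doc_a = ET.fromstring("<book><title/><author><name/></author></book>")
doc_b = ET.fromstring("<book><author/><title><name/></title></book>")

element_sim = jaccard(element_set(doc_a), element_set(doc_b))
edge_sim = jaccard(edge_set(doc_a), edge_set(doc_b))
print(element_sim, edge_sim)
```

In this example the two documents share every tag, so the element-based score cannot tell them apart, while the edge-based score drops because `name` hangs under a different parent in each tree.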
This thesis extends the edge set comparison method and proposes a weighted edge set comparison algorithm. The algorithm eliminates nested and repeated nodes from the XML document tree to obtain a simplified XML labeled tree, and it combines semantic information to measure the similarity between XML documents. Once the similarities among the XML trees have been computed, a classified clustering method is used to cluster the XML documents.

Building on the classic edge set comparison algorithm, this thesis makes the following contributions:

1. A weighted edge set comparison algorithm is proposed. Each edge of the XML summary tree is assigned a weight according to its structural complexity and level, which strengthens the importance of the structure and levels of the XML tree.
2. The new algorithm computes the similarity of the edges of XML labeled trees using semantic information, then obtains the set of semantically equivalent edges in order to determine the similarity between two XML labeled trees.

Experiments show that the semantic-based weighted edge set comparison algorithm produces better clustering results.
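The abstract does not give the weighting formula or the semantic resource, so the following Python sketch only illustrates the general idea under stated assumptions. The weight `1 / depth`, the weighted Jaccard ratio, and the hand-written `SYNONYMS` table are all hypothetical stand-ins, and the handling of nested nodes is omitted.

```python
import xml.etree.ElementTree as ET

# Hypothetical synonym groups standing in for a real semantic resource.
SYNONYMS = [{"author", "writer"}, {"title", "heading"}]

def canonical(tag):
    """Map a tag to one representative of its synonym group (assumed lookup)."""
    for group in SYNONYMS:
        if tag in group:
            return min(group)
    return tag

def weighted_edges(root):
    """Collapse repeated edges and weight each edge by 1/depth (assumed scheme).

    Edges leaving the root have depth 1, so shallower edges weigh more.
    """
    weights = {}
    stack = [(root, 1)]
    while stack:
        parent, depth = stack.pop()
        for child in parent:
            edge = (canonical(parent.tag), canonical(child.tag))
            weights[edge] = max(weights.get(edge, 0.0), 1.0 / depth)
            stack.append((child, depth + 1))
    return weights

def weighted_similarity(w1, w2):
    """Weighted Jaccard ratio over semantically equivalent edges (assumed measure)."""
    common = sum(min(w1[e], w2[e]) for e in w1.keys() & w2.keys())
    total = sum(max(w1.get(e, 0.0), w2.get(e, 0.0)) for e in w1.keys() | w2.keys())
    return common / total if total else 1.0

# Hypothetical documents: repeated <author> elements, synonymous tag names.
doc_a = ET.fromstring(
    "<book><title/><author><name/></author><author><name/></author></book>"
)
doc_b = ET.fromstring("<book><heading/><writer><email/></writer></book>")

edges_a = weighted_edges(doc_a)
edges_b = weighted_edges(doc_b)
similarity = weighted_similarity(edges_a, edges_b)
print(similarity)
```

Here the repeated `author` subtree contributes a single edge, the synonymous tags are treated as equivalent, and the mismatch at the second level (`name` versus `email`) costs less than a mismatch directly under the root would.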

Keywords/Search Tags:

Data Mining, XML Clustering, Edge Set Comparison Algorithm, Semantic Similarity

Related items

1	Semantic Framework Filling Research Based On Information Extraction
2	The Research On Semantic-driven Image Mining Using Statistical Learning
3	Research On The Key Technology Of The Price Comparison System Based On Semantic Similarity
4	The Study Of Clustering Algorithm In Case Data Mining
5	Research And System Development Of Content Duplicate Checking In E-business Website Based On Semantics
6	Research On Concept Semantic Similarity Comparison Method In OWL Ontologies
7	Study On Similarity-based Text Clustering Algorithm And Its Application
8	Methods And Applications Study Of Cluster-based Spatial Data Mining
9	Design And Implementation Of Coupon Price Comparison System Based On Android Mobile
10	Die Body Similarity Comparison Algorithm Research