Research on Structural Similarity of XML Documents and Its Application in Text Clustering

Posted on: 2008-07-06
Degree: Master
Type: Thesis
Country: China
Candidate: L J Li
Full Text: PDF
GTID: 2178360212494646
Subject: Computer system architecture
Abstract/Summary:
In recent years, as the informatization of society has deepened, people have come to demand and rely on information more and more. How to obtain useful information quickly and effectively from massive information resources has become a focus of research and a great challenge for information retrieval. Because similarity among documents is the basis of information retrieval, data mining, and deeper intelligent processing, research on document similarity is very important: it determines the accuracy of information retrieval results.

XML is being applied more and more widely because it is self-describing, structured, and supports nested structure. It has emerged as a standard for representing and exchanging data on the World Wide Web. XML is a typical form of semi-structured data that can express both relations and structure, and it is used extensively in data exchange and integration. How to calculate the similarity of XML documents, and especially their structural similarity, is therefore the major task studied here.

As XML is applied more deeply, and because XML documents use nested structure to express the semantic information of elements, traditional similarity methods no longer meet the need. Traditionally, a tree model (document object trees) is used to represent XML documents, and document similarity is calculated using the tree edit distance. However, computing structural similarity between documents on the tree model is very expensive, especially when the documents are large and complex, and it does not deal effectively with repeated elements in documents.
On the other hand, a search engine on the Internet returns a huge amount of information, but people usually care only about the first 20 results. How to increase the number of results among the first 20 (or first n) that are relevant to the user, that is, how to enhance retrieval precision, is another difficult research problem.

To address these problems, this thesis proposes an alternative model, derived from the tree model, for representing the structural information of a document. It is called the tree path model, and a corresponding similarity measure is defined for it. The model simplifies the problem and reduces its complexity, while remaining powerful enough to distinguish structurally similar documents.

First, we propose a similarity measure based on tree paths to calculate the similarity between XML documents. This measure simplifies the description of an XML document and accordingly reduces the computational complexity. It is powerful enough to distinguish structurally similar documents when different clusters have very different structures, and it also performs well on document clustering.

Second, this basic model retains information on all parent-child relationships but ignores sibling relationships, ignores the weight of tree paths, and uses only complete matching when computing the similarity between tree paths. We therefore also propose a stronger model to resolve these problems. The improved measure deals with repeated elements (in this thesis, repeated tree paths) and leads to similarity scores that are more intuitive than those generated by traditional similarity measures.

Finally, the thesis reports various experiments on the approach. The experimental results show that, compared with the traditional method, it is better able to identify XML documents that have the same structure and to put them into the same cluster in text clustering. Text clustering based on the tree path model works fairly well and improves the precision of information retrieval.
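The abstract does not give the formula for the tree path similarity measure, so the following is only a minimal sketch of the general idea, under stated assumptions. It assumes that a "tree path" is a root-to-leaf sequence of element tags, and it uses Jaccard overlap as a stand-in similarity; both choices, and the function names `tree_paths`, `set_similarity`, and `multiset_similarity`, are illustrative and not taken from the thesis. The multiset variant is one possible way to account for repeated tree paths.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def tree_paths(xml_text):
    """Count root-to-leaf tag paths (assumed definition of a 'tree path')."""
    root = ET.fromstring(xml_text)
    paths = Counter()

    def walk(node, prefix):
        path = prefix + "/" + node.tag
        children = list(node)
        if not children:
            # A leaf element closes one path; repeated paths are counted.
            paths[path] += 1
        for child in children:
            walk(child, path)

    walk(root, "")
    return paths


def set_similarity(xml_a, xml_b):
    """Jaccard overlap of distinct paths (assumption: ignores repetition)."""
    pa, pb = set(tree_paths(xml_a)), set(tree_paths(xml_b))
    return len(pa & pb) / len(pa | pb)


def multiset_similarity(xml_a, xml_b):
    """Jaccard overlap of path multisets (assumption: counts repeated paths)."""
    ca, cb = tree_paths(xml_a), tree_paths(xml_b)
    shared = sum((ca & cb).values())  # per-path minimum of the counts
    total = sum((ca | cb).values())   # per-path maximum of the counts
    return shared / total


# Hypothetical sample documents, used only for illustration.
doc_a = "<book><title/><author/><author/></book>"
doc_b = "<book><title/><author/></book>"
doc_c = "<article><title/><body><p/></body></article>"

if __name__ == "__main__":
    print(set_similarity(doc_a, doc_b))
    print(multiset_similarity(doc_a, doc_b))
    print(set_similarity(doc_a, doc_c))
```

In this sketch, `doc_a` and `doc_b` have the same set of distinct paths, so the set-based score cannot tell them apart, while the multiset score is lower because `doc_a` repeats the `/book/author` path. Extracting the paths takes one traversal of each document, which is the kind of saving over tree edit distance that the abstract describes.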
Keywords/Search Tags: XML, Tree Path, Structural Similarity, Text Clustering