
The Research And Implementation Of An XML Document Structural Clustering Algorithm Using Frequent Path Pattern

Posted on: 2011-11-16
Degree: Master
Type: Thesis
Country: China
Candidate: Y Yu
GTID: 2178360305955345
Subject: Software engineering
Abstract/Summary:
Text analysis is a key data mining application and can be applied to web data mining and XML data mining. Through text analysis, web pages and XML documents can be classified or clustered, and text similarity is an important criterion for that classification or clustering.
Data mining, also known as knowledge discovery in data, is the process of automatically extracting representative patterns of knowledge from large amounts of data using tools such as classification, association rule mining, and clustering. It is a means of analyzing large amounts of data and deriving valuable patterns and trends to support decision making.
With the widespread use of XML technology and the growing volume of XML data, XML has become the standard language for data exchange and data representation on the Internet. How to manage and analyze XML data effectively has become a hot research topic. XML documents, a form of semi-structured data, are used to represent and store XML data, and both their structural information and their content information play an important role in describing that data. As data mining technology is applied to XML documents, structural clustering of XML documents plays an increasingly important role in areas such as information retrieval and data processing.
Current XML data mining includes frequent substructure mining, classification, and association rule mining. Of these directions, frequent substructure mining has received more research attention. The substructures that occur frequently across a set of XML documents have practical value in efficient retrieval and in XML integration applications. Many clustering methods exist for XML documents, but few exist for XML schema languages.
Suppose a large number of XML documents conform to one XML model (a DTD, an XML Schema, and so on). Such documents are usually of the same type, describing the same kind of goods or the same kind of data. In this case, clustering the XML models can take the place of clustering the XML documents that conform to them. The study of clustering techniques for XML models is therefore significant, and our study selects XML Schema as its object and examines clustering techniques for XML Schema.
Common knowledge discovery tasks include association rule mining, classification, and clustering. XML documents are widely used in web technology, database technology, and other fields. XML document clustering currently has two main directions: the first clusters the DOM tree structure of XML documents, and the second clusters the node paths of XML documents. The two differ in focus. The former mainly clusters the document itself, while the latter, by analyzing the paths of the relevant nodes, can both determine whether document structures are similar and analyze similar nodes.
In addition, XML document clustering should be based not only on document structure but also on the meaning of the element names of nodes.
Therefore, a complete XML document clustering method should have two parts: structural clustering and element clustering. An ideal clustering should follow practical needs: different applications place different demands on structural similarity and on the semantic similarity of elements, and balancing the two is a problem worth studying.
This thesis describes how XML data mining differs from mining traditional structured data and text data, discusses several sequential pattern mining and clustering algorithms and the similarity calculations used in clustering, and on this basis implements PBClustering, an XML document structural clustering algorithm based on frequent path features. The main work addresses the shortcomings of PBClustering on high-noise data and large data volumes. After working through the basic ideas and implementation steps of PBClustering, the thesis improves and optimizes the algorithm by adjusting the similarity calculation, weighting features, and enhancing scalability.
Experimental results show that the improved algorithm raises clustering accuracy by 10% to 20% over the original algorithm and is dramatically more time-efficient. This research on XML document clustering lays a solid foundation for further study of data mining analysis of XML documents and of XML information retrieval techniques.
The thesis applies the PBClustering algorithm to cluster nearly 5,000 documents from the INEX XML Mining Competition [10], and the result is not satisfactory.
The major work of this thesis is therefore to improve the PBClustering algorithm so that it yields satisfactory clustering results.
Future research should, on the one hand, integrate the semantic information of XML content to improve the clustering of XML documents and, on the other hand, build on XML clustering technology and the characteristics of XML itself to explore broader applications in XML querying and information retrieval.
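The idea of clustering XML documents by frequent path features can be illustrated with a small sketch. This is not the PBClustering algorithm itself, whose details the abstract does not give. It assumes that a document is represented by its set of root-to-element tag paths, that a path counts as frequent when it appears in at least a chosen fraction of the documents, and that similarity is plain Jaccard overlap on those frequent paths. The function names, the sample documents, and the 0.5 support threshold are all hypothetical, and feature weighting is left out.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def root_paths(xml_text):
    """Return the set of root-to-element tag paths in one XML document."""
    paths = set()

    def walk(node, prefix):
        path = prefix + "/" + node.tag
        paths.add(path)
        for child in node:
            walk(child, path)

    walk(ET.fromstring(xml_text), "")
    return paths


def frequent_paths(doc_paths, min_support):
    """Keep paths found in at least min_support (a fraction) of the documents."""
    counts = Counter(p for paths in doc_paths for p in paths)
    threshold = min_support * len(doc_paths)
    return {p for p, c in counts.items() if c >= threshold}


def similarity(a, b, features):
    """Jaccard similarity of two path sets, restricted to the frequent paths."""
    fa, fb = a & features, b & features
    union = fa | fb
    return len(fa & fb) / len(union) if union else 0.0


# Hypothetical sample documents: two "book" records and two "cd" records.
docs = [
    "<book><title/><author/></book>",
    "<book><title/><author/><price/></book>",
    "<cd><title/><artist/></cd>",
    "<cd><title/><artist/><year/></cd>",
]
doc_paths = [root_paths(d) for d in docs]
features = frequent_paths(doc_paths, min_support=0.5)

print(sorted(features))
print(similarity(doc_paths[0], doc_paths[1], features))  # two book documents
print(similarity(doc_paths[0], doc_paths[2], features))  # book versus cd
```

In this sketch the rare paths `/book/price` and `/book/year`-style outliers such as `/cd/year` fall below the support threshold, so they do not affect similarity. That is one simple way a frequent-path representation can dampen noise; any clustering algorithm that accepts a pairwise similarity could then group the documents.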
Keywords/Search Tags: XML document clustering, frequent itemset mining, sequential pattern mining