Discovering Frequently Changing Structures From Historical Versions Of XML Document

Posted on:2011-12-29

Degree:Master

Type:Thesis

Country:China

Candidate:J F Guo

Full Text:PDF

GTID:2178360305955055

Subject:Computer software and theory

Abstract/Summary:

PDF Full Text Request

Data Mining (also known as Knowledge Discovery) is a non-trivial process, which extracts the implicit, unknown and potentially useful information and knowledge from the abundant, incomplete, noisy, and fuzzy dataset collected for some special applications. By analyzing and processing these data, we can extract the interesting knowledge, rules, or patterns we need, which can be applied to process control, high-level decision strategy, information inquiry processing, and eta.The Extensible Markup Language (XML) is developed based on Standard Generalized Markup Language. And it is a subset of SGML. XML, for its simplicity, standardization, self-describing, extensible and other features, has been identified as W3C standards (XML1.0) in February 1998. XML has been widely used as the de facto standard of electronic data representation and data exchange in the Internet and endorsed by IBM, micro, ORACLE and other well-known IT enterprises. With the increasing development of Internet technology, XML data are increasing explosively. So it is urgent to study a novel data mining technique which can discover valuable knowledge from mass XML data. However, XML data are typical of self-describing and semi-structured, which have no regular structures and often change. So the traditional data mining techniques for structure databases is not suitable for this kind of data. In order to address XML data mining problem, researchers presented a large number of algorithms to extract effective knowledge from abundant XML data.Frequent pattern mining technology is an important branch of data mining technology, covering the researches into sequences, trees and graphs, which is importantly suitable for medical information analysis, information retrieval, Web mining, and compound structure analysis. XML frequent pattern mining technology mainly contains frequent tag sequence mining, frequent sub tree mining and frequently changing structure mining and so on. The first two kinds of frequent pattern mining techniques are mainly developed for static snapshot XML data, which don't change after established. However, in practical applications, particularly with the rapidly growth of Web data in the Internet, Web data change as time goes on. On the basis of dynamicity, the researchers made exploratory researches on change detection algorithm for XML data, as well as frequent structural changes mining technology. They proposed some implicit and interesting objective knowledge, mainly including frequent changing structure, frozen structure, periodically changing structure, increasing or decreasing changing structure and so on, and put forward the basic framework for frequent structural changes mining. In this paper, we made some research on frequently changing structure mining.XML data change detection technology is an indispensable step in XML frequently changing structure mining. For a series of historical versions, we firstly need to obtain structural changing sequence between contiguous versions. Frequently changing structure mining is to discover the substructures changing frequently and significantly from a series of historical versions of the same XML document. The experiment shows that certain substructures'changing reaches the user's concern in the historically changing process, but it dose not truly reflect the structural changes.Based on the analysis of X-Diff, namely a change detection algorithm for contiguous orderly rooted XML document trees and traditional FCS mining algorithm, this work mainly proposed a meaningful knowledge discovery problem. Namely it is to discover the substructures in which changing degree of average depth and average width has reached our concern from a series of historical changes version. According to the tree structure's properties, we proposed a DOM data model with the depth and width parameters and a relative substructure mining algorithm, and conducted a large number of experiments.This first chapter gives an introduction of XML data mining research background and significance, briefly overviews current situation at home and abroad, as well as the three major frequent structure mining techniques, and presents this paper's research framework.The second chapter describes the basic knowledge of XML and frequently changing structure mining. The first part gives an introduction of XML standard's development and characteristic, XML document structure, and XML technology framework, and details two kinds of XML technical standards (DTD, DOM). The second part presents the fundamental concepts of frequent changing structure mining and FCS dynamic metrics. And finally, we put forward the problem to be studied.The third chapter, the first central part, details the DOM tree data model ADAW-DOM with the depth and width parameters. Firstly, we extendedly define the average depth and average width of tree structure, and propose three metrics measuring structure changes, including degree of spatial variation, degree of version changes, and degree of dynamic change. Both parameters are used to represent the tree's space structure. XML document is changing. The same XML document may contain multiple historical versions. If each version of an XML document is a DOM tree, there exist a large number of redundant data, and it will influence data storage efficiency and time efficiency of mining algorithm. ADAW-DOM data model is to compress a series of historical versions of the same XML document data into an XML document tree structure, in which each node is unique. In order to make substructure mining faster and more effectively, we store the version number and structure information of each node in different versions in the unique model. Combining the concepts and hierarchical structure, we can quickly locate and calculate the updated information after editing operations, such as insertion or deleting.The fourth chapter, the second central part, studies a changing structure discovery algorithm based on the average depth and average width. Firstly, it details ADAWFCS mining algorithm which contains two stages, namely creating ADAW-DOM structure and mining ADAWFCS substructure. Combining the hierarchical structure, the tree model is initialized using the tree post-order traversal strategy and updated; and then an ADAWFCS mining algorithm is given. Secondly, we theoretically analyze time complexity and space complexity of mining process in this paper. Experiments show that because the data model has high requirements of the memory, the algorithm is more suitable for medium and small-scale changing operation; In the process of changing document version of a randomly generated XML document, it can be effectively carried out to obtain required substructures. This experiment is run in Window XP environment, and debugged in the eclipse development tool using the JAVA programming language. And the test data are artificially generated.The final chapter summarizes the work done and the experimental results, and then proposes issues existing in the research and further researches.

Keywords/Search Tags:

Frequent Changing Structure Mining, Dynamic, Spatial Change, ADAW-DOM Model

PDF Full Text Request

Related items

1	Analysis Of Runoff Evolution Based On Data Mining Under Changing Environment
2	Research On Techniques Of Mining Frequent XML Patterns
3	Research Of Dynamic XML Document Mining Using Frequently Changing Structures
4	Research Of Data Mining Using XML Frequently Changing Sections
5	Study On Frequent Subtree Mining And Its Application In XML Mining
6	Study On Frequent Pattern Mining Algorithms And Pruning Strategies
7	Mining Temporal-Pattern In Spatial Multimedia Databases
8	The Techniques Research On Frequent Pattern Mining
9	Research On The Data Model And The Approaches To Data Mining In The Semi-structured Data
10	Mining Algorithm Based On Frequent Sub-graph Of The Multi-layer Index Structure