Font Size: a A A

Research Of Dynamic XML Document Mining Using Frequently Changing Structures

Posted on:2011-01-14Degree:MasterType:Thesis
Country:ChinaCandidate:Z H LuoFull Text:PDF
GTID:2178360305455078Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
With the rapid development of Internet technology, people can receive and send information from anywhere in the world via the use of Internet. However, heterogeneity of the data formats greatly hindered the effective use of information. The emergence of Extensible Markup Language which can describe both the data content and structural characteristics give the solution. Today, a large number of the semi-structured data are represented by the form of XML, which has become the facto standard for information representation and exchange on the World Wide Web. In the face of rapid growth of XML data, it's in urgent need of effective data mining method to discover useful information.Data mining is the non-trivial process of extracting effective, implicit, novel, unknown and potentially valuable information from abundant, incomplete, noisy, vague and random data sets and ultimately generating understandable patterns. Traditional data mining primarily aim at relational data, object-oriented data, transaction data and non-structural data. XML is a self-described, semi-structured markup language which can describe a unique and novel complex data structures. Its characteristics of implicit information patterns, irregular structures and the lack of strict type constraint make the traditional data mining can not be competent. Lots of researches of XML data mining have been carried out at home and broad. The tasks of XML data mining contain frequent pattern mining, association rule mining, classification, clustering and so on. However, most studies are based on static XML documents. In practice, XML documents are in a frequently changing process with the changing of historical version. A series version of the XML documents imply changing information, static mining techniques can no longer deal with the dynamic changing information, there is an urgent need to work out the relevant dynamic XML data mining techniques. This paper focuses on the research of dynamic XML data mining techniques. The relevant research work is supported by scientific and technological development program of Jilin province (20090704)"research on the key techniques for semi-structured database".Dynamic XML data mining tasks include discovering interesting structures, association rule mining, classification, clustering and so on. Frequently changing structure is a kind of interesting structure. XML version changing is relevant with the property of dynamic and temporal. Traditional static XML document can not record such a changing process, so people had to retain the large number of XML document historical versions and then do analysis to the documents, which brought great difficulty to the dynamic XML data mining. From the XML document historical versions, we observed that many nodes did not change. If save lots of XML document versions, it will make many nodes duplicated and lead to waste space, the management of these XML documents has also become a problem. For the reason we only need to consider how we can record the dynamic changing messages of the XML.As time goes, XML documents will change frequently. Therefore, analyzing the shortcomings of the existed algorithm for mining frequently changing structure, this paper adopts the method of adding time information to construct the XML document and presents temporal XML model T-DOM which is suitable for FCS mining. Then using three parameters for measuring frequently changing structure, we present two data mining algorithms for discovering FCS in dynamic XML documents based on temporal model: Breadth-first traversal mining algorithm (FCSBF) and depth-first recursive traversal mining algorithm (FCSDF). Compared with existed FCS mining, temporal XML documents mining can save lots of time. Non-temporal XML temporal fail to reflect the dynamic nature of the XML data, so it need structural change detecting tool to compare two adjacent historical XML documents, and then map the generated structural delta to the H-DOM tree, at last it need to scan the H-DOM to get the FCS. This process is quite time-consuming, whereas the temporal model has put version structural delta into the document in the method of timestamp and then only need to scan T-DOM tree to generate FCS. In the storage space, non-temporal XML documents mining needs to store a large number of historical XML documents, in which many of the nodes do not change, so there are many redundant nodes. For the temporal XML document, all nodes are unique, so the lower the rate of node change, the more storage space is saved. FCSBF algorithm uses breadth-first strategy, the calculation for each node must traverse all of its descendant nodes, so some of the node will be scanned more than one time. But the breadth-first traversal can prune the T-DOM tree based on version dynamic threshold, the lower rate of change the more pruning, so it can save time for the traversal. FCSDF algorithm uses the depth-first strategy and only needs to recursively traverse the document once, while the non-leaf nodes need to be pushed into the stack and pop out of the stack frequently to receive time point table of the child nodes, the combining of time point table also need to waste some time. Experimental results show that the algorithms are efficient.A novel clustering algorithm DClustering is presented based on FCS mining. The algorithm uses FCS structures as feature vector of dynamic XML document and adopts the bottom-up agglomerative hierarchical clustering method for clustering dynamic XML documents which are presented by temporal model. We use artificially generated dynamic XML documents and vary the number of the documents to test the clustering algorithm. Experimental results show that the clustering method is effective and feasible. The main structure of this paper is as follows:The first chapter is the Introduction, which briefly describes the research background and significance of this paper, gives the present research status at home and abroad, and gives the main research content and the thesis structure.The second chapter illustrates the XML technology and XML data mining-related theoretical knowledge. First, the XML document structure, characteristics of semi-structured data and XML data and main XML technical specifications are described. Then we introduce the process of XML data mining, XML data model, tree-based XML mining framework and correlative dynamic XML data mining algorithm.The third chapter gives the temporal XML model T-DOM of dynamic XML data. We present two mining frequently changing structure algorithms: breadth-first traversal mining algorithm (FCSBF) and depth-first recursive traversal mining algorithm (FCSDF), give the efficiency analysis of the algorithm and carry out the experiments and performance evaluation.The fourth chapter presents an algorithm of DClustering which clusters dynamic XML documents presented by temporal model based on chapter III, using frequently changing structures as feature vector of document, meanwhile, experiments are performed and the results analysis are shown up.Chapter five is the conclusion and prospect, which sums up the paper work, puts forward the problems in the research process and future research work. Dynamic XML documents mining algorithms presented in the paper have broad application prospects, which can be used for large-scale XML documents change detection, improving the information retrieval efficiency, improving search engine results, dynamic-conscious XML caching and so on. Due to the limits of time and capacity, this paper only realizes the data mining algorithms of aiming at dynamic XML document be represented by temporal XML documents, however, such documents are not existence in practice. So some further research must be done, including its storage, retrieval, updating and other related data management methods. We will use real data rather than the artificial data to performance the experiments. Furthermore, we will do research on other related dynamic XML data mining algorithms.
Keywords/Search Tags:Dynamic XML Document, Data Mining, Frequently Changing Structure, Documents Clustering
PDF Full Text Request
Related items