| XML is a new standard for publishing and exchanging information on the Internet, and storing and querying XML data effectively in relational databases has become an important problem in the XML field. Today, nearly all commercial database products, such as SQL Server, Oracle and DB2, have been extended to support storing and querying XML and publishing data in XML format. Generally, an RDBMS supports XML data in one of two ways: it either stores the content of XML documents in the database, or stores the documents as whole files and keeps indexes of those files in the database. However, with the rapid expansion of XML applications and the growing volume of XML data, documents may come from different data sources, so their schemas (DTD or XML Schema) may differ and the documents are heterogeneous. Current RDBMS products are therefore limited when used to store such data. That is, they cannot efficiently integrate XML documents that conform to different DTDs into relational storage. This paper studies the problem of integrating many XML documents that conform to different DTDs into relational storage. The main research work is as follows. 1. This paper presents an efficient method for mining embedded frequent trees in a forest of XML documents. The XML documents are first preprocessed to obtain SSTs (Simplest Structural Trees), and frequent trees are then mined from the SSTs. We improve TreeMiner by removing its bottleneck and exploiting the structural characteristics of SSTs, and present the resulting algorithm, called SSTMiner. Experiments show that this method mines frequent trees in XML documents efficiently. 2. This paper studies XML document clustering using frequent structures, which include frequent paths and frequent trees. Frequent structures are first mined from the XML documents.
Because SSTMiner efficiently mines all embedded frequent trees in XML documents, it can be modified slightly to produce the FrePathMiner and FreTreeMiner algorithms, which mine common frequent paths and common frequent trees, respectively. Then, using common frequent paths and common frequent trees to characterize the XML documents, an agglomerative hierarchical clustering algorithm called XMLCluster is proposed to cluster them. Finally, an experiment is conducted in which three methods, FrePathMiner, FreTreeMiner and the traditional ASPMiner (Adapted Sequential Pattern Miner), are used for XML document clustering. The results show that both FrePathMiner and FreTreeMiner achieve higher clustering precision than ASPMiner. 3. This paper presents a clustering-based solution for integrating XML documents into relational storage. The integration process is divided into two steps. The first step, schema mapping, generates the integrated relational schema. The second step, XML storage, extracts all XML information into the relational database, so that the XML documents are eventually integrated into the relational database efficiently. Finally, this paper presents the complete framework of the clustering-based XML document integration system and shows, through detailed examples, that the method is efficient. |
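The idea of characterizing XML documents by frequent paths and then clustering them agglomeratively can be illustrated with a minimal sketch. It is not SSTMiner, FrePathMiner or XMLCluster, whose details are not given here. It assumes plain root-to-node tag paths (contiguous, not embedded), a document-count support threshold, Jaccard similarity and average linkage. All function names, parameters and thresholds are hypothetical choices made for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from itertools import combinations


def tag_paths(xml_text):
    """Return the set of root-to-node tag paths in one XML document."""
    root = ET.fromstring(xml_text)
    paths = set()
    stack = [(root, root.tag)]
    while stack:
        node, path = stack.pop()
        paths.add(path)
        for child in node:
            stack.append((child, path + "/" + child.tag))
    return paths


def frequent_paths(path_sets, min_support):
    """Keep the paths that occur in at least min_support documents."""
    counts = Counter(p for s in path_sets for p in s)
    return {p for p, c in counts.items() if c >= min_support}


def jaccard(a, b):
    """Jaccard similarity of two sets (assumed measure; 0.0 if both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster(path_sets, min_support=2, min_sim=0.5):
    """Average-linkage agglomerative clustering over frequent-path features.

    Merging stops once the most similar pair of clusters falls below min_sim.
    Returns a list of clusters, each a sorted list of document indices.
    """
    freq = frequent_paths(path_sets, min_support)
    feats = [s & freq for s in path_sets]  # describe each document by its frequent paths
    clusters = [[i] for i in range(len(feats))]

    def linkage(c1, c2):
        sims = [jaccard(feats[i], feats[j]) for i in c1 for j in c2]
        return sum(sims) / len(sims)

    while len(clusters) > 1:
        pairs = combinations(range(len(clusters)), 2)
        (i, j), best = max(
            (((i, j), linkage(clusters[i], clusters[j])) for i, j in pairs),
            key=lambda t: t[1],
        )
        if best < min_sim:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return [sorted(c) for c in clusters]


# Illustrative documents following two different (made-up) schemas.
docs = [
    "<book><title/><author/></book>",
    "<book><title/><author/><year/></book>",
    "<order><item><price/></item></order>",
    "<order><item><price/><qty/></item></order>",
]
print(cluster([tag_paths(d) for d in docs]))  # [[0, 1], [2, 3]]
```

In this toy input the two `book` documents share all of their frequent paths, as do the two `order` documents, so each pair merges and the two groups stay apart. Each resulting cluster could then, in principle, be mapped to its own integrated relational schema.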