
Study on ETL Technology Based on XML Data Resources

Posted on: 2010-12-05, Degree: Master, Type: Thesis
Country: China, Candidate: N Sun, Full Text: PDF
GTID: 2178360272985313, Subject: Computer application technology
Abstract/Summary:
After years of information system construction, enterprises and their internal organizations have accumulated a large number of legacy systems. These systems were not designed or operated in a systematic and consistent way, so they have produced large amounts of heterogeneous information that cannot be shared or exchanged effectively. An XML-based information integration platform has therefore become necessary. During information integration it is difficult to guarantee data quality, and poor data quality reduces the reliability of decision analysis. ETL (extraction, transformation, cleaning, and loading) therefore plays a very important role: it turns data from the source systems into information that is useful for decision support, which makes the study of XML-based ETL technology particularly important. The main contributions of this thesis are as follows. Firstly, based on an analysis of the advantages of combining XML, the Common Warehouse Metamodel (CWM), and information integration, it proposes a CWM-based extraction scheme for information integration; a structured information integration framework based on CWM is constructed; a Wrapper is designed on top of the metamodel, which is a common model independent of any particular implementation; and the Wrapper maintenance problem caused by changes to data sources in structured information integration is resolved. Secondly, based on an analysis of existing XML data similarity detection techniques, it proposes a detection method that combines node weighting with tree edit distance: the method first coarsely matches and groups the data by computing the similarity of weighted XML trees, and then detects similar records within each group using tree edit distance.
Preprocessing the XML data avoids unnecessary tree edit operations, so the time complexity is reduced significantly. Thirdly, to put the theory into practice, a simulation experiment is carried out on an application instance. The architecture of an integrated system for special equipment is constructed, a uniform metadata format for data extraction during information integration is designed, and the required XML data is extracted. In addition, to verify the approximately duplicated record detection method proposed in this thesis, XML data conforming to different DTDs is extracted from the tables of the underlying databases and checked for approximate duplicates, which verifies the technologies proposed in this thesis.
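
The two-stage detection idea described in the abstract (coarse grouping by weighted-tree similarity, then tree edit distance inside each group) can be illustrated with a minimal sketch. This is not the thesis's actual algorithm, which the abstract does not specify in detail. The depth-based weight `0.5 ** depth`, the weighted-overlap similarity, the greedy grouping, both thresholds, and all function names are assumptions made for illustration. The edit distance is a plain memoized recursion over ordered forests with unit costs, not an optimized algorithm.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from functools import lru_cache
from itertools import combinations


def to_tree(elem):
    """Convert an Element into a hashable (label, children) tuple; text becomes a leaf."""
    kids = tuple(to_tree(child) for child in elem)
    text = (elem.text or "").strip()
    if text:
        kids = ((text, ()),) + kids
    return (elem.tag, kids)


def size(tree):
    """Number of nodes in a tree."""
    return 1 + sum(size(kid) for kid in tree[1])


def weighted_bag(tree, depth=0):
    """Multiset of (depth, label) pairs; depth is used later to weight each node."""
    bag = Counter({(depth, tree[0]): 1})
    for kid in tree[1]:
        bag += weighted_bag(kid, depth + 1)
    return bag


def bag_weight(bag):
    # Assumed weighting: nodes closer to the root count more (0.5 ** depth).
    return sum(0.5 ** depth * count for (depth, _), count in bag.items())


def coarse_similarity(t1, t2):
    """Cheap weighted overlap in [0, 1], used only to group candidate records."""
    b1, b2 = weighted_bag(t1), weighted_bag(t2)
    return bag_weight(b1 & b2) / bag_weight(b1 | b2)


@lru_cache(maxsize=None)
def forest_distance(f1, f2):
    """Unit-cost edit distance between two ordered forests (tuples of trees)."""
    if not f1:
        return sum(size(t) for t in f2)   # insert everything left in f2
    if not f2:
        return sum(size(t) for t in f1)   # delete everything left in f1
    (a, kids1), (b, kids2) = f1[-1], f2[-1]
    return min(
        forest_distance(f1[:-1] + kids1, f2) + 1,   # delete rightmost root of f1
        forest_distance(f1, f2[:-1] + kids2) + 1,   # insert rightmost root of f2
        forest_distance(f1[:-1], f2[:-1])           # match the two rightmost roots
        + forest_distance(kids1, kids2) + (a != b), # relabel costs 1 if labels differ
    )


def tree_edit_distance(t1, t2):
    return forest_distance((t1,), (t2,))


def find_duplicates(xml_records, coarse_threshold=0.6, fine_threshold=0.3):
    """Return index pairs of records judged approximately duplicated."""
    trees = [to_tree(ET.fromstring(x)) for x in xml_records]
    groups = []  # each group is a list of indices; the first is its representative
    for i, tree in enumerate(trees):
        for group in groups:
            if coarse_similarity(trees[group[0]], tree) >= coarse_threshold:
                group.append(i)
                break
        else:
            groups.append([i])
    pairs = []
    for group in groups:  # the expensive edit distance runs only inside a group
        for i, j in combinations(group, 2):
            dist = tree_edit_distance(trees[i], trees[j])
            if dist / max(size(trees[i]), size(trees[j])) <= fine_threshold:
                pairs.append((i, j))
    return pairs
```

With three hypothetical records, two of which differ in a single text value, only the first two fall into the same group and are compared by edit distance; the third is never compared, which is where the saving from preprocessing comes from.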
Keywords/Search Tags: Information integration, eXtensible Markup Language data sources, Common Warehouse Metamodel, Extract-Transform-Load, Tree edit distance, Approximately duplicated records