
Classification-based Data Integration Method

Posted on: 2014-01-19; Degree: Master; Type: Thesis
Country: China; Candidate: X Zhang; Full Text: PDF
GTID: 2248330398957672; Subject: Computer application technology
Abstract/Summary:
The rapid development of information technology has led to explosive growth in the number of databases in recent years. These databases are generally developed independently and are autonomous; their data models and data structures may differ, and each serves only a limited application domain, with no consideration for information sharing and collaboration. For example, an enterprise may end up with a variety of heterogeneous databases because of acquisitions or the introduction of a new database system. Integrating these related databases is of great significance, especially in artificial intelligence and data sharing applications, yet integrating multiple heterogeneous data sources remains a challenging task. The purpose of data integration is to organize the useful data in distributed data sources under a given strategy, consolidating data of different sources and formats either logically or physically, so that users can access the data transparently and share it comprehensively. There are two types of data integration solutions: the virtual view approach and the materialized approach. In the virtual view approach, the system sends each query to the data sources; in the materialized approach, the query is applied directly to preprocessed data. Building on research into data integration systems and machine learning classification algorithms, we propose a classification-based data integration system and apply it to a Web question-answering retrieval system built on structured product information, which provides users with product comparison, information consulting, and automatic question answering in a user-friendly way.
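The difference between the two integration styles described above can be illustrated with a minimal Python sketch. It is not the thesis's system: the names `Record`, `Source`, `virtual_query`, and `MaterializedStore` are hypothetical, and each data source is assumed to be a simple function that returns its current records.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]
# Assumption: a source is modelled as a function returning its current records.
Source = Callable[[], List[Record]]
Predicate = Callable[[Record], bool]


def virtual_query(sources: List[Source], predicate: Predicate) -> List[Record]:
    """Virtual view: forward the query to every source at query time."""
    results: List[Record] = []
    for fetch in sources:
        results.extend(r for r in fetch() if predicate(r))
    return results


class MaterializedStore:
    """Materialized approach: copy the data once, then query the local copy."""

    def __init__(self, sources: List[Source]) -> None:
        # Any preprocessing (cleaning, schema mapping) would happen here.
        self.records: List[Record] = [r for fetch in sources for r in fetch()]

    def query(self, predicate: Predicate) -> List[Record]:
        return [r for r in self.records if predicate(r)]
```

In this sketch the virtual view always reflects the current state of the sources at the cost of contacting them for every query, while the materialized store answers from its local copy and has to be refreshed to see later changes.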
The system's data sources come from the Web, and the domain is limited to mobile phones and related electronic products. The data sources are built by crawling information from the Web, then preprocessing and preliminarily structuring the basic information, review information, and query information for the relevant products. We study schema matching and data merging in data integration in detail. First, the advantages and disadvantages of existing integration systems are analyzed and compared. Second, we study instance-based schema matching methods and design classification-based schema matching methods; for the specific setting of this thesis, we propose a schema matching algorithm based on word similarity. LSH and MinHash are introduced to reduce time and space complexity and to improve the efficiency of schema matching. Finally, we analyze the problems of inconsistency, semantic conflict, missing data, and truth finding across data sources, and study strategies for solving them. The methods are applied to the structured product-information Web question-answering retrieval system, and the experimental results show that the proposed method handles schema matching well and can merge heterogeneous data sources.
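As a rough illustration of how MinHash and LSH can cut the cost of instance-based schema matching, the sketch below compares attributes by the words in their instance values. It is not the algorithm from the thesis: the function names, the seeded MD5 hashing, and the parameters `num_hashes=64` and `bands=16` are all assumptions chosen for the example, and every attribute is assumed to have at least one word.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def tokens(values):
    """Collect the lowercase word set of an attribute's instance values."""
    return {w for v in values for w in v.lower().split()}


def _hash(seed, token):
    # Seeded, deterministic 64-bit hash (hypothetical choice for this sketch).
    digest = hashlib.md5(f"{seed}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(token_set, num_hashes=64):
    """Keep the minimum hash per seed; token_set must be non-empty."""
    return [min(_hash(seed, t) for t in token_set) for seed in range(num_hashes)]


def estimated_similarity(sig_a, sig_b):
    """Fraction of matching slots, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def lsh_candidates(signatures, bands=16):
    """Attributes that share at least one band bucket become candidate pairs."""
    if not signatures:
        return set()
    rows = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(set)
    for name, sig in signatures.items():
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].add(name)
    pairs = set()
    for names in buckets.values():
        pairs.update(combinations(sorted(names), 2))
    return pairs
```

Only the candidate pairs returned by `lsh_candidates` need a full similarity check, so most dissimilar attribute pairs are never compared directly; that is where the saving in time comes from, while the fixed-length signatures bound the memory used per attribute.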
Keywords/Search Tags: data integration, schema matching, data merging, classification, data conflict