
Research on Data Fusion for Deep Web Data Integration

Posted on: 2013-04-20
Degree: Master
Type: Thesis
Country: China
Candidate: H T Dou
GTID: 2248330374981396
Subject: Computer software and theory
Abstract/Summary:
With the development of Internet technology, the Web contains more and more information and has become a huge, widespread, global online information source. In recent years especially, large databases of all kinds have been established to meet personal and commercial demand, making the Web an indispensable part of people's lives. Information on the Web is disordered and its types are complex, so the Web can be divided into the Surface Web and the Deep Web according to how the data is accessed. The Surface Web is the set of static pages that traditional search engines can index by following hyperlinks; the Deep Web consists of online Web databases whose contents are hidden behind query interfaces and cannot be indexed by traditional search engines. Studies indicate that the Deep Web holds a large amount of strongly thematic, highly structured information covering a wide range of domains. To make full use of these valuable resources for further analysis and mining, Deep Web data urgently needs to be integrated.

In many domains, the amount of Web information is growing rapidly and the types of data sources are proliferating. However, this information is not always credible, and different data sources often provide heterogeneous or conflicting data. Information integration therefore faces a major challenge: how to obtain the information that is really useful to the user from these vast amounts of data. Data fusion addresses this challenge by distinguishing true values from false ones and producing high-quality data for analysis and decision-making.

In recent years, data fusion has attracted increasing attention, and researchers have achieved many results in this area.
At present, data fusion still has many problems to be solved:

(1) The quality of Deep Web data sources varies, and so does the quality of the data they provide. Values provided by high-quality data sources usually deserve higher confidence. The quality of Deep Web data sources therefore needs to be estimated before data fusion, and the estimates applied to the truth discovery process.

(2) There is no standard, complete method for data fusion, so data conflicts must be resolved and true values found by considering the accuracy of data sources, the dependence between data sources, the implication between values, and other factors.

This thesis addresses data integration for the Deep Web, with research on quality estimation of Deep Web data sources and on truth discovery methods. The main work and contributions are as follows:

1. This thesis proposes a quality estimation model for Deep Web data sources (DSQ). Deep Web data sources differ greatly, and sources of different quality often provide data of different quality. However, most current data fusion research does not estimate the quality of Deep Web data sources specifically; instead, it assigns the same quality to every data source at the start of the calculation and refines the source qualities through an iterative algorithm. To improve data fusion, we propose a quality estimation method for Deep Web data sources. Based on the characteristics of data fusion, the estimation model selects three dimensions of factors (data quality, interface quality, and service quality) as estimation criteria, quantifies each quality factor, and then scores the quality of each data source to obtain the estimation results. We then apply the estimation results to the data fusion process.
Experiments show that the model accurately estimates the quality of Deep Web data sources and significantly improves data fusion.

2. This thesis proposes a truth discovery method for Deep Web data integration. In many domains, the amount of Deep Web information is growing rapidly, and the data sources provide a great deal of conflicting data. It is therefore very important to find, among the conflicting data, the correct information that is really useful to the user. Drawing on our research background (data integration for market intelligence), we propose a data fusion calculation model for Deep Web data integration. The model finds true values among conflicting data by considering three factors: the accuracy of data sources, the dependence between data sources, and the implication between values. Because these factors interact, we compute them iteratively, refining their values until the results converge. We also apply the quality estimation results for data sources to the model. Experiments show that the truth discovery method is highly efficient.
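
The abstract gives no formulas, so the following is only a minimal sketch of the general idea, not the thesis's DSQ model or its truth discovery algorithm. It assumes a weighted sum over the three quality dimensions and a simple accuracy-weighted voting loop seeded with those scores. The function names, the weights, and the smoothing rule are all hypothetical, and the dependence between data sources and the implication between values are deliberately left out.

```python
from collections import defaultdict


def quality_score(data_q, interface_q, service_q, weights=(0.5, 0.25, 0.25)):
    """Combine three quality dimensions (each assumed in [0, 1]) into one score.

    The weighted sum and the default weights are illustrative assumptions.
    """
    w_data, w_interface, w_service = weights
    return w_data * data_q + w_interface * interface_q + w_service * service_q


def discover_truth(claims, prior, max_iter=50, tol=1e-6):
    """Iteratively estimate true values and source accuracies.

    claims: {object_id: {source_id: value}}, values assumed to be strings
    prior:  {source_id: initial accuracy}, e.g. taken from quality_score
    Returns (truths, accuracy).
    """
    accuracy = dict(prior)
    truths = {}
    for _ in range(max_iter):
        # Step 1: for each object, pick the value with the largest
        # accuracy-weighted support (ties broken by sorted value order).
        truths = {}
        for obj, by_source in claims.items():
            votes = defaultdict(float)
            for src, val in by_source.items():
                votes[val] += accuracy[src]
            truths[obj] = max(sorted(votes), key=votes.get)

        # Step 2: re-estimate each source's accuracy as the share of its
        # claims that match the current truths, smoothed by its prior
        # (the prior counts as one pseudo-observation; an assumption).
        new_accuracy = {}
        for src in accuracy:
            made = [(obj, vals[src]) for obj, vals in claims.items() if src in vals]
            if not made:
                new_accuracy[src] = accuracy[src]
                continue
            hits = sum(1 for obj, val in made if truths[obj] == val)
            new_accuracy[src] = (hits + prior[src]) / (len(made) + 1)

        # Step 3: stop once the accuracies no longer change.
        delta = max(abs(new_accuracy[s] - accuracy[s]) for s in accuracy)
        accuracy = new_accuracy
        if delta < tol:
            break
    return truths, accuracy
```

With one source seeded at a high quality score and two at low scores, the high-quality source can outvote the other two when they conflict, whereas equal starting qualities would let the majority win. This illustrates why estimating source quality before fusion can change the result.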
Keywords/Search Tags: Deep Web Data Integration, Quality Estimation of Deep Web Data Sources, Data Fusion