
Entity Resolution For Web Data Integration

Posted on: 2011-10-27
Degree: Master
Type: Thesis
Country: China
Candidate: Q Kong
Full Text: PDF
GTID: 2178360305451073
Subject: Computer software and theory
Abstract/Summary:
Much current work requires integrating information from heterogeneous data sources into a single database. Data integration involves combining data residing in different sources and providing users with a unified view of these data. This process is significant in a variety of situations, both commercial and scientific, and it arises with increasing frequency as the need to share existing data grows. It has become the focus of extensive theoretical work, and numerous open problems remain unsolved.

As the Internet develops rapidly, the amount of information on the web increases every minute. Acquiring, organizing and storing useful information from the web in a data warehouse for later use, such as data mining and decision making, is therefore highly valuable. Our research on Web Data Integration can be divided into several different but closely related components: Domain Model Construction, Data Source Processing, Data Extraction, Schema Matching, Entity Resolution and User Application Interface Construction.

In this overall framework, Entity Resolution plays an important role. Entity resolution is an intelligent process by which organizations connect disparate data sources in order to understand possible identity matches and non-obvious relationships across multiple data sources. Generally, Entity Resolution deals with the problem that real-world entities are referred to by descriptions (references) that are not always unique identifiers of those entities. Similar references that potentially refer to the same real-world entity must be reconciled before the database can be used efficiently for further processing. For example, data mining techniques can be used to analyze data for decision making only after the data is well prepared.
In data integration, data from different data sources must be reconciled before they are integrated into a data warehouse for further use, such as OLAP or other data services.

Web Data Integration poses new challenges for Entity Resolution, and we propose new solutions for them:

1. Entities have context attributes. Pair-wise methods are used when two entities share some common attributes. These methods compare the values of the common attributes and combine the resulting per-attribute similarity values into an overall similarity. When the common attribute values of two entities are rich, pair-wise methods are effective enough for Entity Resolution. For this case we propose methods based on Decision Trees that learn a hierarchy of the common attributes according to their different importance and weight.

2. In other situations, because of the uncertainty of the Data Extraction and Schema Matching steps that precede Entity Resolution, many attribute values are missing in the data warehouse, so pair-wise methods, which rely on comparing common attribute values to compute the overall similarity, are limited. Because different entities always have relationships between them, we build similarity association graphs based on record-level relationships and compute the similarity values between references using graph-theoretic algorithms.

3. In many applications, entities of multiple types exist in the data warehouse and many of them need reconciliation. If these multi-type entities are reconciled separately, earlier results cannot be used to improve the results of subsequent Entity Resolution. We propose a method that reconciles multi-type entities collectively and achieves better results than separate Entity Resolution.
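The pair-wise comparison described above can be illustrated with a minimal Python sketch. It is not the thesis's method: the function names, the example weights, and the choice of `difflib.SequenceMatcher` as the string similarity are all assumptions made for illustration, and the weights are fixed by hand here instead of being learned with a Decision Tree. The sketch compares only attributes present in both references and returns `None` when there are none, which is the situation where the abstract notes that pair-wise methods reach their limits.

```python
from difflib import SequenceMatcher


def attribute_similarity(a: str, b: str) -> float:
    """String similarity in [0, 1]; SequenceMatcher is an illustrative choice."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def pairwise_similarity(ref_a: dict, ref_b: dict, weights: dict):
    """Weighted average of per-attribute similarities over common attributes.

    `weights` maps attribute name -> positive importance (hypothetical,
    hand-picked values). Returns None when the two references share no
    non-empty attribute, so no pair-wise comparison is possible.
    """
    common = [k for k in weights if ref_a.get(k) and ref_b.get(k)]
    if not common:
        return None
    total_weight = sum(weights[k] for k in common)
    weighted_sum = sum(
        weights[k] * attribute_similarity(ref_a[k], ref_b[k]) for k in common
    )
    return weighted_sum / total_weight


if __name__ == "__main__":
    # Hypothetical references extracted from two web sources.
    r1 = {"name": "Intl. Business Machines", "city": "Armonk"}
    r2 = {"name": "International Business Machines", "city": "Armonk", "url": "ibm.com"}
    w = {"name": 2.0, "city": 1.0, "url": 1.0}
    print(pairwise_similarity(r1, r2, w))
```

A pair whose score exceeds a chosen threshold would be treated as a candidate match; a `None` result signals that other evidence, such as relationships between records, is needed.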
Keywords/Search Tags: web data integration, data warehouse, entity resolution, references, pair-wise, graph, collective