Interactive Data Integration Methods Based On Internet And Crowdsourcing

Posted on:2018-02-24

Degree:Master

Type:Thesis

Country:China

Candidate:B B Gu

Full Text:PDF

GTID:2348330542465256

Subject:Software engineering

Abstract/Summary:

Schema matching and entity matching are two key problems, and necessary steps, in integrating multiple data sources: schema matching unifies the different schemas of heterogeneous data sources, and entity matching detects records that refer to the same real-world entity across those sources. Researchers have studied the two processes thoroughly, but as two independent problems. There is, however, an intrinsic correlation between them: the results of schema matching affect entity matching, and vice versa. Moreover, previous work on schema matching and record (entity) matching generally relies on the data sources themselves, which cannot yield satisfactory matching results because the sources lack relevant domain knowledge. To address these problems, this work proposes interactive data integration methods based on Web knowledge and crowdsourcing. On the one hand, we study the interaction between schema matching and entity matching by performing them alternately, in phases, so that each benefits from the other, which leads to better data integration. On the other hand, we use external knowledge from the Internet and crowdsourcing to assist matching. The main contributions are:

(1) This thesis studies the interaction between schema matching and entity matching in integrating multiple data sources. To this end, we define novel matching rules for schema matching and entity matching respectively, such that every schema matching decision is based on intermediate entity matching results, and vice versa.

(2) This thesis proposes an interactive data integration method that considers both the matching likelihood of attribute pairs and record pairs and the semantic drift problem. We design a probabilistic model based on the sigmoid function to estimate the matching likelihood of attribute pairs and record pairs. To address semantic drift, two methods are proposed to check attribute pairs and record pairs and so guarantee matching quality: one measures how far each matching entity pair deviates from the other matching entity pairs, using the unbiased variance; the other uses cross-validation, in which matching attribute pairs validate each other.

(3) This thesis proposes an interactive data integration method based on the Internet and crowdsourcing for data sources with many missing values. We study how to acquire high-quality data from the Internet and when to use the Internet and crowdsourcing.

(4) To reduce the computational cost, we design an index structure based on q-grams, which reduces the time cost of the interaction by around 90%.
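
The abstract names a q-gram index and a sigmoid-based likelihood model without giving their details. The following is a minimal Python sketch of the general ideas only, not the thesis's actual design: the function names, the `#` padding, the `min_shared` threshold, and the `steepness` and `midpoint` parameters are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def qgrams(text, q=2):
    """Return the set of q-grams of a lowercased string, padded with '#'."""
    padded = "#" * (q - 1) + text.lower() + "#" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def build_index(records, q=2):
    """Inverted index: map each q-gram to the ids of records containing it."""
    index = defaultdict(set)
    for rec_id, value in records.items():
        for gram in qgrams(value, q):
            index[gram].add(rec_id)
    return index


def candidates(query, index, q=2, min_shared=2):
    """Ids of records sharing at least `min_shared` q-grams with `query`.

    Only these candidates need a full comparison, which is where an
    index of this kind saves time (threshold is a hypothetical choice).
    """
    counts = defaultdict(int)
    for gram in qgrams(query, q):
        for rec_id in index.get(gram, ()):
            counts[rec_id] += 1
    return {rec_id for rec_id, n in counts.items() if n >= min_shared}


def match_likelihood(similarity, steepness=10.0, midpoint=0.5):
    """Map a similarity in [0, 1] to a likelihood via a sigmoid.

    `steepness` and `midpoint` are illustrative, not values from the thesis.
    """
    return 1.0 / (1.0 + math.exp(-steepness * (similarity - midpoint)))


records = {1: "Apple iPhone 6", 2: "apple iphone 6s", 3: "Samsung Galaxy"}
index = build_index(records)
shortlist = candidates("Apple iPhone", index, min_shared=5)
```

Here `shortlist` holds only the records that share enough q-grams with the query, so the unrelated record is never compared in full; a similarity score for each remaining pair could then be passed through `match_likelihood`.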

Keywords/Search Tags:

Data Integration, Schema Matching, Entity Matching

Related items

1	Research On Schema Matching Technology Supporting Massive Heterogeneous Data Integration
2	Research On Mutual Enhancement Of Entity Resolution And Schema Matching In Web Information Integration
3	Research On Generating Matching Rules In Entity Matching
4	Research On Technology Of Schema Matching Between Global Schema And Local Schema
5	Domain-oriented Web Data Integration
6	A semantic analysis of XML schema matching for B2B systems integration
7	Research On Technology Of Deep Web Schema Matching
8	An Algorithm Of XML Schema Matching And Its Application In Heterogeneous Information Integration
9	Research And Application For The Semantic Matching Of XML Tags
10	Research On Ontology-based Schema Matching