Interactive Data Integration Methods Based On Internet And Crowdsourcing

Posted on: 2018-02-24, Degree: Master, Type: Thesis
Country: China, Candidate: B B Gu
GTID: 2348330542465256, Subject: Software engineering
Abstract/Summary:
Schema matching and entity matching are two key problems and necessary steps in integrating multiple data sources: schema matching unifies the different schemas of heterogeneous data sources, and entity matching detects records that refer to the same real-world entity across those sources. So far, researchers have studied the two processes thoroughly, but as two independent problems. However, schema matching and entity matching are intrinsically correlated: the results of schema matching affect entity matching, and vice versa. In addition, previous work on schema matching and record matching generally relies on the data sources themselves, which cannot yield satisfactory matching results because the sources lack the relevant domain knowledge.

To address these problems, this work proposes interactive data integration methods based on Web knowledge and crowdsourcing. On the one hand, we study the interaction between schema matching and entity matching by performing them alternately and in phases, so that each can benefit from the other, which in turn leads to better data integration. On the other hand, we use external knowledge from the Internet and from crowdsourcing to improve matching. The main contributions are as follows:

(1) This paper studies the interaction between schema matching and entity matching when integrating multiple data sources. To this end, we define novel matching rules for schema matching and for entity matching, such that every schema matching decision is made on the basis of intermediate entity matching results, and vice versa.

(2) This paper proposes an interactive data integration method that considers both the matching likelihood of attribute-pairs and record-pairs and the semantic drift problem. We design a probabilistic model based on the sigmoid function to estimate the matching likelihood of attribute-pairs and record-pairs. To address semantic drift, we propose two effective methods that check the attribute-pairs and record-pairs in order to guarantee matching quality. The first measures how far each matching entity-pair deviates from the other matching entity-pairs according to the unbiased variance; the second employs cross-validation, using matching attribute-pairs to validate each other.

(3) This paper proposes an interactive data integration method based on the Internet and crowdsourcing for the case where the data sources contain many missing values. We study how to acquire high-quality data from the Internet and when to use the Internet and when to use crowdsourcing.

(4) To reduce the computational cost, we design an index structure based on q-grams, which reduces the time cost of the interaction by around 90%.
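
The abstract does not describe how the q-gram index is built, so the following is only a minimal sketch of the general idea of q-gram indexing for candidate generation, not the thesis's actual structure. The function names (`qgrams`, `build_index`, `candidates`), the padding character `#`, the choice `q=3`, the threshold `min_shared`, and the sample records are all assumptions made for illustration. The sketch maps each q-gram to the records that contain it, so that only records sharing enough q-grams with a query need to be compared in detail.

```python
from collections import defaultdict


def qgrams(text, q=3):
    """Return the set of q-grams of a lowercased string padded with '#'."""
    padded = "#" * (q - 1) + text.lower() + "#" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def build_index(records, q=3):
    """Build an inverted index: q-gram -> ids of records containing it."""
    index = defaultdict(set)
    for rid, value in records.items():
        for gram in qgrams(value, q):
            index[gram].add(rid)
    return index


def candidates(query, index, q=3, min_shared=2):
    """Return ids of records sharing at least min_shared q-grams with query."""
    counts = defaultdict(int)
    for gram in qgrams(query, q):
        for rid in index.get(gram, ()):
            counts[rid] += 1
    return {rid for rid, n in counts.items() if n >= min_shared}


# Hypothetical records, used only to exercise the sketch.
records = {1: "Apple iPhone 6", 2: "Samsung Galaxy S5", 3: "iPhone 6 Apple"}
index = build_index(records)
print(candidates("apple iphone", index))
```

In this sketch the index prunes record 2, which shares no q-grams with the query, so a more expensive similarity computation would only run on the remaining candidates. How much time this saves depends on the data and on the threshold.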
Keywords/Search Tags: Data Integration, Schema Matching, Entity Matching