Research On Similarity Join Processing Based On Entity

Posted on: 2013-04-21
Degree: Master
Type: Thesis
Country: China
Candidate: X L Liu
GTID: 2268330392467996
Subject: Computer technology
Abstract/Summary:
With advances in information production and collection technology, inconsistent, incomplete, outdated, and imprecise data have become prevalent. Traditional relational databases cannot manage or query such poor-quality data effectively, so a new data model and new operations are needed. In a traditional relational database, the most common kind of poor-quality data is multiple tuples that represent the same entity. Organizing these tuples by the entity they represent is an effective way to manage poor-quality data. This thesis gives formal definitions of the entity model and the entity-relationship database, as well as of the similarity join operation in the entity database.

Similarity joins have a wide range of applications in data cleaning, information integration, fuzzy keyword search, fraud detection, and many other fields. Because an entity carries multiple values per attribute, an entity similarity join captures semantic information that a purely syntactic string similarity join does not, which makes query results more accurate and complete. This thesis studies the similarity join in the entity database within the existing "filter-and-verify" framework and proposes an entity similarity join algorithm named ES-JOIN. ES-JOIN uses a double index and probability-based filtering to solve the entity similarity join problem efficiently.

Because the order of a multi-table join has an important impact on join efficiency, this thesis further studies join order selection for multi-table joins in the entity database. It uses an entity-based Markov chain Monte Carlo (MCMC) method to estimate the size of an entity similarity join and proposes a cost model for optimizing the join order over multiple entity relations. This work addresses the management and querying of poor-quality data efficiently and has practical significance.
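To make the "filter-and-verify" idea concrete, the following is a minimal sketch in Python. It is not the ES-JOIN algorithm: the abstract does not specify its double index or its probability-based filter. Everything here is an assumption made for illustration: entities are modeled as dictionaries mapping an attribute name to a set of values, similarity is the average per-attribute Jaccard score, and the filter is a plain inverted index on (attribute, value) pairs. The function and variable names are hypothetical.

```python
from collections import defaultdict


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def entity_similarity(e1, e2):
    """Assumed measure: mean Jaccard score over attributes present in both entities."""
    shared = e1.keys() & e2.keys()
    if not shared:
        return 0.0
    return sum(jaccard(e1[a], e2[a]) for a in shared) / len(shared)


def similarity_join(left, right, threshold):
    """Return (left_id, right_id, score) for pairs with score >= threshold.

    Assumes threshold > 0, so any qualifying pair must share at least one
    (attribute, value) pair and therefore survives the filter step.
    """
    # Filter step: inverted index from (attribute, value) to right-side ids.
    index = defaultdict(set)
    for rid, entity in right.items():
        for attr, values in entity.items():
            for value in values:
                index[(attr, value)].add(rid)

    results = []
    for lid, entity in left.items():
        # Candidates are right-side entities sharing at least one value.
        candidates = set()
        for attr, values in entity.items():
            for value in values:
                candidates |= index.get((attr, value), set())
        # Verify step: compute the exact similarity for each candidate only.
        for rid in sorted(candidates):
            score = entity_similarity(entity, right[rid])
            if score >= threshold:
                results.append((lid, rid, score))
    return results


# Illustrative data only; each entity merges several tuples into value sets.
left = {"L1": {"name": {"J. Smith", "John Smith"}, "city": {"Boston"}}}
right = {
    "R1": {"name": {"J. Smith", "John Smith"}, "city": {"Boston", "NYC"}},
    "R2": {"name": {"Jane Doe"}, "city": {"Boston"}},
    "R3": {"name": {"A. Jones"}, "city": {"Austin"}},
}
print(similarity_join(left, right, 0.6))
```

In this toy data, R3 shares no value with L1 and is never scored, R2 passes the filter but fails verification, and only R1 is returned. A real filter would prune far more aggressively than this one.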
Keywords/Search Tags: data quality, Monte Carlo, entity, similarity join, join size, entity database