
Research On Key Technologies Of Record Match With Token

Posted on: 2013-05-24
Degree: Master
Type: Thesis
Country: China
Candidate: X Bian
Full Text: PDF
GTID: 2268330392967970
Subject: Computer Science and Technology
Abstract/Summary:
In applications such as information integration, different records that refer to the same entity are often generated. The differences between these records arise from different descriptions in the same attribute. There are many causes of such differing descriptions, e.g., transcription errors, lack of standard formats, incomplete information, and different ways of describing the same concept. The same causes that produce the differences in an attribute also leave the values in that attribute similar, and the similarity is often very high.

To identify the different records referring to the same entity, a framework of record matching is proposed. Blocking, comparison and decision are the main parts of the framework. However, previous research on record matching does not consider the influence of information. In this paper, we discuss the influence of Token information on record matching, and present the concepts of Block-Attribute, Comparison-Attribute, Class-Attribute and so on.

After an entropy analysis of the Block-Attribute, we present an approach to determine the Block-Attribute using entropy and propose an algorithm to generate record pairs based on the Block-Attribute. For comparison, we analyse the information in the Token, then propose an algorithm to compute the similarity in one attribute, and an algorithm to build a vector for each record pair based on similarity. We investigate the effect of the Class-Attribute on the decision process, and propose an efficient distance-based decision algorithm.

Finally, we show through experiments that our algorithms are feasible and efficient.
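The blocking step described above can be illustrated with a minimal sketch. The abstract does not state how entropy is used to choose the Block-Attribute or how record pairs are generated, so everything here is an assumption for illustration: the function names `attribute_entropy` and `candidate_pairs`, the sample records, and the choice of grouping records by exact attribute value are all hypothetical, not the thesis's algorithms.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def attribute_entropy(records, attr):
    """Shannon entropy (in bits) of the value distribution of one attribute."""
    counts = Counter(r[attr] for r in records)
    n = len(records)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def candidate_pairs(records, attr):
    """Group records by their exact value in `attr` (assumed blocking rule)
    and return index pairs of records that share a block."""
    blocks = defaultdict(list)
    for idx, r in enumerate(records):
        blocks[r[attr]].append(idx)
    pairs = []
    for ids in blocks.values():
        pairs.extend(combinations(ids, 2))
    return pairs


# Hypothetical sample data, not taken from the thesis.
records = [
    {"name": "J. Smith", "city": "Boston"},
    {"name": "John Smith", "city": "Boston"},
    {"name": "A. Jones", "city": "Austin"},
    {"name": "Ann Jones", "city": "Austin"},
]

for attr in ("name", "city"):
    print(attr, attribute_entropy(records, attr))

print(candidate_pairs(records, "city"))
```

In this toy data, `name` has four distinct values and therefore higher entropy than `city`, which has two. Blocking on `city` compares only records within the same city instead of all six possible pairs. Which entropy level makes a good Block-Attribute is a design decision the abstract leaves open.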
Keywords/Search Tags: Record Matching, Record Linkage, Data Integration, Data Cleaning