
Research of Entity Resolution on Uncertain and Temporal Data

Posted on: 2015-06-07 | Degree: Master | Type: Thesis
Country: China | Candidate: H B Feng | Full Text: PDF
GTID: 2298330422990879 | Subject: Computer Science and Technology
Abstract/Summary:
In data management, data quality is one of the most important issues. Traditional DBMSs typically focus on the quantity of data, i.e., on the creation, maintenance, and retrieval of large volumes of data. However, real-life data is often dirty: inconsistent, duplicated, inaccurate, incomplete, or stale. Dirty data in a database routinely causes serious consequences in many aspects of society. There is therefore strong demand for data quality management, and the field has a promising future in today's information society. Entity resolution aims to detect different representations of the same entity across different data sources. It helps solve the data quality problems above and plays a fundamental role in data quality management. This thesis discusses the problem of entity resolution on dirty data and proposes a series of algorithms for resolving entities in datasets that are inaccurate, stale, or incomplete.

To the best of our knowledge, this is the first work to pose the problem of entity resolution on uncertain data. We give a probability-based similarity metric together with several corresponding similarity join and clustering algorithms. In the similarity join algorithms, we integrate the prefix filtering principle to significantly reduce the computational space. Experiments show that the proposed techniques perform well in both efficiency and scalability.

For entity resolution on temporal data without available timestamps, we propose a set of rule-based algorithms; this is the first work to address this problem. We first apply data currency rules to determine the relative currency order of temporal data. Then, based on that order, we define the unstableness property of the attributes of temporal tuples to model how temporal data evolves. Incorporating the unstableness property improves the accuracy of the pairwise similarity join. In addition, we propose a temporal clustering algorithm and a corresponding optimization algorithm. Experiments show that the proposed algorithms successfully resolve entities in temporal data without available timestamps.
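For readers unfamiliar with the prefix filtering principle mentioned in the abstract, the following is a minimal sketch of how it prunes candidate pairs in a set-similarity join. It is not the thesis's algorithm: it assumes deterministic token sets and plain Jaccard similarity rather than the probability-based metric on uncertain data, and the names `jaccard` and `prefix_filter_join` are hypothetical, introduced only for illustration.

```python
from collections import Counter, defaultdict
from math import ceil


def jaccard(a, b):
    """Jaccard similarity of two sets (two empty sets count as identical)."""
    return len(a & b) / len(a | b) if (a or b) else 1.0


def prefix_filter_join(records, threshold):
    """Return sorted (i, j, sim) triples, i < j, with Jaccard(i, j) >= threshold.

    Assumption: records are plain token sets and 0 < threshold <= 1.
    """
    # Global token order: rare tokens first, ties broken alphabetically,
    # so that prefixes contain the most selective tokens.
    freq = Counter(tok for rec in records for tok in set(rec))
    ordered = [sorted(set(rec), key=lambda t: (freq[t], t)) for rec in records]

    index = defaultdict(list)  # token -> ids of earlier records with it in their prefix
    results = []
    for i, toks in enumerate(ordered):
        if not toks:
            continue
        # A pair reaching the threshold must overlap in at least
        # ceil(threshold * |x|) tokens, so the two prefixes of length
        # |x| - ceil(threshold * |x|) + 1 must share a token.
        # The small epsilon guards against floating-point round-up.
        min_overlap = ceil(threshold * len(toks) - 1e-9)
        prefix = toks[: len(toks) - min_overlap + 1]

        candidates = set()
        for tok in prefix:
            candidates.update(index[tok])
            index[tok].append(i)

        # Verification step: compute the exact similarity only for candidates.
        for j in sorted(candidates):
            sim = jaccard(set(toks), set(ordered[j]))
            if sim >= threshold:
                results.append((j, i, sim))
    return sorted(results)


if __name__ == "__main__":
    recs = [
        {"entity", "resolution", "uncertain", "data"},
        {"entity", "resolution", "temporal", "data"},
        {"prefix", "filtering", "join"},
    ]
    for i, j, sim in prefix_filter_join(recs, 0.5):
        print(i, j, round(sim, 2))
```

Only records that share a token in their short prefixes are compared exactly, which is the sense in which the computational space is compressed; pairs that never become candidates are provably below the threshold under the assumptions above.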
Keywords/Search Tags:data quality, data management, entity resolution, dirty data