
Evaluating Join Queries With Real-time Entity Resolution

Posted on: 2021-03-07
Degree: Master
Type: Thesis
Country: China
Candidate: Y Cheng
Full Text: PDF
GTID: 2428330647455115
Subject: Computer Science and Technology
Abstract/Summary:
Join query techniques often incur a high time cost, and as the amount of data increases, this overhead makes them impractical. Moreover, on dirty datasets that contain many duplicate tuples, these techniques return duplicate tuples for a query, which leads to poor effectiveness. In this paper, a partitioning method transforms the scanning and resolving of the whole dataset into operations on a small number of blocks, and join queries are then evaluated with real-time entity resolution (ER) to address the low efficiency and effectiveness of traditional join queries.

This paper focuses on evaluating Top-N join queries with real-time ER and defines the Top-N join query model using equivalence relations and equivalence classes. For a dataset R composed of R1, R2, ..., Rs and a query point Q = (Q1, Q2, ..., Qs), each Ri (i = 1, 2, ..., s) is divided into several disjoint blocks by an index based on a divide-and-conquer mechanism. An algorithm probes the block nearest to Qi and its sorted list. The tuple nearest to Qi is found by binary search, the distance between the value of the sorting attribute and the component of Qi is calculated, the query range is determined, and the Top-Ki tuples of Qi are then searched one by one in the blocks within the query range. When a Top-Ki tuple is pushed from Ri, it is joined with the corresponding tuples in a buffer, and the Top-N query result of Q on R is determined by double threshold methods. In this way, the search space and the time cost of a join query are reduced, and query evaluation performance improves.

To remove duplicate tuples from query results, two methods for processing Top-N join queries with real-time ER are proposed. One method integrates ER with the processing of a Top-N join query over dirty datasets on the fly, removing duplicate tuples from the query results as they are produced. The other first performs real-time ER with the index on the dirty datasets to obtain the corresponding clean datasets, which contain clusters of duplicate tuples, and then evaluates a Top-N join query over the clean datasets, employing outer join operations on the clusters to obtain the joined tuples and the Top-N joined results.

In addition, for dirty datasets with duplicate tuples, models and methods for point join queries and range join queries with real-time ER are given by modifying and optimizing those for Top-N join queries. Using the index, the point query methods locate the nearest block for each Qi of a point query Q = (Q1, Q2, ..., Qs) and find the tuples equal to Qi in that block. The range query methods find the blocks that intersect or contain the query range and find the tuples within the query range by binary search in those blocks. Real-time ER is combined with the processing of a point join query or a range join query to resolve duplicate tuples in the query results.

For the processing methods of these three types of join query with ER, extensive experiments are conducted on dirty datasets with 2, 3, 5 and 10 dimensions, using three distance functions (Manhattan distance, Euclidean distance, and Maximum distance), to test the performance of join queries over two, three and four joined datasets. The experimental results show that the methods are efficient and effective.
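
The block probing step and the three distance functions named in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the class name BlockIndex, the fixed block size, and the one-dimensional sorted layout are assumptions made for illustration, and the double threshold method, the join buffer, and the ER step are not reproduced. The sketch only shows how a sorted attribute split into disjoint blocks lets a query component locate its nearest block, find the nearest value by binary search, and then collect the k nearest values one by one.

```python
import bisect
import math


def manhattan(p, q):
    """Manhattan (L1) distance between two points of equal dimension."""
    return sum(abs(a - b) for a, b in zip(p, q))


def euclidean(p, q):
    """Euclidean (L2) distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def maximum(p, q):
    """Maximum (Chebyshev) distance: largest per-coordinate difference."""
    return max(abs(a - b) for a, b in zip(p, q))


class BlockIndex:
    """Hypothetical 1-D index: sorted values split into disjoint blocks.

    An illustrative stand-in, not the index described in the thesis.
    """

    def __init__(self, values, block_size):
        if block_size < 1:
            raise ValueError("block_size must be at least 1")
        self.vals = sorted(values)
        self.block_size = block_size
        # Start offset and smallest value of every block.
        self.starts = list(range(0, len(self.vals), block_size))
        self.mins = [self.vals[s] for s in self.starts]

    def top_k(self, q, k):
        """Return the k values nearest to q, nearest first."""
        if not self.vals:
            return []
        # Locate the block whose value range is nearest to q.
        b = max(bisect.bisect_right(self.mins, q) - 1, 0)
        lo = self.starts[b]
        hi = min(lo + self.block_size, len(self.vals))
        # Binary search inside that block only.
        pos = bisect.bisect_left(self.vals, q, lo, hi)
        # Expand outwards one value at a time, crossing block borders
        # only when the current block is exhausted on that side.
        left, right = pos - 1, pos
        result = []
        while len(result) < k and (left >= 0 or right < len(self.vals)):
            d_left = q - self.vals[left] if left >= 0 else math.inf
            d_right = (self.vals[right] - q
                       if right < len(self.vals) else math.inf)
            if d_left <= d_right:
                result.append(self.vals[left])
                left -= 1
            else:
                result.append(self.vals[right])
                right += 1
        return result
```

In this sketch a call such as BlockIndex(values, 3).top_k(6, 3) touches only the block around the query value and its immediate neighbours instead of scanning every value, which is the kind of search space reduction the abstract describes. Ties between a left and a right candidate are broken in favour of the left one, an arbitrary choice made here.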
Keywords/Search Tags:Join query processing, Entity resolution, d-dimensional data space, Algorithm