Evaluating Join Queries With Real-time Entity Resolution

Posted on: 2021-03-07

Degree: Master

Type: Thesis

Country: China

Candidate: Y Cheng

GTID: 2428330647455115

Subject: Computer Science and Technology

Abstract/Summary:

Join query techniques often incur a high time cost, and as the amount of data increases this overhead makes them impractical. Moreover, on dirty datasets with a large number of duplicate tuples, these techniques return duplicate tuples for a query, which leads to poor effectiveness. In this thesis, a partitioning method replaces scanning and resolving the whole dataset with operations on a subset of regional blocks, and methods for evaluating join queries with real-time entity resolution (ER) are then used to address the low efficiency and effectiveness of traditional join queries.

The thesis focuses on evaluating Top-N join queries with real-time ER and defines the Top-N join query model using equivalence relations and equivalence classes. For a dataset R composed of R₁, R₂, ..., R_s and a query point Q = (Q₁, Q₂, ..., Q_s), each R_i (i = 1, 2, ..., s) is divided into several disjoint blocks by an index based on a divide-and-conquer mechanism. An algorithm probes for the block and the sorted list nearest to Q_i. The tuple nearest to Q_i is found by binary search, the distance between the value of the sorting attribute and the corresponding component of Q_i is calculated to determine the query range, and the Top-K_i tuples of Q_i are then searched for one by one in the blocks within the query range. When a Top-K_i tuple is pushed from R_i, it is joined with the corresponding tuples in a buffer, and the Top-N query result of Q on R is determined by double threshold methods. In this way, the search space and the time cost of a join query are reduced, which improves query evaluation performance.

To remove duplicate tuples from the results of a query, two methods for processing Top-N join queries with real-time ER are proposed. The first integrates ER with the processing of a Top-N join query over dirty datasets on the fly, so that duplicate tuples are removed from the query results while the query is being evaluated. The second first performs real-time ER with the index on the dirty datasets to obtain the corresponding clean datasets, which contain clusters of duplicate tuples, and then evaluates a Top-N join query over the clean datasets by applying outer join operations to the clusters to obtain the joined tuples and the Top-N joined results.

In addition, for dirty datasets with duplicate tuples, models and methods for the point join query and the range join query with real-time ER are obtained by modifying and optimizing those of the Top-N join query. Using the index, the point query methods locate the nearest block for each Q_i of a point query Q = (Q₁, Q₂, ..., Q_s) and find the tuples equal to Q_i in that block. The range query methods instead find the blocks that intersect or contain the query range and then find the tuples inside the query range through binary search within those blocks. Real-time ER is combined with the processing of a point join query or a range join query to resolve duplicate tuples in the results of the query.

For the processing methods of these three types of join query with ER, extensive experiments are conducted on dirty datasets with 2, 3, 5 and 10 dimensions, using three different distance functions (Manhattan distance, Euclidean distance and maximum distance), to test the performance of join queries over two, three and four joined datasets. The experimental results show that the methods are efficient and effective.
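As a rough illustration of the search step described in the abstract, the following Python sketch finds the nearest tuple in one sorted block by binary search and then collects the k closest tuples by expanding outward from it. It is a minimal sketch under stated assumptions, not the thesis's algorithm: the function names (`nearest_index`, `top_k_in_block`), the `sort_dim` parameter, and the stopping rule are all hypothetical. The stopping rule relies only on the fact that, for each of the three distance functions named in the abstract, the distance between two tuples is at least the gap between them on any single attribute. Block partitioning, joining, the double threshold methods and ER are not shown.

```python
import heapq
import math
from bisect import bisect_left


def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def maximum(p, q):
    return max(abs(a - b) for a, b in zip(p, q))


def nearest_index(sorted_keys, target):
    """Binary search for the index of the key closest to target.

    sorted_keys must be non-empty and ascending; ties go to the smaller key.
    """
    pos = bisect_left(sorted_keys, target)
    if pos == 0:
        return 0
    if pos == len(sorted_keys):
        return len(sorted_keys) - 1
    before, after = sorted_keys[pos - 1], sorted_keys[pos]
    return pos - 1 if target - before <= after - target else pos


def top_k_in_block(block, q, k, dist=euclidean, sort_dim=0):
    """Return the k tuples of block closest to q as (distance, tuple) pairs.

    Assumption: block is a list of numeric tuples sorted ascending on the
    attribute sort_dim (a hypothetical stand-in for the sorting attribute).
    """
    if not block or k <= 0:
        return []
    keys = [t[sort_dim] for t in block]
    start = nearest_index(keys, q[sort_dim])
    lo, hi = start, start + 1      # next candidate on the left / on the right
    best = []                      # max-heap of (-distance, tuple), size <= k
    while lo >= 0 or hi < len(block):
        gap_lo = abs(q[sort_dim] - keys[lo]) if lo >= 0 else math.inf
        gap_hi = abs(keys[hi] - q[sort_dim]) if hi < len(block) else math.inf
        # Visit candidates in increasing order of their sort-attribute gap.
        if gap_lo <= gap_hi:
            idx, gap = lo, gap_lo
            lo -= 1
        else:
            idx, gap = hi, gap_hi
            hi += 1
        # Every remaining tuple is at least `gap` away, so none can beat
        # the current k-th best distance once the gap exceeds it.
        if len(best) == k and gap > -best[0][0]:
            break
        d = dist(block[idx], q)
        if len(best) < k:
            heapq.heappush(best, (-d, block[idx]))
        elif d < -best[0][0]:
            heapq.heapreplace(best, (-d, block[idx]))
    return sorted((-neg_d, t) for neg_d, t in best)
```

For example, `top_k_in_block(block, q, 2, dist=manhattan)` would return the two tuples of a sorted `block` that are closest to `q` under Manhattan distance, typically without computing the distance to every tuple in the block.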

Keywords/Search Tags:

Join query processing, Entity resolution, d-dimensional data space, Algorithm

Related items

1	Real-time Entity Resolution And Query Processing Based On Region-tree Indexing
2	Lav In Data Integration System Query Processing
3	Research Of High-Dimensional Space Join And Query Algorithms Based On Main-Memory
4	Entity Resolution Technology Research Based On Multi-Source Data
5	Research On Similarity Join Processing Based On Entity
6	Temporal Join Processing with Hilbert Curve Space Mapping
7	Research On Key Technologies Of Entity Resolution For Structured Data
8	Research Of High-Dimensional Space Query Algorithm Based On Space-Filling Curves
9	Multi-Join Query Algorithm Research Over Data Streams
10	Research On Data Query Optimization Algorithm Of Distributed Database