Real-time Entity Resolution And Query Processing Based On Region-tree Indexing

Posted on: 2020-10-11
Degree: Master
Type: Thesis
Country: China
Candidate: W P Lu
GTID: 2428330596985213
Subject: Computer Science and Technology
Abstract/Summary:
Entity resolution and query processing are two important research topics in the database and big data fields. Entity resolution is the process of identifying the tuples in a data set that describe the same real-world entity and merging them into a cluster or a single tuple. For big data and low-quality data sets that contain duplicate tuples, traditional query processing methods may be inefficient or even invalid if entity resolution is not taken into account. Traditional entity resolution techniques cannot be applied directly to query processing, and how to make entity resolution methods both effective and scalable remains an open question. A new processing method is therefore needed that can remove duplicate tuples in real time and process queries quickly. To this end, this dissertation builds a region-tree index and then presents techniques for real-time entity resolution and methods for query processing based on this index.

The region-tree index is constructed to perform real-time entity resolution on data sets in the n-dimensional data space R^n. In an n-dimensional data set, each tuple is an n-dimensional real vector. Duplicate tuples in an n-dimensional dirty data set are identified and clustered as follows. First, a partitioning algorithm, PRC, is introduced that dynamically divides the minimum region containing the data set, partitioning one region into several disjoint sub-regions at each step. Second, the region-tree index is constructed over the n-dimensional region space during the partitioning process. Finally, a divide-and-conquer mechanism is employed to perform real-time entity resolution efficiently: a large data set is partitioned into several smaller data sets, and the smaller data sets are then resolved one by one using the region-tree index. This method not only reduces the hardware requirements but also makes entity resolution fast and effective.

Using the region-tree index and the results of entity resolution, algorithms for point query, region query, and KNN query processing are provided; the query results are clusters or their representatives. For a point query, the region-tree index is used to quickly find the leaf node in which the query point is located, and binary search then finds the position of the query point and its result in the linked list of that leaf node. For a region query, the leaf nodes of the region-tree index are compared with the query region to find those that intersect or contain it. Binary search in the linked lists of these nodes, based on the attribute values in the lists, yields the tuple nearest to the center of the region, and the other nodes within a certain threshold are then scanned sequentially to retrieve the tuples in the query region. For a KNN query, the point-query and region-query algorithms are combined with a dynamically determined radius for the query region to return the K nearest neighbor tuples.

Extensive experiments on fifteen dirty data sets, with dimensionality n ranging from 2 to 784, are conducted to evaluate the proposed methods of real-time entity resolution and query processing. The dirty data sets are derived from clean data sets of different cardinalities, dimensions, and distributions. The experimental results show that the region-tree index and the algorithms in this dissertation can resolve entities in real time, in under a second, and process queries efficiently and effectively.
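The point-query idea described in the abstract can be illustrated with a minimal Python sketch. It is not the dissertation's PRC algorithm or its actual index: the splitting rule (halving a region along one axis in turn), the `capacity` parameter, the names `Node`, `build`, and `point_query`, and the treatment of only exact duplicates as the same entity are all assumptions made for illustration. A sorted Python list stands in for the leaf node's linked list.

```python
from bisect import bisect_left
from dataclasses import dataclass, field


@dataclass
class Node:
    lo: tuple                                      # lower corner of the region (inclusive)
    hi: tuple                                      # upper corner of the region (exclusive)
    children: list = field(default_factory=list)   # disjoint sub-regions (internal node)
    items: list = field(default_factory=list)      # sorted representatives (leaf node)

    def contains(self, p):
        return all(l <= x < h for l, x, h in zip(self.lo, p, self.hi))


def build(points, lo, hi, capacity=2, depth=0):
    """Recursively partition the region [lo, hi) into disjoint sub-regions.

    Assumption: every point lies inside [lo, hi), and exact duplicates are
    the only duplicates, so each distinct tuple is its own representative.
    """
    node = Node(lo, hi)
    distinct = sorted(set(points))                 # collapse exact duplicates
    if len(distinct) <= capacity:
        node.items = distinct                      # leaf: keep representatives sorted
        return node
    axis = depth % len(lo)                         # hypothetical rule: cycle through axes
    mid = (lo[axis] + hi[axis]) / 2
    left_hi = hi[:axis] + (mid,) + hi[axis + 1:]
    right_lo = lo[:axis] + (mid,) + lo[axis + 1:]
    left = [p for p in distinct if p[axis] < mid]
    right = [p for p in distinct if p[axis] >= mid]
    node.children = [
        build(left, lo, left_hi, capacity, depth + 1),
        build(right, right_lo, hi, capacity, depth + 1),
    ]
    return node


def point_query(root, q):
    """Descend to the leaf whose region contains q, then binary-search it."""
    if not root.contains(q):
        return None
    node = root
    while node.children:
        # Sub-regions are disjoint and cover the parent, so exactly one matches.
        node = next(c for c in node.children if c.contains(q))
    i = bisect_left(node.items, q)
    if i < len(node.items) and node.items[i] == q:
        return node.items[i]
    return None


points = [(1.0, 1.0), (1.0, 1.0), (3.0, 2.0), (6.0, 7.0), (6.5, 1.5)]
root = build(points, (0.0, 0.0), (8.0, 8.0))
print(point_query(root, (3.0, 2.0)))
print(point_query(root, (4.0, 4.0)))
```

Because the sub-regions of a node are disjoint, a query descends along a single path, and the binary search touches only one leaf. A region or KNN query in this sketch would instead have to visit every leaf whose region intersects the query region.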
Keywords/Search Tags:N-dimensional data space, Dirty data, Region-tree Index, Entity resolution, Query processing