Real-time Entity Resolution And Keyword Query By Multiple Indices

Posted on: 2021-04-26

Degree: Master

Type: Thesis

Country: China

Candidate: R D Cui

GTID: 2428330620970572

Subject: Computer Science and Technology

Abstract/Summary:

Traditional keyword top-N query techniques mostly assume a clean dataset, so they are hard to apply directly to dirty data. Queries against a dirty dataset containing misspellings, null values, or duplicates may not return reliable results, which can undermine the correctness of decision-making analysis and even lead to wrong conclusions. Traditional entity resolution (ER) techniques identify and merge the duplicate records in a dirty dataset to obtain a corresponding clean dataset. However, these techniques are time-consuming and difficult to integrate with query algorithms; it is therefore necessary to design real-time ER algorithms that use blocking methods to resolve a query record in sub-second time. This dissertation studies ER and keyword top-N query techniques for three types of dirty data: duplicates, misspellings, and null values. The main work includes:

(1) Multiple indices are created over multiple attributes. Each index partitions the dataset by the values of a different attribute, and a global index is then formed to generate candidate tuples. Different structures, such as hash tables, skip lists, and B+-trees, are used to build the indices.

(2) An ER ranking function and ER algorithms are designed for the indices. The ranking function is based on the edit distance, the number of identical attribute values between tuples, the length of each attribute value, and other factors; it determines whether two tuples refer to the same entity during ER of a dirty dataset. The ER algorithms block the dataset to reduce the number of candidate tuples and avoid unnecessary calculations, which reduces the elapsed time of ER so that a query record can be resolved in sub-second time.

(3) Based on the multiple indices, two methods are presented for processing keyword top-N queries against a dirty dataset: one uses the result of ER, while the other integrates ER on the fly. A corresponding ranking function is designed from multiple factors, such as the number of matching attributes, the importance of each attribute, and the number of matching terms, and the keyword query results are sorted by it.

A variety of dirty datasets are synthesized from real datasets for the experiments, varying in size and in their duplicates, misspellings, or null values. Extensive experiments are conducted to evaluate the effectiveness and efficiency of the proposed ER and keyword top-N query methods on these dirty datasets.
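
The blocking idea in items (1) and (2) can be illustrated with a minimal Python sketch. It is not the dissertation's algorithm or ranking function: it assumes one hash index per attribute (a plain dictionary standing in for the hash table, skip list, or B+-tree structures mentioned), takes the union of matching blocks as the candidate set, and scores candidates with a simple averaged edit-distance similarity. The function names, the sample records, and the 0.8 threshold are all hypothetical choices made for illustration.

```python
from collections import defaultdict

def edit_distance(a, b):
    """Levenshtein distance via the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def build_indices(records, attrs):
    """One hash index per attribute: value -> set of record ids (nulls skipped)."""
    indices = {a: defaultdict(set) for a in attrs}
    for rid, rec in enumerate(records):
        for a in attrs:
            value = rec.get(a)
            if value:
                indices[a][value].add(rid)
    return indices

def candidates(indices, query):
    """Union of the blocks that share at least one attribute value with the query."""
    ids = set()
    for a, index in indices.items():
        value = query.get(a)
        if value and value in index:
            ids |= index[value]
    return ids

def similarity(rec, query, attrs):
    """Assumed score: mean normalized edit similarity over attributes set in both."""
    total, used = 0.0, 0
    for a in attrs:
        x, y = rec.get(a), query.get(a)
        if not x or not y:
            continue  # a null value contributes no evidence either way
        used += 1
        total += 1.0 - edit_distance(x, y) / max(len(x), len(y))
    return total / used if used else 0.0

def resolve(records, indices, query, attrs, threshold=0.8):
    """Score only the blocked candidates; return (score, record id), best first."""
    scored = [(similarity(records[i], query, attrs), i)
              for i in candidates(indices, query)]
    return sorted((pair for pair in scored if pair[0] >= threshold), reverse=True)

# Hypothetical dirty dataset: record 1 has a misspelled name and a null phone.
attrs = ["name", "city", "phone"]
records = [
    {"name": "John Smith", "city": "Boston", "phone": "5551234"},
    {"name": "Jon Smith", "city": "Boston", "phone": None},
    {"name": "Mary Jones", "city": "Austin", "phone": "5559876"},
]
indices = build_indices(records, attrs)
query = {"name": "John Smith", "city": "Boston", "phone": "5551234"}
matches = resolve(records, indices, query, attrs)
```

In this sketch the query is compared only against records 0 and 1, which share the city block with it; record 2 is never scored. That pruning is the point of blocking: the cost of resolving a query record depends on the size of its blocks rather than on the size of the whole dataset.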

Keywords/Search Tags:

Entity resolution, Keyword query, Ranking function, Multiple indices

Related items

1	Keyword Search Based On Real-time Entity Resolution
2	Research On Key Technology Of Entity-based XML Keyword Search Processing
3	Keyword Query For RDF Data Based On Query Translation
4	Research On Knowledge Base Question Answering Methods Based On Query Graph Ranking
5	Research Of Entity Ranking Algorithm Based On Skyline Query
6	Research On Keyword Query In Personal Dataspace Management System
7	Evaluating Join Queries With Real-time Entity Resolution
8	Research On Keyword Query Over Relational Databases
9	Real-time Entity Resolution And Query Processing Based On Region-tree Indexing
10	Encrypted Database Fast Keyword Query Technology