Real-time Entity Resolution And Keyword Query By Multiple Indices

Posted on: 2021-04-26
Degree: Master
Type: Thesis
Country: China
Candidate: R D Cui
GTID: 2428330620970572
Subject: Computer Science and Technology

Abstract/Summary:
Traditional keyword top-N query techniques mostly assume a clean dataset, so they are hard to apply directly to dirty data. Queries against a dirty dataset containing misspellings, null values, or duplicates may not return reliable results, which can undermine decision-making analysis and even lead to wrong conclusions. Traditional entity resolution (ER) techniques identify and merge the duplicate records in a dirty dataset to obtain a corresponding clean dataset. However, these techniques are time-consuming and difficult to integrate with query algorithms; it is therefore necessary to design real-time ER algorithms that use blocking methods to resolve a query record in sub-second time. This dissertation studies ER and keyword top-N query techniques for three types of dirty data: duplicates, misspellings, and null values. The main work includes:

(1) Multiple indices are created over multiple attributes. Each index partitions the dataset by the values of a different attribute, and a global index is then formed to generate candidate tuples. Different structures, such as hash tables, skip lists, and B+-trees, are used to build the indices.

(2) An ER ranking function and ER algorithms are designed for the indices. The ranking function is based on the edit distance, the number of identical attribute values between two tuples, the length of each attribute value, and other factors; it determines whether two tuples refer to the same entity during ER on a dirty dataset. The ER algorithms block the dataset to reduce the number of candidate tuples and avoid unnecessary calculations, which reduces the elapsed time of ER. As a result, a query record can be resolved in sub-second time.

(3) Based on the multiple indices, two methods are presented for processing keyword top-N queries against a dirty dataset: one uses the result of ER, while the other integrates with ER on the fly. A corresponding ranking function is designed from multiple factors, such as the number of matching attributes, the importance of each attribute, and the number of matching terms, and the keyword query results are sorted by this ranking function.

A variety of dirty datasets are synthesized from real datasets for the experiments, varying in size and in the presence of duplicates, misspellings, and null values. Extensive experiments are conducted to evaluate the effectiveness and efficiency of the proposed ER and keyword top-N query methods on these dirty datasets.
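
As a rough illustration of the blocking-then-ranking idea summarized above, the following Python sketch builds one hash index per attribute, uses the indices to collect candidate tuples for a query record, and scores the candidates with a normalized edit-distance similarity. It is not the dissertation's algorithm: the function names, the scoring formula, the threshold, and the sample records are all assumptions made for illustration. The dissertation's ranking function also weighs factors such as the number of identical attribute values and attribute lengths, which are omitted here.

```python
from collections import defaultdict
import heapq


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling one-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def build_indices(records, attributes):
    """One hash index per attribute: attribute -> value -> set of record ids."""
    indices = {attr: defaultdict(set) for attr in attributes}
    for rid, rec in enumerate(records):
        for attr in attributes:
            value = rec.get(attr)
            if value:  # null values are not indexed (assumption)
                indices[attr][value.lower()].add(rid)
    return indices


def candidates(query, indices):
    """Blocking step: ids of records sharing at least one exact attribute value."""
    found = set()
    for attr, index in indices.items():
        value = query.get(attr)
        if value:
            found |= index.get(value.lower(), set())
    return found


def similarity(query, rec, attributes):
    """Hypothetical score: mean normalized edit similarity over non-null attributes."""
    scores = []
    for attr in attributes:
        a, b = query.get(attr), rec.get(attr)
        if a and b:
            a, b = a.lower(), b.lower()
            scores.append(1 - edit_distance(a, b) / max(len(a), len(b)))
    return sum(scores) / len(scores) if scores else 0.0


def resolve(query, records, indices, attributes, threshold=0.8, top_n=3):
    """Return up to top_n (score, record id) pairs at or above the threshold, best first."""
    scored = (
        (similarity(query, records[rid], attributes), rid)
        for rid in candidates(query, indices)
    )
    return heapq.nlargest(top_n, [(s, rid) for s, rid in scored if s >= threshold])


# Made-up sample data containing a misspelled duplicate and a null value.
ATTRS = ["name", "city", "phone"]
RECORDS = [
    {"name": "John Smith", "city": "Boston", "phone": "555-0101"},
    {"name": "Jon Smith", "city": "Boston", "phone": None},
    {"name": "Mary Jones", "city": "Denver", "phone": "555-0199"},
]
INDICES = build_indices(RECORDS, ATTRS)
QUERY = {"name": "John Smith", "city": "Boston", "phone": "555-0101"}
MATCHES = resolve(QUERY, RECORDS, INDICES, ATTRS)
```

In this sketch the misspelled record is still reached because it shares the city value with the query. Only the candidates pulled from the indices are compared with the edit distance, which is where blocking saves work relative to comparing the query against every record.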
Keywords/Search Tags: Entity resolution, Keyword query, Ranking function, Multiple indices