Keyword Search Based On Real-time Entity Resolution

Posted on: 2019-04-15
Degree: Master
Type: Thesis
Country: China
Candidate: X Du
Full Text: PDF
GTID: 2428330569479254
Subject: Computer Science and Technology
Abstract/Summary:
Data quality is important in database and big-data research and applications, and processing dirty data remains a challenge. Entity resolution (ER), one of the critical issues, is the process of extracting and matching records or tuples from one or more data sources (structured, semi-structured, or unstructured) that refer to the same real-world entity, and then merging the matching records into a single tuple or cluster. Traditional keyword query/search methods for relational databases assume clean data and perform no entity resolution. As a result, on dirty datasets, where duplicate tuples have different identifiers but refer to the same real-world entity, the answers to a query may contain duplicates. Moreover, traditional top-N query processing may fail to work correctly on such datasets, while traditional ER techniques and data-warehouse cleaning strategies cannot be applied directly at query time because they are offline and enormously expensive. Algorithms are therefore needed that integrate top-N keyword query processing with real-time ER, so that queries are answered and the retrieved results are deduplicated over dirty data.

This dissertation presents a method for processing top-N keyword queries based on real-time entity resolution. The method first constructs an index table that stores the words of each tuple and related information from the database system. An index is then created over the index table and used to determine the candidate set of a given top-N keyword query, and a similarity function is defined using information in the index such as term frequency and document frequency. A clustering algorithm for real-time entity resolution is designed using a divide-and-conquer approach. Finally, the deduplicated results of the top-N query are displayed.

In the experiments, three datasets are used, and a method called SIMPLE is defined as a baseline for measuring the performance of KEYSER, the method presented in this dissertation. The experimental results indicate that KEYSER outperforms SIMPLE by one to five orders of magnitude in ER processing, and that SIMPLE cannot meet the real-time requirement of a top-N keyword query. The effectiveness of the traditional query method is also compared with that of KEYSER. The results show that the traditional query method is ineffective on dirty data, whereas KEYSER, which combines top-N keyword querying with real-time entity resolution, is both efficient and effective on dirty and clean datasets alike.
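
The pipeline summarized in the abstract (index, candidate set, similarity function, deduplicated top-N results) can be illustrated with a minimal sketch. This is not the KEYSER algorithm itself: the abstract does not give the similarity formula or the divide-and-conquer clustering procedure, so the sketch assumes a TF-IDF-style weighting with cosine similarity and swaps in a simple greedy deduplication pass. All function names, the sample tuples, and the 0.6 threshold are hypothetical.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def build_index(tuples):
    """Inverted index: word -> {tuple_id: term frequency}."""
    index = defaultdict(dict)
    for tid, text in tuples.items():
        for word, tf in Counter(tokenize(text)).items():
            index[word][tid] = tf
    return index


def weight_vector(text, index, n_tuples):
    """TF-IDF-style weights; the exact formula is an assumption."""
    vec = {}
    for word, tf in Counter(tokenize(text)).items():
        df = len(index.get(word, {}))  # document frequency from the index
        if df:  # ignore words that occur in no tuple
            vec[word] = tf * math.log(1 + n_tuples / df)
    return vec


def cosine(u, v):
    """Cosine similarity between two sparse weight vectors."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top_n_dedup(query, tuples, n_results, threshold=0.6):
    """Rank candidates by similarity to the query, skipping near-duplicates.

    Greedy stand-in for the thesis's clustering step: a candidate is dropped
    if it is at least `threshold`-similar to a tuple already selected.
    """
    index = build_index(tuples)
    n = len(tuples)
    # Candidate set: tuples containing at least one query keyword.
    candidates = set()
    for word in tokenize(query):
        candidates.update(index.get(word, {}))
    vecs = {tid: weight_vector(tuples[tid], index, n) for tid in candidates}
    qvec = weight_vector(query, index, n)
    ranked = sorted(candidates, key=lambda t: (-cosine(qvec, vecs[t]), t))
    results = []
    for tid in ranked:
        if all(cosine(vecs[tid], vecs[r]) < threshold for r in results):
            results.append(tid)
        if len(results) == n_results:
            break
    return results


# Hypothetical dirty dataset: tuples 1 and 2 describe the same entity.
tuples = {
    1: "entity resolution in relational databases john smith",
    2: "entity resolution in relational databases j smith",
    3: "keyword search over graphs",
    4: "entity linking for text",
}
results = top_n_dedup("entity resolution", tuples, n_results=2)
print(results)
```

In this toy dataset, tuples 1 and 2 differ in a single word, so only one of them should appear in the result; a plain top-2 query without deduplication would return both. The greedy pass compares each candidate against every selected tuple, which is the kind of cost a divide-and-conquer clustering scheme would aim to reduce.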
Keywords/Search Tags: Entity resolution, Relational database, Similarity function, Top-N keyword query, Data quality