Research On The Method Of Entity Resolution In Big Data Environment

Posted on:2019-01-12

Degree:Master

Type:Thesis

Country:China

Candidate:N Zhan

GTID:2428330590965955

Subject:Software engineering

Abstract/Summary:

People increasingly demand high-quality information, and obtaining complete, correct, and useful information quickly from massive data has become a research hotspot. Describing a thing completely requires describing it from multiple aspects, and that information is likely to come from several different data sources. In a big data environment, data from multiple sources with different structures often lacks unity, accuracy, and completeness, so entity resolution is particularly important in data fusion. However, because of their high time complexity, traditional entity resolution methods cannot handle massive data sets. The focus of entity resolution research is therefore to improve efficiency while preserving effectiveness. The main work of this thesis is as follows.

First, to address the low matching efficiency of entity resolution and its difficulty in handling large data sets, an entity resolution algorithm based on pattern rapid scanning (PRSER) is proposed, building on the IterER algorithm. The algorithm divides the data into multiple blocks and uses the pattern rapid scanning algorithm (PRSA) to filter out the elements that the records within each block have in common. Only the differing elements are compared, which reduces pattern matching time. The pattern extraction algorithm (PEA) is then used to obtain a common pattern that represents a set of similar records. Experiments on the Spark platform show that PRSER is more efficient than IterER.

Second, PRSER adds more unrelated instances during pattern extraction, which lowers the effectiveness of entity resolution. To address this, an entity resolution algorithm based on token index filtering (TIFER) is proposed. The algorithm sorts the records in each block, splits them to construct a token index table, and uses the index table to find the record pairs with high similarity for the subsequent exact match. By reducing the participation of redundant patterns and avoiding the addition of unrelated instances, TIFER improves the accuracy of entity resolution. It can also match similar records that would otherwise fail to match because a substring has changed position. Experiments on the Spark platform show that the F-value of TIFER is generally higher than that of PRSER.

In summary, combining pattern matching with a parallel computing framework to improve the efficiency and effectiveness of entity resolution in a big data environment has both theoretical and practical significance.
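
The token index filtering idea described in the abstract can be illustrated with a small sketch. This is not the thesis's TIFER implementation, whose details the abstract does not give. It assumes whitespace tokenization, Jaccard similarity over token sets, and a threshold of 0.5, and the function names are hypothetical. It shows only how an index from tokens to record ids limits scoring to record pairs that share at least one token, and why token order does not affect the result.

```python
from collections import defaultdict
from itertools import combinations


def tokenize(record):
    """Split a record into a set of lowercase tokens (assumed: whitespace split)."""
    return set(record.lower().split())


def build_token_index(records):
    """Map each token to the ids of the records in the block that contain it."""
    index = defaultdict(set)
    for rid, record in enumerate(records):
        for token in tokenize(record):
            index[token].add(rid)
    return index


def candidate_pairs(records, threshold=0.5):
    """Return sorted record-id pairs whose token-set Jaccard similarity
    is at least `threshold` (assumed similarity measure and cutoff).

    Pairs that share no token never appear in the index and are never scored.
    """
    index = build_token_index(records)

    # Count shared tokens per pair by walking each token's posting set.
    shared = defaultdict(int)
    for ids in index.values():
        for i, j in combinations(sorted(ids), 2):
            shared[(i, j)] += 1

    token_sets = [tokenize(r) for r in records]
    result = []
    for (i, j), common in shared.items():
        union = len(token_sets[i]) + len(token_sets[j]) - common
        if common / union >= threshold:
            result.append((i, j))
    return sorted(result)


if __name__ == "__main__":
    block = [
        "Apple iPhone 7 32GB",
        "iPhone 7 Apple 32GB black",  # same tokens in a different order
        "Samsung Galaxy S8",
    ]
    print(candidate_pairs(block))
```

Because the comparison works on token sets, the first two records are paired even though "Apple" appears at a different position in each. The pairs returned here would then go on to a stricter exact-match step, which this sketch leaves out.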

Keywords/Search Tags:

multi-source heterogeneous, big data, entity resolution, pattern matching, token index

Related items

1	Entity Resolution Technology Research Based On Multi-Source Data
2	Pattern Matching Method Of Heterogeneous Data Based On Attention Mechanism
3	Research On Entity Resolution For Heterogeneous Big Data Integration
4	Research On Key Techniques Of Entity Resolution For Big Data Integration
5	Entity Matching Across Multiple Heterogeneous Open Data Sources
6	Heterogeneous Entity Consistency Modeling And Truth Discovery Under Multi-source
7	Research On Multi Data Source Entity Matching In The Construction Of Knowledge Map
8	Automated Comparative Table Generation For Facilitating Human Intervention In Multi-Entity Resolution
9	Research On Entity Resolution Towards Uncertain Data Stream And Resource Optimization
10	Implement Method Research On The Integration Of Multi-Source Heterogeneous Data Based On XML