Research On The Method Of Entity Resolution In Big Data Environment

Posted on: 2019-01-12
Degree: Master
Type: Thesis
Country: China
Candidate: N Zhan
Full Text: PDF
GTID: 2428330590965955
Subject: Software engineering
Abstract/Summary:
People increasingly demand high-quality information, and obtaining complete, correct, and useful information quickly from massive data has become a research hotspot. Describing a thing completely requires describing it from multiple aspects, and that information is likely to come from several different data sources. In a big data environment, data from multiple sources with different structures often lacks uniformity, accuracy, and completeness, so entity resolution is particularly important in data fusion. However, because of their high time complexity, traditional entity resolution methods cannot handle massive data sets. The focus of entity resolution research is therefore to improve efficiency while preserving effectiveness. The main work of this thesis is as follows.

First, to address the low matching efficiency of entity resolution and its difficulty in handling large data sets, an entity resolution algorithm based on pattern rapid scanning (PRSER) is proposed, building on the IterER algorithm. The algorithm divides the data into multiple blocks and uses the pattern rapid scanning algorithm (PRSA) to filter out the elements that the records within each block have in common, so that only the differing elements are compared, which reduces pattern matching time. It then uses the pattern extraction algorithm (PEA) to obtain a common pattern that represents a set of similar records. Experiments on the Spark platform show that PRSER is more efficient than IterER.

Second, the PRSER algorithm adds unrelated instances during pattern extraction, which reduces the effectiveness of entity resolution. To address this, an entity resolution algorithm based on token index filtering (TIFER) is proposed. The algorithm sorts the records in each block, constructs a token index table by splitting the records into tokens, and uses the index table to find the record pairs with high similarity for the subsequent exact matching step. Because fewer redundant patterns take part and fewer unrelated instances are added, the accuracy of entity resolution improves, and the algorithm can also match similar records that would otherwise fail to match because a substring changed position. Experiments on the Spark platform show that the F-value of TIFER is generally higher than that of PRSER.

In summary, studying entity resolution methods that combine pattern matching with a parallel computing framework, in order to improve the efficiency and effectiveness of entity resolution algorithms in a big data environment, has important theoretical and practical significance.
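The token index filtering idea described in the abstract can be illustrated with a minimal Python sketch. This is not the thesis's TIFER implementation: the whitespace tokenizer, the Jaccard similarity used for the exact matching step, the `min_shared` and `threshold` parameters, and all function names are assumptions introduced here for illustration, and the record sorting step and the Spark parallelization are omitted. The sketch only shows how an inverted token index can limit exact comparison to record pairs that share tokens, regardless of where those tokens appear in the record.

```python
from collections import defaultdict
from itertools import combinations


def tokenize(record):
    """Split a record into a set of lowercase whitespace-separated tokens
    (assumed tokenizer, not taken from the thesis)."""
    return set(record.lower().split())


def build_token_index(block):
    """Map each token to the ids of the records in the block that contain it."""
    index = defaultdict(set)
    for rid, record in enumerate(block):
        for token in tokenize(record):
            index[token].add(rid)
    return index


def candidate_pairs(block, min_shared=2):
    """Return record-id pairs that share at least `min_shared` tokens."""
    shared = defaultdict(int)
    for ids in build_token_index(block).values():
        for a, b in combinations(sorted(ids), 2):
            shared[(a, b)] += 1
    return {pair for pair, count in shared.items() if count >= min_shared}


def jaccard(a, b):
    """Token-set Jaccard similarity, standing in for the exact matching step."""
    ta, tb = tokenize(a), tokenize(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def resolve(block, min_shared=2, threshold=0.6):
    """Compare only the candidate pairs and keep those above `threshold`."""
    return [
        (a, b)
        for a, b in sorted(candidate_pairs(block, min_shared))
        if jaccard(block[a], block[b]) >= threshold
    ]


if __name__ == "__main__":
    block = [
        "Apple iPhone 6 16GB",
        "iPhone 6 Apple 16GB",      # same tokens, different positions
        "Samsung Galaxy S5",
        "Galaxy Tab Samsung 10",
    ]
    print(resolve(block))
```

In this sketch the first two records become a candidate pair and pass the similarity check even though their tokens appear in a different order, while the two Samsung records are compared but rejected. Pairs that share fewer than `min_shared` tokens are never compared at all, which is where the filtering saves work.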
Keywords/Search Tags: multi-source heterogeneous, big data, entity resolution, pattern matching, token index