Research On Entity Resolution Towards Uncertain Data Stream And Resource Optimization

Posted on:2018-11-15

Degree:Master

Type:Thesis

Country:China

Candidate:B Zhu

Full Text:PDF

GTID:2348330536452514

Subject:Software engineering

Abstract/Summary:

With the development of the Internet and the Internet of Things, the importance of data cannot be denied. Many online applications are built on data, so a single entity object may be represented by many records. Entity resolution identifies the records that refer to the same real-world entity object and aggregates them. It is effective for data quality management and data integration, and it is a key step in finding the value of data. Online systems generate data continuously. Especially under the influence of uncertain data streams, traditional batch computation cannot meet the demand for incremental processing. To face the new challenges that continuous data brings to entity resolution, we use a partition technique to partition records and build an inverted index, which yields an incremental processing model. The inverted index is treated as state, and the state is updated iteratively to run the incremental entity resolution method. State is the basis of entity resolution, and state management is its core operation. By studying state management, we propose incremental entity resolution methods and put forward several optimization ideas.

The main contributions of this thesis are as follows:
(1) We study the processes and techniques of entity resolution and conclude that existing methods, which rely on limited memory and batch computation, incur much redundant computation and waste resources and time.
(2) We propose an incremental processing framework for entity resolution that uses the updated state to process continuous data. It avoids the heavy time and space cost of recomputing over historical records and meets the challenges posed by continuous data in the big data era.
(3) We propose Inc-Join, an incremental single-machine join algorithm that implements the framework on a multi-core system. Inc-Join not only avoids repeated computation but also ensures the completeness of the result. We then optimize the index with a dynamic partition strategy, which further improves efficiency.
(4) We propose Inp-Join, an incremental and parallel join algorithm that implements the framework in a Spark cluster environment. It takes advantage of in-memory computing to achieve high efficiency, so entity resolution results can be obtained in time.
(5) We study the resources of Spark and propose a prioritization scheme for the Spark cluster.

Theoretical analysis and experiments with Inc-Join and Inp-Join on real datasets show that our algorithms achieve high performance and outperform existing methods. In addition, the inverted index structure accurately filters out dissimilar records, which meets the needs of processing continuous data in the big data era.
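
The idea of treating an inverted index as state that is updated as records arrive can be illustrated with a small sketch. This is not the Inc-Join or Inp-Join algorithm from the thesis, whose details are not given in the abstract. The class name IncrementalResolver, whitespace tokenization, Jaccard similarity, and the 0.5 threshold are all assumptions chosen only for illustration.

```python
from collections import defaultdict


class IncrementalResolver:
    """Toy incremental entity resolution (illustrative; not the thesis's Inc-Join)."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold      # assumed Jaccard similarity cutoff
        self.index = defaultdict(set)   # state: token -> ids of records containing it
        self.tokens = {}                # record id -> token set of that record

    def add(self, record_id, text):
        """Process one arriving record; return ids of earlier records it matches."""
        toks = set(text.lower().split())  # assumed tokenization: lowercase words

        # Candidate generation: only earlier records sharing at least one token
        # are compared, so dissimilar records are filtered out by the index.
        candidates = set()
        for t in toks:
            candidates |= self.index[t]

        matches = []
        for cid in sorted(candidates):
            other = self.tokens[cid]
            jaccard = len(toks & other) / len(toks | other)
            if jaccard >= self.threshold:
                matches.append(cid)

        # State update: the new record joins the index, so later arrivals are
        # compared against it without reprocessing historical records.
        self.tokens[record_id] = toks
        for t in toks:
            self.index[t].add(record_id)
        return matches


resolver = IncrementalResolver(threshold=0.5)
resolver.add(1, "apple iphone 7 32gb")
resolver.add(2, "samsung galaxy s8")
print(resolver.add(3, "Apple iPhone 7 32GB black"))  # [1]
```

In this sketch each arriving record is compared only against the candidates returned by the index, and the index is the only state carried between arrivals. A batch approach would instead recompute all pairwise comparisons whenever new data appears.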

Keywords/Search Tags:

entity resolution, incremental, Spark cluster, inverted index, parallel computing

Related items

1	Parallel Search On Ciphertext Based On Index In Cloud Computing
2	Design And Implementation Of A Distributed Hybrid Index Structure Based On Spark
3	Design And Implementation Of Multi-Keyword Parallel Ciphertext Retrieval System Based On Inverted Index
4	Data Index Technology Research Based On Parallel Computing Platform
5	Research On Algorithm For Incremental Updating Association Mining Based On Inverted Index
6	Research And Application On Three-Decision KNN Algorithm Based On Incremental Learning
7	Research On Entity Resolution Method Of Industrial Internet Of Things Data
8	A Research Of Full-Text Retrieval Based On Inverted Index
9	Research And Implementation Of Memory Optimization Based On Parallel Computing Engine Spark
10	Research On Parallel Computing Based On Spark