
Research On Entity Resolution Towards Uncertain Data Stream And Resource Optimization

Posted on: 2018-11-15
Degree: Master
Type: Thesis
Country: China
Candidate: B Zhu
Full Text: PDF
GTID: 2348330536452514
Subject: Software engineering
Abstract/Summary:
With the growth of the Internet and the Internet of Things, data has become increasingly important. Many online applications are built on data, and as a result a single entity object may be represented by many records. Entity resolution identifies the records that refer to the same real-world entity object and aggregates them. It supports data quality management and data integration, and is a key step in extracting value from data. Online systems generate data continuously. For uncertain data streams in particular, traditional batch computation cannot meet the demand for incremental processing. To address the new challenges that continuous data poses for entity resolution, a partition technique is used to partition records and build an inverted index, which yields an incremental processing model. The inverted index is treated as state, and the state is updated iteratively to run the incremental entity resolution method. State is the basis of entity resolution, and state management is its core operation. By studying state management, we propose incremental entity resolution methods and several optimizations. The main contributions of this thesis are as follows:

(1) We study the processes and techniques of entity resolution and conclude that existing methods, which rely on batch computation under limited memory, incur a large amount of redundant computation and waste resources and time.

(2) We propose an incremental processing framework for entity resolution that uses the updated state to process continuous data, avoiding the heavy time and space cost of recomputing over historical records and meeting the challenges of continuous data in the big data era.

(3) We propose Inc-Join, an incremental single-machine join algorithm that implements the framework on multi-core systems. Inc-Join avoids repeated computation while ensuring the completeness of the result. We then optimize the index with a dynamic partition strategy, which further improves efficiency.

(4) We propose Inp-Join, an incremental and parallel join algorithm that implements the framework on a Spark cluster and takes advantage of in-memory computing, so that entity resolution results are obtained in a timely manner.

(5) We study the resources of Spark and propose a prioritization scheme for the Spark cluster.

Theoretical analysis and experiments with Inc-Join and Inp-Join on real datasets show that our algorithms achieve high performance and outperform existing methods. In addition, the inverted index structure accurately filters out dissimilar records, which meets the needs of processing continuous data in the big data era.
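The idea of treating an inverted index as state that is updated as records arrive can be illustrated with a minimal sketch. This is not the thesis's Inc-Join or Inp-Join algorithm; it is a simplified single-threaded illustration under assumptions that do not come from the abstract: records are tokenized on whitespace, similarity is Jaccard over token sets, the threshold of 0.5 is arbitrary, and matched records are aggregated with a union-find structure. The class and function names are hypothetical.

```python
def tokenize(text):
    """Split a record into a set of lowercase tokens (assumed tokenization)."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity of two token sets (assumed similarity measure)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


class IncrementalResolver:
    """Hypothetical incremental resolver that keeps an inverted index as state."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold  # assumed value, not from the source
        self.index = {}    # token -> set of record ids containing it (the state)
        self.tokens = {}   # record id -> token set
        self.parent = {}   # union-find parent pointers for aggregation

    def _find(self, rid):
        # Find the cluster root, halving the path as we go.
        while self.parent[rid] != rid:
            self.parent[rid] = self.parent[self.parent[rid]]
            rid = self.parent[rid]
        return rid

    def add(self, rid, text):
        """Resolve one new record against the state, then update the state."""
        toks = tokenize(text)
        self.tokens[rid] = toks
        self.parent[rid] = rid
        # Only records sharing at least one token are candidates, so
        # historical records with no overlap are never compared.
        candidates = set()
        for t in toks:
            candidates.update(self.index.get(t, ()))
        matches = []
        for cid in sorted(candidates):
            if jaccard(toks, self.tokens[cid]) >= self.threshold:
                matches.append(cid)
                self.parent[self._find(cid)] = self._find(rid)
        # State update: the new record becomes visible to later arrivals.
        for t in toks:
            self.index.setdefault(t, set()).add(rid)
        return matches

    def clusters(self):
        """Return the current groups of record ids, one group per entity."""
        groups = {}
        for rid in self.parent:
            groups.setdefault(self._find(rid), []).append(rid)
        return sorted(sorted(g) for g in groups.values())


resolver = IncrementalResolver(threshold=0.5)
resolver.add(1, "Apple iPhone 7 32GB")
resolver.add(2, "apple iphone 7 32gb black")
resolver.add(3, "Samsung Galaxy S8")
resolver.add(4, "samsung galaxy s8 64gb")
print(resolver.clusters())
```

Each call to `add` compares the new record only with candidates retrieved from the index, so earlier records are not compared with each other again. A token-level index like this one filters only records with no shared token; the dynamic partition strategy and the Spark-based parallelism mentioned in the abstract are outside the scope of this sketch.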
Keywords/Search Tags:entity resolution, incremental, Spark cluster, inverted index, parallel computing