
Research On Iterative SNM-based Entity Resolution Method And Optimization Strategies

Posted on: 2015-05-26  Degree: Master  Type: Thesis
Country: China  Candidate: T M Wang  Full Text: PDF
GTID: 2348330482955997  Subject: Computer application technology
Abstract/Summary:
Entity resolution is a difficult problem in data mining, information fusion, and related fields. It determines whether records from one or more data sources represent the same entity. Detecting and merging the duplicate records produced during data integration can effectively eliminate inconsistencies within a data source or between data sources. However, with the development of the Internet, network data is growing explosively. How to apply entity resolution effectively in complex, large-scale data environments has become a research hotspot for scholars in data mining, information fusion, and related fields around the world. The goal is to obtain either high-quality recognition results or high resolution efficiency. Correspondingly, current work focuses on either iteration-based entity resolution or sorted neighborhood method (SNM)-based entity resolution. The former iteratively merges similar records to achieve higher precision and recall, but it has high time complexity. The latter compares only the records that fall within the same sliding window, which keeps matching efficient but makes the quality of the recognition results difficult to guarantee. In this thesis we propose an iterative SNM-based entity resolution method that combines the advantages of the two approaches, aiming for both high-quality results and high performance.
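As an illustration of the sliding-window idea behind SNM, the following is a minimal sketch and not the thesis's implementation. The function names `snm_pairs` and `similar`, the sample records, the sorting key, and the 0.8 threshold are all assumptions made for this example; string similarity is approximated with Python's standard `difflib.SequenceMatcher`.

```python
from difflib import SequenceMatcher


def snm_pairs(records, key, window):
    """Yield the index pairs that a basic sorted-neighborhood pass compares.

    Records are sorted by `key`; each record is then paired only with the
    next `window - 1` records in sorted order (hypothetical helper).
    """
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    for pos, i in enumerate(order):
        for j in order[pos + 1 : pos + window]:
            yield (i, j)


def similar(a, b, threshold=0.8):
    """Assumed match rule: character-level similarity above a threshold."""
    return SequenceMatcher(None, a, b).ratio() >= threshold


# Illustrative data only; not taken from the thesis.
records = ["john smith", "jon smith", "mary jones", "john smyth", "mary jonas"]

# With a window of 3, only neighbors in sorted order become candidate pairs.
pairs = list(snm_pairs(records, key=lambda r: r, window=3))
matches = [(i, j) for i, j in pairs if similar(records[i], records[j])]

all_pairs = len(records) * (len(records) - 1) // 2
print(f"compared {len(pairs)} of {all_pairs} possible pairs")
print("matches:", matches)
```

The window size is the trade-off described above: a smaller window means fewer comparisons, but true duplicates that sort far apart are never compared.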
The main contributions of this thesis are as follows:
(1) We systematically review domestic and international research on the entity resolution problem, briefly summarize representative related work, point out its advantages and disadvantages, and analyze the deficiencies of current research.
(2) We propose an iterative SNM-based entity resolution method called SIER (Entity Resolution Method based on Iterative SNM), which divides the entity resolution process into two stages. In the first stage, records are initially matched using a sliding window. In the second stage, the matching result is corrected iteratively to improve its quality. Only the records within the iterating windows are compared, which keeps the algorithm efficient.
(3) To improve SIER, we put forward two optimization strategies: ISIER, which is based on record tags, and IISIER, which is based on cluster tags. Both effectively reduce unnecessary comparisons and further improve the efficiency of SIER.
(4) The experimental results verify the feasibility and effectiveness of the key techniques proposed in this thesis. To achieve the same resolution quality as other entity resolution methods, SIER needs far fewer record-pair comparisons. In addition, compared with SIER, the two optimization strategies ISIER and IISIER significantly improve the efficiency of entity resolution.
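The abstract does not give the details of SIER, ISIER, or IISIER, so the sketch below only illustrates the general idea of combining windowed comparison with iterative merging. Everything in it is an assumption for illustration: the function names `iterative_snm` and `clusters_match`, the "any member pair is similar" merge rule, the cluster sort key, and the `rejected` set, which is a simple stand-in for avoiding repeated comparisons and is not the tag mechanism of ISIER or IISIER.

```python
from difflib import SequenceMatcher


def clusters_match(c1, c2, records, threshold):
    """Assumed rule: clusters match if any pair of their records is similar."""
    return any(
        SequenceMatcher(None, records[i], records[j]).ratio() >= threshold
        for i in c1
        for j in c2
    )


def iterative_snm(records, window=3, threshold=0.8):
    """Windowed matching repeated until no more clusters can be merged."""
    clusters = [frozenset([i]) for i in range(len(records))]
    rejected = set()   # cluster pairs already compared and found non-matching
    comparisons = 0    # number of cluster-to-cluster comparisons performed
    changed = True
    while changed:
        changed = False
        # Re-sort clusters; the key (smallest member string) is hypothetical.
        clusters.sort(key=lambda c: min(records[i] for i in c))
        for pos in range(len(clusters)):
            for nxt in range(pos + 1, min(pos + window, len(clusters))):
                a, b = clusters[pos], clusters[nxt]
                pair = frozenset([a, b])
                if pair in rejected:
                    continue  # both clusters unchanged since last comparison
                comparisons += 1
                if clusters_match(a, b, records, threshold):
                    # Merge, then restart the pass over the new cluster list.
                    clusters = [c for c in clusters if c not in (a, b)]
                    clusters.append(a | b)
                    changed = True
                    break
                rejected.add(pair)
            if changed:
                break
    return clusters, comparisons


# Illustrative data only; not taken from the thesis.
records = ["john smith", "jon smith", "mary jones", "john smyth", "mary jonas"]
clusters, comparisons = iterative_snm(records, window=2)
print(sorted(sorted(c) for c in clusters), "comparisons:", comparisons)
```

Restarting the pass after every merge keeps the sketch short; a practical implementation would more likely update the sorted order incrementally.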
Keywords/Search Tags: entity resolution, iteration, sorted neighborhood method (SNM), entity matching, optimization strategies