Research On Entity Resolution In Web Data Integration

Posted on: 2012-02-24
Degree: Master
Type: Thesis
Country: China
Candidate: A T Dong
Full Text: PDF
GTID: 2218330338962896
Subject: Computer software and theory
Abstract/Summary:
With the rapid development of the Internet and the web, the volume of web data is growing exponentially. As demands on this data grow, keyword and full-text search can no longer meet users' needs. In recent years, research on web data management and querying has turned toward finer-grained queries and data integration. Web data has no unified, stable model; it is irregular and changes constantly, so it contains a large amount of dirty data. Dirty data arises from spelling errors, illegal values, null values, inconsistent values, abbreviations and so on, and it seriously undermines the reliability of the data. To manage useful information on the web efficiently, researchers proposed the concept of web data integration: reorganizing web data from many data sources according to the relationships among them, so that it can serve subsequent uses.

With the development of information technology, a great deal of commercial information, national economic policy and other information is available on the Internet, which has become a major source for market intelligence analysis. Small and medium enterprises, with limited funding and few technical staff, are often unable to carry out thorough market intelligence analysis. Managing and using web data efficiently, and thereby improving the efficiency of market intelligence analysis, is therefore of great economic significance. Because web data is dynamic, diverse, and semi-structured or unstructured, obtaining valuable information from it quickly and accurately for market intelligence is a major challenge.

A web data integration framework generally consists of the following parts: domain model construction, data source processing, data extraction, pattern matching, and entity resolution. Entity resolution is one of the most critical issues in such a framework.
To avoid "garbage in, garbage out", we must strive to improve data quality. Data quality directly affects the quality of service provided to users, so entity resolution must be addressed in any web data integration framework. After data extraction and pattern matching, entity resolution may face two situations. In the first, the entities are of a single type. Here, entity resolution (ER) is the problem of identifying which references in a database refer to the same real-world entity, and every effort should be made to reduce the number of comparisons. In the second, entities of multiple types exist in the data warehouse. Here, references of all types that refer to the same real-world entity must be identified, and earlier results should be used to improve later resolution steps.

Web data integration poses new challenges for entity resolution, and we propose new solutions to them:

1) For single-type entities, we propose an efficient blocking-based approach. References are divided into two categories: different references that refer to the same real-world entity, and identical references that refer to different real-world entities. The two categories are resolved by two different algorithms that reinforce each other, which greatly improves both the efficiency and the accuracy of entity resolution.

2) For multi-type entities, we propose a new solution that resolves entities of multiple types collectively and uses earlier results to improve subsequent resolution.
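
The abstract does not spell out the blocking algorithm, so the following is only a generic sketch of how blocking can reduce the number of comparisons in single-type entity resolution. It is not the method proposed in the thesis. The blocking key (first three alphanumeric characters), the similarity measure (`difflib.SequenceMatcher`), the 0.8 threshold, and the names `blocking_key`, `similar`, and `resolve` are all assumptions chosen for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def blocking_key(name):
    """Hypothetical blocking key: first three alphanumeric characters, lowercased."""
    return "".join(ch for ch in name.lower() if ch.isalnum())[:3]


def similar(a, b, threshold=0.8):
    """Assumed similarity test: case-insensitive string ratio at or above the threshold."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def resolve(references, threshold=0.8):
    """Cluster references that appear to denote the same entity.

    Only references that share a blocking key are compared, so the number
    of pairwise comparisons is usually far below n * (n - 1) / 2.
    Returns (clusters, number_of_comparisons).
    """
    # Step 1: group reference indices into blocks by their key.
    blocks = defaultdict(list)
    for idx, ref in enumerate(references):
        blocks[blocking_key(ref)].append(idx)

    # Step 2: union-find structure used to merge matching references.
    parent = list(range(len(references)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Step 3: compare pairs within each block only.
    comparisons = 0
    for members in blocks.values():
        for i, j in combinations(members, 2):
            comparisons += 1
            if similar(references[i], references[j], threshold):
                parent[find(i)] = find(j)

    # Step 4: collect the resulting clusters.
    clusters = defaultdict(list)
    for idx, ref in enumerate(references):
        clusters[find(idx)].append(ref)
    return list(clusters.values()), comparisons


if __name__ == "__main__":
    refs = ["Acme Corporation", "ACME Corporation.", "Acme Corp",
            "Globex Inc", "Globex Inc.", "Initech"]
    clusters, comparisons = resolve(refs)
    print(clusters)
    print("comparisons:", comparisons)
```

A key this coarse trades recall for speed: references to the same entity that differ in their first characters land in different blocks and are never compared, which is one reason practical systems often combine several blocking keys.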
Keywords/Search Tags: web data integration, entity resolution, references, matching-relation graph