Research On Data Fusion For Web Data Integration

Posted on: 2013-01-08
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y X Zhang
GTID: 1118330374480713
Subject: Computer software and theory
Abstract/Summary:
With the rapid development of the Internet, Web technology has spread worldwide because of its breadth, interactivity, speed and openness, and has permeated many fields of society. Integrating large-scale, high-value Web information accurately and efficiently is particularly important for analytical applications such as market intelligence, public opinion analysis and business intelligence, and has good application value and broad prospects. However, Web data take more varied forms, are expressed more freely and are released more casually than the data sources of traditional data integration. These characteristics make integrated results more redundant, inaccurate and scattered, and greatly affect the quality of the integrated data. Eliminating redundancy, identifying the truth, linking data and fusing the integrated results therefore plays an important role both in the quality of integrated data and in further analysis and mining.

As an important part of Web data integration, Web data fusion is the quality assurance of integrated data and the precondition of accurate analysis and mining, and has become one of the current hot research topics. The thesis focuses on improving the quality of integrated data and on providing high-quality data support for analytical applications. However, because Web data take various forms, are freely expressed and are casually released, the following issues need to be resolved in research on Web data fusion:

(1) Because Web data take various forms and different expressions of the same object vary greatly, Web data fusion first needs to identify the different references to the same entity in order to construct a full view of the target entity.

(2) Because Web data are released freely and information providers vary in quality, incomplete, out-of-date and even false information is widespread on the Web.
Therefore, Web data fusion needs to resolve data conflicts among the data from different data sources.

(3) Web data integration is concerned with more varied entity types than traditional data integration, including not only information about the target entity but also related entities around it. Therefore, Web data fusion needs to link the multi-angle data about the same entity in order to provide users with more comprehensive entity views.

(4) Web data fusion as a whole is a black-box process for users, so the data fusion process lacks interpretability and debuggability. Therefore, a trackable mechanism for data fusion needs to be constructed so that users can derive the origin and evolution of the data and participate manually in the data fusion process.

This thesis aims to improve the quality of integrated data in Web data integration and focuses on the issues above. The main research work and contributions are as follows:

(1) Because Web entities often have many expression variants and missing attribute values, a collective approach based on quasi-clique similarity is proposed for entity resolution. The approach can improve the accuracy of Web entity resolution.

To consolidate different entity expressions and eliminate redundant data, the different references to the same entity must be identified, i.e. entity resolution. To solve this issue, a collective approach based on quasi-clique similarity is proposed, which resolves each matching pair according to the mutual reinforcement of matching decisions and can improve the accuracy of entity resolution. For similarity measurement, the approach combines three measures, namely attribute-based, context-based and relation-based similarity, and can thereby overcome the limitations caused by varied entity expressions and missing attribute values.
In particular, the approach uses quasi-cliques to measure relation similarity, which enhances the accuracy of entity resolution. For efficiency, the approach uses blocking to group the references to be matched by candidate key.

(2) Because Web data integration is dynamic and Web information is released freely, a 2-layer approach based on Markov Logic Networks is proposed for data conflict resolution. The approach can effectively resolve data conflicts in Web information.

To identify the truth from Web data, data conflicts in Web information must be resolved. To solve this issue, a 2-layer approach based on Markov Logic Networks is proposed. The approach divides attributes according to their degree of conflict and carries out data conflict resolution in 2 stages. Because it considers the influence of weakly conflicting attributes on strongly conflicting ones, the approach can improve accuracy effectively. By observing and analyzing the characteristics of conflicting data and data sources, we extract and use multi-angle features and rules for true value inference. Experimental results on a large amount of real-world data collected from two domains show that the proposed approach can effectively combine these features and rules and significantly improve the accuracy of data conflict resolution.

(3) Because Web data sources are strongly autonomous and entity expressions are inconsistent, an approach based on 2-layer Conditional Random Fields is proposed for linking related unstructured data with structured entities. The approach can effectively match reviews with database entities.

To match related entities with target entities, links must be constructed between related unstructured data and structured entities.
To solve this issue, an approach based on 2-layer Conditional Random Fields is proposed for linking reviews with database objects, which leverages the integrated structured entities and significantly reduces the dependence on manually labeled training data. For named entity recognition, the approach employs a semi-Markov CRF to recognize the entities in reviews and exploits a variety of clues, including entity-level dictionary features, thereby effectively handling entity variety and improving the accuracy of entity recognition. Finally, our experiments and extensive analysis show that the approach can effectively match reviews to database entities and can improve the matching accuracy.

(4) Because the different data fusion stages are mutually isolated and the data fusion process lacks interpretability, a trackable mechanism for data fusion is proposed. The mechanism enables users to derive the origin and evolution of the data.

To make fusion results interpretable and the fusion process debuggable, a trackable mechanism for data fusion must be constructed. To solve this issue, a trackable mechanism based on data provenance is proposed. In this thesis, PI-CS is used to express the data origin, and it is more accurate than traditional Lineage-CS. To record the evolution of fusion results, we propose two transformation provenances: ER Provenance, which records the process of entity resolution, and DCR Provenance, which records the process of data conflict resolution.
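
As a rough illustration of the blocking step mentioned under contribution (1), the sketch below groups references by a candidate key and compares pairs only within each block. It is not the thesis's method: the key function (`blocking_key`, first token of the name), the token-set Jaccard similarity, the 0.5 threshold and the sample records are all assumptions made for this example, and the simple attribute similarity stands in for the combined attribute, context and quasi-clique relation similarity described in the abstract.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    """Hypothetical candidate key: lower-cased first token of the name."""
    tokens = record["name"].lower().split()
    return tokens[0] if tokens else ""


def jaccard(a, b):
    """Token-set Jaccard similarity between two strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def candidate_pairs(records):
    """Group records by blocking key and yield pairs only within a block."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec)].append(rec)
    for block in blocks.values():
        yield from combinations(block, 2)


def match(records, threshold=0.5):
    """Return id pairs whose name similarity reaches the (assumed) threshold."""
    return [
        (a["id"], b["id"])
        for a, b in candidate_pairs(records)
        if jaccard(a["name"], b["name"]) >= threshold
    ]


# Made-up sample references, for illustration only.
records = [
    {"id": 1, "name": "Canon EOS 600D"},
    {"id": 2, "name": "canon eos 600d kit"},
    {"id": 3, "name": "Nikon D3100"},
    {"id": 4, "name": "Canon PowerShot A2200"},
]
matches = match(records)
```

With these four sample records, blocking reduces the comparisons from all 6 possible pairs to the 3 pairs inside the "canon" block, which is the efficiency gain the abstract attributes to blocking.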
Keywords/Search Tags: Web Data Integration, Data Fusion, Entity Resolution, Data Linking, Data Conflict Resolution