
Heterogeneous Entity Consistency Modeling And Truth Discovery Under Multi-source

Posted on: 2018-03-10 | Degree: Doctor | Type: Dissertation
Country: China | Candidate: S Yang | Full Text: PDF
GTID: 1318330512986004 | Subject: Computer application technology
Abstract/Summary:
Nowadays, the explosive growth of real-world data and the increasing availability of computing resources provide great opportunities for commercial value mining. As the foundation of data analysis and exploitation, data quality plays a significant role in the effectiveness of commercial decision-making. However, the irregular data quality caused by large data volumes, diverse data types, and varied acquisition methods inevitably leads to errors in knowledge exploitation and decision making. As one of the major causes of low data availability, entity inconsistency produces conflicts among the descriptions that different data sources give of the same entity. The consistency and validity of entity data are therefore critical to improving both data management and data quality. A growing line of work has addressed data cleaning, data availability improvement, data source independence maintenance, data fusion, and related problems, and extensive academic results have been obtained for traditional databases and structured data objects. However, cross-source, heterogeneous, and large-scale Web data aggravate the low portability and low scalability of these traditional methods, which has become their bottleneck. In light of this, this work aims to improve data quality, and further to provide intelligent support for users' decision making, by leveraging heterogeneous entities across multiple data sources. To this end, this work makes in-depth studies of: entity identity recognition and determination for cross-source entities, based on object similarity measurement and feature correlation analysis; truth probability distribution inference over the identified entities; and a unified, automatic data quality calculation and evaluation framework for large-scale full-sample data sources, built on several novel evaluation criteria derived from entity consistency and the correctness of object attributes in each data source.

The main work and contributions of this paper can be summarized as follows.

1. This work proposes an identity recognition model for heterogeneous entities in multi-source environments. Entity identity recognition and entity data consistency maintenance in a multi-source heterogeneous environment are the basis of data cleaning and knowledge fusion. Because multi-source data are heterogeneous, cross-domain, and inconsistent, the processing methods designed for structured data in traditional databases suffer severe losses in both efficiency and accuracy. To cope with this, this work proposes IBJI, a joint iterative method for entity recognition based on object similarity measurement and feature relevance analysis, which offers adaptive, high-precision entity identity measurement. Specifically, a non-linear similarity model and a multidimensional weight-parameter optimization method are first constructed to measure the similarity of various objects accurately and consistently. To address the feature-missing problem caused by the diversity of heterogeneous features and the limitations of training sets, an optimized iterative model is also proposed to estimate the weights and parameters of unknown features, enabling joint entity recognition on multi-source heterogeneous data through object relationship optimization, automatic training-set expansion, and feature correlation analysis. Experiments on both homogeneous and heterogeneous datasets demonstrate that, across data dimensions and scales, the proposed method characterizes entity identity more precisely and adaptively than the benchmark clustering method, the ABS entity identification method, and the relation-based entity identification method.

2. This work constructs a truth discovery model for multi-source heterogeneous Web data. The descriptions provided by multiple Web data sources are often inconsistent or even conflicting, which poses a great challenge to users' decision-making. Truth discovery is one of the primary means of solving this problem. Most existing methods are based on heuristic iterative voting, which takes the most widely claimed value as the truth. However, they ignore the positive correlation between a source's reliability (its weight) and the authenticity of the data it provides, so their accuracy cannot meet the anticipated target. This work proposes a Multi-objective Constraint-based Composite Gaussian Model (MCCGM) to cope with the challenges posed by complex entity features, diverse data categories, and randomly distributed conflicts. Specifically, after an in-depth study of (1) the relationship between source weights and the expectations of the Gaussian distributions, (2) the cluster structure of the values claimed by multiple sources and the probability distribution of the truths, and (3) the interactions among the truths of an object's multiple attributes, a probabilistic model is constructed to jointly formalize truth discovery over the dependent features of multiple objects. Moreover, an improved EM iteration method is proposed to support fast convergence. Extensive experiments on weather forecast, flight, and e-commerce datasets demonstrate that MCCGM is more precise than state-of-the-art methods.

3. This work constructs a quality evaluation model for heterogeneous e-commerce data sources. Ranking data sources is the key to assisting Web information selection. Most existing ranking approaches depend on manual scores or the competitive ranking scores of particular search engines, and thus carry strong subjective bias. This work presents an objective, intelligent ranking method that automatically and uniformly calculates, scores, and ranks Web data sources of the same type and domain. Based on the identity of the heterogeneous entities, the paper defines 14 data-source quality evaluation criteria according to each source's contribution to the authenticity of entity data, applying maximum-difference normalization to each quality value in both the positive and negative directions. Each quality value over the full-sample distribution is then computed by kernel density estimation with a Gaussian kernel. Further, the paper presents CDQA, a standard-metrics-based complete-dataset quality assessment method that computes a comprehensive quality value for each Web platform by translating the Internet data source quality assessment problem into a multi-attribute decision analysis problem and deriving both subjective and objective weights via multi-objective programming. Experiments on e-commerce datasets show that, trained on ground-truth and expert-analysis data, the proposed CDQA method outperforms both FQA-based and SVR-based methods in precision.
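The weighted non-linear similarity measurement behind contribution 1 can be illustrated with a minimal sketch. The Jaccard token similarity, the logistic link, its steepness constant, and the example records are illustrative assumptions, not the actual IBJI formulation:

```python
import math

def feature_similarity(a, b):
    """Jaccard similarity over lowercase token sets (an illustrative choice)."""
    ta, tb = set(str(a).lower().split()), set(str(b).lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

def entity_similarity(e1, e2, weights):
    """Combine per-feature similarities with weights through a logistic
    (non-linear) link; features missing in either record are skipped and
    the remaining weights are renormalized."""
    num = den = 0.0
    for feat, w in weights.items():
        if feat in e1 and feat in e2:
            num += w * feature_similarity(e1[feat], e2[feat])
            den += w
    if den == 0.0:
        return 0.0
    s = num / den                                # weighted average in [0, 1]
    return 1 / (1 + math.exp(-10 * (s - 0.5)))   # sharpen around the decision point

# Two descriptions of (plausibly) the same product from different sources
e1 = {"name": "iphone 7 plus 128gb", "brand": "apple"}
e2 = {"name": "apple iphone 7 plus (128gb)", "brand": "apple"}
score = entity_similarity(e1, e2, {"name": 0.7, "brand": 0.3})
```

Here a pair scoring above a threshold (e.g. 0.5) would be treated as the same entity; IBJI additionally learns the weights iteratively rather than fixing them.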
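The reliability-aware truth inference of contribution 2 can be sketched as a much-simplified single-Gaussian EM loop over numeric claims. The per-source variance model, the update rules, and the toy weather data below are illustrative assumptions; they omit MCCGM's multi-objective constraints and composite structure:

```python
def em_truth_discovery(claims, iters=50):
    """claims: {source: {object: numeric value}}.
    E-step: estimate each object's truth as a precision-weighted mean of the
    claims (weight = 1 / source variance).
    M-step: re-estimate each source's error variance from its residuals.
    A small floor keeps variances strictly positive."""
    sources = list(claims)
    var = {s: 1.0 for s in sources}              # initial: all sources equally reliable
    objects = {o for s in sources for o in claims[s]}
    truth = {}
    for _ in range(iters):
        for o in objects:                        # E-step
            num = den = 0.0
            for s in sources:
                if o in claims[s]:
                    w = 1.0 / var[s]
                    num += w * claims[s][o]
                    den += w
            truth[o] = num / den
        for s in sources:                        # M-step
            resid = [(claims[s][o] - truth[o]) ** 2 for o in claims[s]]
            var[s] = max(sum(resid) / len(resid), 1e-6)
    return truth, var

# Toy conflicting temperature claims: A and C agree closely, B is noisy
claims = {
    "A": {"beijing": 20.1, "shanghai": 25.0, "xian": 18.0},
    "B": {"beijing": 24.0, "shanghai": 21.0, "xian": 15.0},
    "C": {"beijing": 20.0, "shanghai": 24.8, "xian": 18.2},
}
truth, var = em_truth_discovery(claims)
```

Unlike plain voting, the loop converges so that the estimated truths are pulled toward the mutually consistent sources, and the noisy source receives a larger learned variance (i.e., a smaller weight).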
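The normalization, kernel density estimation, and multi-attribute scoring steps of contribution 3 can be sketched as follows. The three example criteria, the fixed weights, and the simple additive scoring rule are illustrative assumptions, not the actual 14-criterion CDQA procedure:

```python
import math

def minmax(values, positive=True):
    """Maximum-difference normalization of one criterion; for cost-type
    ('negative') criteria the scale is flipped so higher is always better."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0] * len(values)
    scaled = [(v - lo) / (hi - lo) for v in values]
    return scaled if positive else [1 - s for s in scaled]

def gaussian_kde(samples, x, bandwidth=1.0):
    """Kernel density estimate at x using a Gaussian kernel."""
    k = lambda u: math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)
    return sum(k((x - s) / bandwidth) for s in samples) / (len(samples) * bandwidth)

def rank_sources(criteria_matrix, weights, positive_flags):
    """criteria_matrix[i][j]: value of criterion j for source i.
    Normalize each criterion column, then score each source by a weighted
    sum (a simple additive multi-attribute decision rule)."""
    cols = list(zip(*criteria_matrix))
    norm_cols = [minmax(list(c), p) for c, p in zip(cols, positive_flags)]
    norm_rows = list(zip(*norm_cols))
    return [sum(w * v for w, v in zip(weights, row)) for row in norm_rows]

# Three hypothetical sources scored on accuracy (+), error rate (-), coverage (+)
matrix = [[0.95, 0.02, 0.80],
          [0.70, 0.10, 0.90],
          [0.85, 0.05, 0.60]]
scores = rank_sources(matrix, [0.5, 0.3, 0.2], [True, False, True])
```

CDQA differs in that the weights themselves are derived via multi-objective programming and the per-criterion values come from kernel density estimates over the full-sample distribution, as in `gaussian_kde` above.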
Keywords/Search Tags: Multi-source heterogeneous data, Data quality, Entity consistency, Truth discovery, Source assessment