| With the rise of the mobile Internet, society has entered the era of big data. Data usually describes some property of an object, such as the height of Mount Everest, and can be collected from many different sources. However, not all sources are equally credible, and the data they provide inevitably contains noise, so the veracity of big data urgently needs to be assessed. Resolving data conflicts by manual labeling requires a great deal of time and manpower, which is unrealistic for massive datasets. Truth discovery, which aims to automatically identify correct information from multi-source data, has therefore emerged as an important fundamental research topic. At present, two aspects of truth discovery for data integration need improvement: (1) Truth discovery based on entity attribute correlation: various correlations exist between entity attributes, and these correlations affect the accuracy of truth discovery results. (2) Domain-aware truth discovery: the reliability of a source varies across domains, and modeling source reliability at a fine granularity can further improve the accuracy of truth discovery results. This paper uses theories, techniques, and methods from data mining to study these two issues systematically. The main research contents are as follows. First, for truth discovery under entity attribute correlation, this paper proposes GETD, a truth discovery model based on graph embedding relation perception. The relationships among the data are modeled by constructing four kinds of heterogeneous networks: source-source, source-entity attribute value, entity attribute-entity attribute, and entity attribute-entity attribute value networks. These networks are then embedded in a low-dimensional space so that reliable sources and reliable attribute values are close to each other, and the relationship between entity attributes is 
reflected in the attribute values, which makes it possible to infer the truth. Experimental results on two real-world datasets show that the GETD algorithm outperforms existing truth discovery algorithms. Second, for the domain-aware truth discovery problem, this paper proposes DTD, a domain-aware truth discovery model that represents the reliability of each source at a fine granularity. In addition, because the performance of existing truth discovery algorithms is limited by uniform initialization of source weights, this paper also proposes a fine-grained weight initialization method based on the richness of each source's domain information. Domain-aware truth discovery is formulated as an optimization problem in which the reliability of the source and the credibility of the declared value are the two unknown variables, and the objective function is the weighted distance between the declared values and the truth. To solve the optimization model, a two-step iterative update method is adopted: one step updates the source weights and the other updates the credibility of the declared values, with different loss functions used for different data types. Experimental results on two real-world datasets show that the DTD algorithm outperforms state-of-the-art truth discovery methods. Finally, a prototype truth discovery system is designed and developed. The system integrates the two algorithms proposed in this paper together with other truth discovery algorithms, and supports dataset upload, truth discovery algorithm selection, and download of truth discovery results. Users can upload datasets, select different truth discovery algorithms for data integration, and download the datasets once truth discovery is complete. |
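
As a rough illustration of the two-step iterative update described for DTD, the Python sketch below alternates between estimating truths and re-estimating one weight per (source, domain) pair. It is not the DTD implementation. The function names, the negative-log weight update, the variance normalization of the squared loss, and the coverage-based weight initialization (a simple stand-in for the richness of a source's domain information) are all assumptions made for this example, and the sample claims are invented.

```python
import math
from collections import defaultdict

EPS = 1e-9  # smoothing constant (assumption) to avoid log(0) and division by zero


def is_number(v):
    return isinstance(v, (int, float)) and not isinstance(v, bool)


def loss(value, truth, scale=1.0):
    """Squared loss for numeric values, 0-1 loss for categorical values."""
    if is_number(value) and is_number(truth):
        return (value - truth) ** 2 / scale
    return 0.0 if value == truth else 1.0


def aggregate(weighted_values):
    """Weighted mean for numeric claims, weighted vote for categorical claims."""
    if all(is_number(v) for _, v in weighted_values):
        total = sum(w for w, _ in weighted_values)
        return sum(w * v for w, v in weighted_values) / total
    votes = defaultdict(float)
    for w, v in weighted_values:
        votes[v] += w
    return max(votes, key=votes.get)


def discover_truths(claims, domain_of, n_iters=20):
    """claims: list of (source, entity, value); domain_of: entity -> domain."""
    by_entity = defaultdict(list)
    domain_entities = defaultdict(set)
    covered = defaultdict(set)
    for source, entity, value in claims:
        d = domain_of[entity]
        by_entity[entity].append((source, value))
        domain_entities[d].add(entity)
        covered[(source, d)].add(entity)

    # Hypothetical fine-grained initialization: share of a domain's entities
    # that the source makes a claim about.
    weights = {(s, d): len(ents) / len(domain_entities[d])
               for (s, d), ents in covered.items()}

    # Per-entity variance of numeric claims, used to normalize the squared loss.
    scale = {}
    for entity, obs in by_entity.items():
        vals = [v for _, v in obs if is_number(v)]
        if len(vals) == len(obs) and len(vals) > 1:
            mean = sum(vals) / len(vals)
            scale[entity] = sum((v - mean) ** 2 for v in vals) / len(vals) + EPS
        else:
            scale[entity] = 1.0

    truths = {}
    for _ in range(n_iters):
        # Step A: with weights fixed, re-estimate the truth of every entity.
        for entity, obs in by_entity.items():
            d = domain_of[entity]
            truths[entity] = aggregate([(weights[(s, d)], v) for s, v in obs])

        # Step B: with truths fixed, re-estimate each (source, domain) weight.
        losses = defaultdict(float)
        for entity, obs in by_entity.items():
            d = domain_of[entity]
            for s, v in obs:
                losses[(s, d)] += loss(v, truths[entity], scale[entity])
        domain_total = defaultdict(float)
        domain_count = defaultdict(int)
        for (s, d), l in losses.items():
            domain_total[d] += l
            domain_count[d] += 1
        for (s, d), l in losses.items():
            ratio = (l + EPS) / (domain_total[d] + EPS * domain_count[d])
            weights[(s, d)] = max(-math.log(ratio), 1e-6)  # keep weights positive
    return truths, weights


# Invented toy data: source C is poor on geography but good on sport.
claims = [
    ("A", "everest_height_m", 8848.86), ("B", "everest_height_m", 8848.0),
    ("C", "everest_height_m", 9000.0),
    ("A", "everest_range", "Himalayas"), ("B", "everest_range", "Himalayas"),
    ("C", "everest_range", "Andes"),
    ("A", "team_size", "11"), ("B", "team_size", "9"), ("C", "team_size", "11"),
]
domain_of = {"everest_height_m": "geo", "everest_range": "geo", "team_size": "sport"}
truths, weights = discover_truths(claims, domain_of)
print(truths)
```

Keeping one weight per (source, domain) pair rather than one per source is what lets source C be nearly ignored for geography while still counting fully for sport in this toy run.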