
Research On Key Technologies Of Multi-Source Heterogeneous Data Fusion

Posted on: 2021-05-14
Degree: Doctor
Type: Dissertation
Country: China
Candidate: C Feng
Full Text: PDF
GTID: 1368330605981265
Subject: Software engineering
Abstract/Summary:
With the rapid development of big data technology, data fusion, which is grounded in machine learning theory and supported by sensing data, has become an active research field and is widely used in smart city systems such as smart healthcare, smart home, and smart transportation. As the volume of sensing data grows, so do the differences in data types, data relationships, and data quality. In addition, there are large amounts of unlabeled data, sparse data areas, and domain knowledge. Furthermore, the problem of distributed multi-source heterogeneous data fusion, which arises from data privacy, data security, and transmission restrictions, cannot be ignored. This thesis studies four key problems of multi-source heterogeneous data fusion: single-model data fusion, structured data fusion, cross-domain knowledge fusion, and data fusion in a distributed environment. The proposed methods are verified on real-world data. The results achieved are as follows:

1. To solve the problems of multi-source heterogeneous data fusion, this thesis proposes a random-forest-based algorithm called MCS-RF. It is a single model that combines an offline semi-supervised random forest with an online semi-supervised random forest, and it addresses the problems caused by heterogeneous, sparse, and unlabeled data in unstructured multi-source heterogeneous data fusion. To verify its effectiveness, fine-grained real-time PM2.5 inference in Beijing is taken as an example. The experimental results show that MCS-RF can effectively fuse multi-source heterogeneous data and improve inference accuracy.

2. To solve the problems of multi-source heterogeneous data fusion, this thesis proposes a multi-source heterogeneous data fusion algorithm based on ensemble learning. Unlike MCS-RF, this algorithm trains on the data by constructing multiple independent sub-models. It analyzes and models data features that are common in urban sensing data, such as time-series attributes, spatial topology, and real-time data. The sub-models are combined through a neural network. To verify the effectiveness of the algorithm, fine-grained air quality estimation in Beijing is carried out on urban sensing data. The experimental results show that the algorithm can effectively exploit the features of multi-source heterogeneous data and improve inference accuracy.

3. To solve the problem of fusing cross-domain knowledge with data, this thesis proposes a cross-domain knowledge fusion algorithm based on machine learning. The algorithm approximates the domain knowledge model and uses the data to train and solve for the parameters of the approximate model, which addresses the problem of deploying the domain knowledge model on urban sensing data. Air quality prediction is taken as an example to verify the effectiveness of the algorithm. The experimental results show that the algorithm can effectively use cross-domain knowledge and improve prediction accuracy.

4. To solve the problem of data fusion in a fog computing environment, this thesis proposes a multi-source heterogeneous data fusion mechanism that comprises a local heterogeneous data fusion system and a centralized homogeneous data training system. The mechanism iteratively optimizes the model using a parameter averaging method based on data volume and data quality. Environmental monitoring in a fog computing environment is taken as an example to verify the effectiveness of the mechanism. In the experiment, urban sensing data are partitioned to simulate the data distributions of a fog computing environment, and the mechanism is verified on both independent and identically distributed (IID) data and non-IID data. The experimental results show that the mechanism achieves high-precision model training without data sharing and can address data sparsity, model overfitting, data heterogeneity, and model heterogeneity.
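
The parameter averaging step mentioned in result 4 can be illustrated with a minimal sketch. The abstract does not give the weighting formula, so everything here is an assumption made for illustration: each node's weight is taken to be its data volume multiplied by a quality score in [0, 1], normalized across nodes, and the names `weighted_average`, `volumes`, and `qualities` are hypothetical rather than taken from the thesis.

```python
from typing import Dict, List

Params = Dict[str, List[float]]


def weighted_average(
    params: List[Params],
    volumes: List[float],
    qualities: List[float],
) -> Params:
    """Average per-node model parameters.

    ASSUMPTION (not from the thesis): node k gets weight
    volumes[k] * qualities[k], normalized so the weights sum to 1.
    """
    if not params or not (len(params) == len(volumes) == len(qualities)):
        raise ValueError("params, volumes and qualities must be non-empty and equal in length")

    raw = [v * q for v, q in zip(volumes, qualities)]
    total = sum(raw)
    if total <= 0:
        raise ValueError("at least one node must have positive volume and quality")
    weights = [r / total for r in raw]

    averaged: Params = {}
    for name, reference in params[0].items():
        # Element-wise weighted sum of this parameter across all nodes.
        averaged[name] = [
            sum(w * p[name][i] for w, p in zip(weights, params))
            for i in range(len(reference))
        ]
    return averaged


if __name__ == "__main__":
    # Two hypothetical fog nodes that trained the same model locally.
    node_a = {"w": [1.0, 2.0], "b": [0.0]}
    node_b = {"w": [3.0, 4.0], "b": [1.0]}
    merged = weighted_average([node_a, node_b], volumes=[100, 300], qualities=[1.0, 1.0])
    print(merged)
```

In an iterative scheme of this kind, the averaged parameters would be sent back to the nodes as the starting point for the next round of local training, so raw data never leaves a node. A node with more data or higher-quality data pulls the shared model further toward its local solution.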
Keywords/Search Tags:big data, multi-source heterogeneous data fusion, knowledge fusion, neural network, machine learning