
Research Of Machine Learning Algorithms On Heterogeneous Data

Posted on: 2018-10-12  Degree: Master  Type: Thesis
Country: China  Candidate: J Y Fu
GTID: 2348330512486729  Subject: Computer system architecture
Abstract/Summary:
Traditional machine learning algorithms often rest on one fundamental assumption: that the data is homologous, that is, the training data and testing data are sampled from an identical probability distribution. However, homologous data is very scarce in the real world, and limited homologous data cannot train a strong machine learning model. This is the problem of homologous data scarcity. One way to solve this problem is to construct homologous data manually, but this method is too expensive. Another effective way is to integrate heterogeneous data with different distributions during training, so machine learning algorithms for heterogeneous data are very important. According to whether the sample space is the same, heterogeneous data can be divided into isomorphic data and non-isomorphic data.

To address homologous data scarcity, we can collect annotated samples by crowdsourcing. Each annotator is treated as a data source, and the data collected is isomorphic data. Machine learning on this kind of isomorphic data is called Learning from Crowds (LFC). According to the number of steps needed to obtain the target classifier, existing LFC approaches fall into two categories: two-stage approaches and direct approaches. The Personal Classifier (PC) approach is a representative direct approach. It has a convex objective function but makes strong assumptions about the distribution of the parameters. In this thesis, we propose a new non-parametric approach for learning from crowds, called the NP approach. The NP approach has a convex optimization formulation but makes no assumptions about the distribution of the parameters.

Another way to address homologous data scarcity is to use knowledge from an assist domain to help train the model in the target domain. The sample space and distribution of the data differ between domains, so the data is heterogeneous. Machine learning on this kind of non-isomorphic data is called Transfer Learning (TL). According to how knowledge is transferred, existing TL approaches fall into three categories: transferring knowledge of instances, transferring knowledge of feature representations, and transferring knowledge of model parameters. In this thesis, we propose one approach that transfers knowledge of model parameters and another that transfers knowledge of both model parameters and instances. Both approaches can use knowledge from the assist domain to improve the efficiency of the model in the target domain.
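
As an illustration of the two-stage category of learning-from-crowds approaches mentioned in the abstract, the following is a minimal Python sketch. It is not the NP approach or the PC approach from the thesis, whose details the abstract does not give. It assumes majority voting for the first stage (aggregating each sample's crowd labels) and a simple nearest-centroid classifier for the second stage. The function names, the annotator ids, and the toy data are all hypothetical.

```python
from collections import Counter


def majority_vote(annotations):
    """Stage 1: collapse each sample's crowd labels into one estimated label.

    annotations: list of dicts, one per sample, mapping annotator id -> label.
    Ties are broken by taking the smallest label (an arbitrary assumption).
    """
    labels = []
    for votes in annotations:
        counts = Counter(votes.values())
        top = max(counts.values())
        labels.append(min(lab for lab, c in counts.items() if c == top))
    return labels


def fit_centroids(X, y):
    """Stage 2: fit a nearest-centroid classifier on the aggregated labels."""
    sums, counts = {}, Counter()
    for x, lab in zip(X, y):
        acc = sums.setdefault(lab, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[lab] += 1
    return {lab: [v / counts[lab] for v in acc] for lab, acc in sums.items()}


def predict(centroids, x):
    """Return the label whose centroid is closest to x (squared distance)."""
    def sqdist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda lab: sqdist(centroids[lab]))


# Hypothetical toy data: four samples, up to three annotators each.
X = [[0.0, 0.1], [0.2, 0.0], [1.0, 0.9], [0.9, 1.1]]
annotations = [
    {"a1": 0, "a2": 0, "a3": 1},
    {"a1": 0, "a2": 0},
    {"a1": 1, "a2": 1, "a3": 1},
    {"a1": 1, "a2": 0, "a3": 1},
]

y_hat = majority_vote(annotations)   # stage 1: estimated labels
model = fit_centroids(X, y_hat)      # stage 2: classifier on those labels
```

A direct approach, by contrast, would learn the target classifier from the raw annotations in a single optimization rather than fixing the labels first.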
Keywords/Search Tags:Machine Learning, Heterogeneous Data, Isomorphic Heterogeneous Data, Non-isomorphic Heterogeneous Data, Learning from Crowds, Transfer Learning