Data mining and machine learning, especially computer vision and natural language processing, play an important role in our daily life. Despite its many successful applications, supervised learning requires all training data to be manually labeled, which is both resource- and time-consuming. The problem becomes even more challenging in the context of data streams, where large volumes of data arrive at high speed and only a small fraction can be labeled. Semi-supervised learning (SSL) was therefore introduced to build learning models with better generalization ability from partially labeled data, and it has been applied to many tasks, including classification, clustering, regression analysis, and data stream mining. However, both empirical evidence and theoretical analysis have demonstrated that in some real-world applications, traditional semi-supervised approaches can perform even worse than using the labeled data alone (this phenomenon is referred to as the reliability of semi-supervised learning). Building a reliable semi-supervised model is therefore an important scientific task with meaningful application value.

To address the uncertainty and potential unreliability of unlabeled instances, this thesis aims to model the reliability of all unlabeled instances and to learn their importance weights, which are then integrated into the semi-supervised model. The basic expectation is to assign lower weights to irrelevant and unsafe instances that may degrade prediction performance, and higher weights to reliable instances. On the one hand, the key idea of existing weight-learning methods is to explore the prediction inconsistency or classification probability of classifier(s) to
characterize the importance of each instance. However, because current algorithms depend on specific classifiers, they usually focus only on the weights of points near the classifier's decision boundary; moreover, the correlations among instances and the global data structure are not well considered. On the other hand, existing algorithms fail to handle streaming data because of their model complexity.

To this end, this thesis proposes two reliable semi-supervised models, the ReSSL algorithm and the RP algorithm, which learn the weights of unlabeled instances. Specifically, ReSSL models instance reliability by measuring the consistency between the cluster assumption and the intrinsic data structure, and it integrates cluster-level information with the distribution information of the labeled data for prediction in a KNN-style framework. The basic idea of RP is to take local label regularity as a prior and then perform reliability propagation on an adaptively constructed graph. In addition, a distributed RP algorithm is introduced to scale to large volumes of data. For reliable semi-supervised online learning, two further algorithms are proposed: the ReSSL Stream algorithm and the BLS algorithm.

In summary, the main contributions of this thesis are as follows. (1) Two novel algorithms are proposed to learn the reliability weight of each unlabeled instance, which can also serve as usability-detection methods. (2) Building on these instance reliabilities, two robust semi-supervised models are proposed for the classification task. (3) To handle partially labeled data streams, two robust online semi-supervised models are also proposed. Finally, experiments are performed on both static and streaming data sets to demonstrate the effectiveness of all the proposed methods, and the regret bounds of the online algorithms are analyzed theoretically.
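To make the ReSSL idea concrete, the following is a minimal toy sketch (not the thesis's actual algorithm; the function name, data, and scoring rule are all hypothetical) of how a reliability weight can measure the consistency between a cluster assumption and the local structure of the labeled data: an unlabeled point whose nearest labeled neighbors agree with the majority label of its cluster receives a high weight, while a boundary point with a mixed neighborhood receives a lower one.

```python
import numpy as np
from collections import Counter

def cluster_consistency_weights(X_u, X_l, y_l, cluster_u, cluster_l, k=3):
    """Hypothetical reliability score for each unlabeled point: the fraction
    of its k nearest labeled neighbors whose label matches the majority label
    of the point's cluster. Boundary/mixed points score lower."""
    # Majority label of each cluster, estimated from the labeled points only.
    majority = {}
    for c in set(cluster_l):
        labels = [y for y, cc in zip(y_l, cluster_l) if cc == c]
        majority[c] = Counter(labels).most_common(1)[0][0]
    weights = []
    for x, c in zip(X_u, cluster_u):
        d = ((X_l - x) ** 2).sum(axis=1)        # squared distances to labeled set
        nbrs = np.argsort(d)[:k]                # k nearest labeled neighbors
        # Clusters with no labeled points get weight 0 (majority.get(c) is None).
        agree = np.mean([y_l[j] == majority.get(c) for j in nbrs])
        weights.append(agree)
    return np.array(weights)
```

For example, with two well-separated labeled clusters, a point deep inside one cluster gets weight 1.0, while a point between the clusters gets a strictly smaller weight, so a downstream KNN-style classifier can discount it.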
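The reliability-propagation idea behind RP can likewise be illustrated with a small sketch (again purely illustrative, with hypothetical names and parameters, not the algorithm from the thesis): a prior reliability vector, standing in for the local-label-regularity prior, is diffused over a normalized kNN affinity graph by the standard iteration w ← αSw + (1−α)w0, so that points structurally close to reliable points inherit higher weights.

```python
import numpy as np

def reliability_propagation(X, prior, k=3, alpha=0.8, iters=50):
    """Propagate a prior reliability vector over a kNN graph (toy sketch)."""
    n = len(X)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(d2[i])[1:k + 1]                # k nearest, excluding self
        W[i, nbrs] = np.exp(-d2[i, nbrs])                # RBF affinities
    W = np.maximum(W, W.T)                               # symmetrize the graph
    D = W.sum(axis=1)                                    # degrees (> 0: each node has k edges)
    S = W / np.sqrt(np.outer(D, D))                      # normalized affinity matrix
    w = prior.copy()
    for _ in range(iters):                               # w <- alpha*S*w + (1-alpha)*prior
        w = alpha * (S @ w) + (1 - alpha) * prior
    return w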