Research And Application Of Semi-Supervised Learning Algorithms Based On "Collaborative-Participatory" Computational Cognition Model

Posted on: 2010-10-15
Degree: Doctor
Type: Dissertation
Country: China
Candidate: C Deng
Full Text: PDF
GTID: 1118360302465557
Subject: Artificial Intelligence and information processing
Abstract/Summary:
Machine learning is one of the core topics in data mining and pattern recognition. With the development of the internet and information technology, huge amounts of data have rapidly accumulated. Supervised learning needs a large number of labeled examples as its training set. However, in practical applications such as computer-aided diagnosis (CAD) of medical images, unlabeled data are readily available, while labeled data are expensive to obtain because they require human effort. Unsupervised learning can learn without any supervision, but the learned hypothesis is not precise enough. Therefore, semi-supervised learning (SSL), which combines a large amount of unlabeled data with a limited number of labeled examples, has become a hot topic.

Existing SSL algorithms attempt to exploit the additional information provided by the large amount of unlabeled data to guide the learning process and enhance the final performance. At present, however, there is an urgent challenge: the safe usage of unlabeled data, since misleading information within the additional information is inevitable and may degrade performance. In addition, in practical large-scale learning tasks, the memory-resident, serial implementation of existing algorithms creates memory and efficiency bottlenecks when loading and computing over large-scale unlabeled data, which restricts their application.

With respect to the first challenge, this thesis builds a computational cognition model according to the semi-supervised characteristics of human learning and, based on this computational model, focuses on methods for adaptively identifying misleading and helpful information. With respect to the second challenge, this thesis investigates the parallelization of semi-supervised learning algorithms on commodity PC clusters using the MapReduce paradigm, a successful parallel technique based on the idea of data splits.

The major innovative contributions are as follows:

(1) Propose a computational cognition model to improve the safety of SSL
Following cognitive psychology, a computational model of collaborative learning is built to capture the teamwork character of human learning. The participatory learning computational model is then integrated, and a novel computational model named "collaborative-participatory learning" is proposed to capture the semi-supervised characteristics of human learning. The new model consists of a shared-knowledge pool, an acceptance unit and a critic unit. The ways in which the shared-knowledge pool is updated and used determine the strategies for mining and using the additional information from unlabeled data; the acceptance unit filters the additional information fetched from the shared-knowledge pool; the critic unit uses an arousal mechanism to measure precisely the effectiveness of the acceptance unit and to inhibit invalid or wrong discrimination of the additional information. This thesis uses this computational model to dissect co-training-style SSL algorithms and obtains a strategy for improving the safety of SSL: a co-training-style SSL algorithm can incorporate the acceptance unit and critic unit to identify and filter the misleading information when it gets additional information from the shared-knowledge pool.

(2) Propose a new type of semi-supervised clustering algorithm based on the "collaborative-participatory" computational cognition model
Semi-supervised clustering algorithms often use a seed set consisting of a small amount of labeled data to initialize cluster centroids, and hence improve clustering performance over the whole data set. Both the scale and quality of the seed set directly restrict the performance of a semi-supervised clustering algorithm. Based on the "collaborative-participatory" model, this thesis proposes a new type of clustering algorithm that automatically obtains a large-scale, low-noise seed set.
First, the tri-training iteration is used as the mechanism for the shared-knowledge pool, and Depuration, a data editing technique based on the nearest neighbor rule, is used as the acceptance mechanism. Then, before the seed set is used to initialize cluster centroids, the new algorithm uses the tri-training process to label unlabeled data and adds them to the initial seed set to enlarge it. Meanwhile, to improve the quality of the enlarged seed set, Depuration in the acceptance unit is used to eliminate or correct mislabeled noisy data in the enlarged seed set. Experiments show that the novel algorithm effectively improves cluster centroid initialization and clustering performance.

(3) Propose new co-training-style semi-supervised classification algorithms based on the "collaborative-participatory" computational cognition model
Ensemble-like co-training-style SSL algorithms train N (N > 2) classifiers on the initial labeled data, and then use the ensemble of N-1 classifiers to label the unlabeled data, providing additional information for retraining the Nth classifier. However, because the initial limited number of labeled data is not sufficient to obtain accurate classifiers, many mislabeled data inevitably appear in the additional information and become misleading information. To improve the safety of using unlabeled data, this thesis adaptively identifies and removes the mislabeled data by incorporating the acceptance unit and critic unit into the co-training iteration, so that the generalization ability of the final hypothesis is ensured under various cases. In detail, the acceptance unit is defined as RemoveOnly data editing, and an arousal strategy in the critic unit is used to measure the positive and negative effects of RemoveOnly and adaptively control its activation. For two typical algorithms, namely Tri-training, which is based on three classifiers, and Co-Forest, which is based on multiple decision trees, this thesis proposes corresponding new algorithms based on adaptive data editing. Experiments show that the two new algorithms have better generalization ability and stability than the original ones.

(4) Prove a series of theorems on improving the generalization ability for the adaptive data editing strategy
The adaptive strategy is a combination of a series of precondition theorems that together ensure, under PAC theory, that the classification error is reduced while the scale of the new training set increases iteratively. This thesis also provides proofs of all these precondition theorems.

(5) Propose MapReduce-paradigm-based parallel SSL algorithms for large-scale learning tasks
To address the bottlenecks of memory and computing, this thesis uses the MapReduce paradigm so that the original serial SSL algorithms can be implemented in exact parallel form on clusters of commodity PCs. In detail, following the idea of divide, conquer in parallel, and summarize, the high-throughput computations in the SSL process are adapted to parallel map functions and reduce functions. In theory, the relative running-time scaling of the MapReduce-based algorithms can be linear in the number of PC nodes. The practical application to lung nodule detection in CT medical images shows that the relative running-time scaling of the MapReduce-based SSL algorithms is close to linear.
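
The map/reduce adaptation described in contribution (5) can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the three threshold classifiers, the function names (`split`, `map_label`, `reduce_merge`, `label_parallel`) and the toy data are all assumptions made for illustration, and the map step runs sequentially here, whereas on a cluster each split would be sent to a different node. The step being parallelized is the tri-training-style labeling from contribution (3), in which the other two classifiers label an example for the third when they agree.

```python
from functools import reduce

# Three toy 1-D threshold classifiers standing in for trained models (hypothetical).
CLASSIFIERS = [
    lambda x: int(x > 0.4),
    lambda x: int(x > 0.5),
    lambda x: int(x > 0.6),
]

def split(data, n_splits):
    """Cut the unlabeled data into at most n_splits contiguous chunks."""
    size = max(1, (len(data) + n_splits - 1) // n_splits)
    return [data[i:i + size] for i in range(0, len(data), size)]

def map_label(chunk):
    """Map step: for each classifier i, collect the examples on which
    the other two classifiers agree, paired with the agreed label."""
    out = {i: [] for i in range(len(CLASSIFIERS))}
    for x in chunk:
        votes = [h(x) for h in CLASSIFIERS]
        for i in range(len(CLASSIFIERS)):
            others = [v for j, v in enumerate(votes) if j != i]
            if len(set(others)) == 1:
                out[i].append((x, others[0]))
    return out

def reduce_merge(acc, part):
    """Reduce step: concatenate the per-classifier lists of two partial results."""
    return {i: acc[i] + part[i] for i in acc}

def label_parallel(unlabeled, n_splits):
    """Label each split independently, then merge the partial results."""
    empty = {i: [] for i in range(len(CLASSIFIERS))}
    parts = map(map_label, split(unlabeled, n_splits))
    return reduce(reduce_merge, parts, empty)

unlabeled = [0.1, 0.45, 0.55, 0.7, 0.3, 0.9, 0.52]
serial = map_label(unlabeled)                    # one pass over all the data
parallel = label_parallel(unlabeled, n_splits=3) # split, map, reduce
print(parallel == serial)
```

Because each example is labeled independently of the others and the splits are contiguous, merging the partial results in order reproduces the serial output, which is the sense in which such a parallel version can be exact. Steps that need global state, such as retraining a classifier, would need their own map and reduce design.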

Keywords/Search Tags: Semi-supervised learning, Co-training, Computational cognition model, PAC learning theory, MapReduce paradigm