Research On Semi-Supervised Clustering Based On Transfer Learning And Its Parallel Implementation

Posted on: 2017-05-28
Degree: Master
Type: Thesis
Country: China
Candidate: Y Wang
GTID: 2308330485485380
Subject: Computer technology
Abstract/Summary:
Clustering is one of the important techniques in machine learning: it automatically groups similar objects into the same cluster, and valuable hidden information can be discovered from the clustering results. Cluster analysis is an unsupervised method. Semi-supervised clustering uses a small amount of known information to guide the clustering and thereby improve its performance.

Transfer learning is a relatively new machine learning method that is often used in text classification, image recognition, and similar tasks. It uses existing knowledge to help solve problems in related domains.

Distributed computing is one of the approaches used in cloud computing to process massive amounts of data. In recent years, Spark has become a very popular distributed computing framework that is widely used in machine learning and data mining.

Traditional clustering algorithms have difficulty exploiting existing historical information and tend to be less effective when the data are contaminated. Semi-supervised clustering algorithms are often used when the target data has some labeled examples. For the case where the source data has partially labeled samples, this thesis proposes a semi-supervised fuzzy possibilistic C-Means algorithm (SS-FPCM). Based on the transfer learning framework, we then present a transfer semi-supervised fuzzy possibilistic C-Means algorithm (TSS-FPCM) that avoids the negative transfer problem. Finally, to make full use of the information in the source data, representative points are used in place of the source-data classes, yielding an improved transfer semi-supervised fuzzy possibilistic C-Means algorithm (ITSS-FPCM). The experimental results show that the three algorithms use the source data effectively and can improve clustering performance compared with other clustering algorithms.
Moreover, the SS-FPCM and TSS-FPCM algorithms exploit partially labeled data from the source, while the ITSS-FPCM algorithm combines the labeled data with the "representative points", so the latter achieves better results when the target data has few samples or is contaminated.

The amount of data collected is constantly growing. As the source data set becomes larger and larger, traditional algorithms have difficulty quickly finding the needed information in vast amounts of data. To address the problem of large-scale data processing, we design and evaluate a distributed semi-supervised fuzzy possibilistic clustering algorithm (D-SS-FPCM) for clustering massive data on the Spark platform. Experiments show that D-SS-FPCM performs well on the Speedup, Sizeup and Scaleup indicators, which indicates that the algorithm has good parallel efficiency and scalability.
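
The abstract names the Speedup, Sizeup and Scaleup indicators without defining them. The sketch below assumes the definitions commonly used when evaluating parallel clustering algorithms; the thesis may define them differently. The function names and the timings in the usage example are hypothetical and are not results from the thesis.

```python
def speedup(t_one_node: float, t_m_nodes: float) -> float:
    """Same data set, run on 1 node and on m nodes.

    Assumed definition: T(1 node) / T(m nodes). The ideal value is m.
    """
    return t_one_node / t_m_nodes


def sizeup(t_base_data: float, t_scaled_data: float) -> float:
    """Same cluster, run on the base data and on data k times larger.

    Assumed definition: T(k * data) / T(data). Values at or below k
    suggest the algorithm handles data growth well.
    """
    return t_scaled_data / t_base_data


def scaleup(t_one_node_base: float, t_m_nodes_m_data: float) -> float:
    """1 node on the base data versus m nodes on m times the data.

    Assumed definition: T(1 node, data) / T(m nodes, m * data).
    The ideal value is 1.0.
    """
    return t_one_node_base / t_m_nodes_m_data


if __name__ == "__main__":
    # Hypothetical wall-clock times in seconds, for illustration only.
    print("speedup:", speedup(120.0, 40.0))
    print("sizeup:", sizeup(40.0, 100.0))
    print("scaleup:", scaleup(120.0, 150.0))
```

Under these assumed definitions, a speedup close to the number of nodes and a scaleup close to 1.0 are what "good parallel efficiency and scalability" would look like in practice.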
Keywords/Search Tags: clustering, semi-supervised, transfer learning, distributed computing, Spark