Research On Semi-Supervised Clustering Based On Transfer Learning And Its Parallel Implementation

Posted on:2017-05-28

Degree:Master

Type:Thesis

Country:China

Candidate:Y Wang

Full Text:PDF

GTID:2308330485485380

Subject:Computer technology

Abstract/Summary:

Clustering is one of the most important families of algorithms in machine learning: it automatically groups similar objects into the same cluster, and the resulting clusters can reveal valuable hidden information. Cluster analysis is an unsupervised method. Semi-supervised clustering uses a small amount of known information to guide the clustering and thereby improve its performance. Transfer learning is a newer machine learning approach, often used in text classification, image recognition, and similar tasks, that uses existing knowledge to help solve problems in related domains. Distributed computing is a class of cloud computing methods used to process massive data sets. In recent years, Spark has become a popular distributed computing framework that is widely used in machine learning and data mining.

Traditional clustering algorithms have difficulty exploiting existing historical information and tend to be less effective when the data are contaminated. Semi-supervised clustering is often used when the target data include some labeled examples. For the case where the source data are partially labeled, this thesis proposes a semi-supervised fuzzy possibilistic C-Means algorithm (SS-FPCM). Based on the transfer learning framework, it then presents a transfer semi-supervised fuzzy possibilistic C-Means algorithm (TSS-FPCM) that avoids the negative transfer problem. Finally, to make full use of the information in the source data, representative points are used to replace the classes of the source data, yielding an improved transfer semi-supervised fuzzy possibilistic C-Means algorithm (ITSS-FPCM). Experimental results show that, compared with other clustering algorithms, the three algorithms can use source data effectively to improve clustering performance. Moreover, SS-FPCM and TSS-FPCM exploit the partially labeled data from the source, whereas ITSS-FPCM combines the labeled data with the "representative points", so the latter achieves better results when the target data are few or contaminated.

The amount of data collected is constantly growing. As the source data set grows larger, traditional algorithms struggle to quickly find the needed information in vast amounts of data. To address large-scale data processing, the thesis designs and evaluates a distributed semi-supervised fuzzy possibilistic clustering algorithm (D-SS-FPCM) for clustering massive data on the Spark platform. Experiments show that D-SS-FPCM performs well on the Speedup, Sizeup, and Scaleup metrics, indicating that the algorithm has good parallel efficiency and scalability.
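The abstract names the Speedup, Sizeup, and Scaleup metrics without defining them. The sketch below uses the definitions commonly found in the parallel data mining literature, which is an assumption about what the thesis measures. The function names and all timing values are hypothetical placeholders for illustration, not results from the thesis.

```python
def speedup(t_one_node: float, t_m_nodes: float) -> float:
    """Fixed data size: runtime on 1 node divided by runtime on m nodes.
    The ideal value is m (linear speedup)."""
    return t_one_node / t_m_nodes


def sizeup(t_base_data: float, t_scaled_data: float) -> float:
    """Fixed node count: runtime on n-times-larger data divided by runtime
    on the base data. A value at or below n indicates good sizeup."""
    return t_scaled_data / t_base_data


def scaleup(t_base: float, t_scaled: float) -> float:
    """Runtime of (1 node, base data) divided by runtime of
    (m nodes, m-times-larger data). The ideal value is 1.0."""
    return t_base / t_scaled


# Hypothetical wall-clock timings in seconds (made up for this example).
t_1node_base = 120.0      # 1 node, base data set
t_4nodes_base = 40.0      # 4 nodes, same base data set
t_4nodes_4xdata = 150.0   # 4 nodes, data set 4 times larger

s = speedup(t_1node_base, t_4nodes_base)
print("speedup on 4 nodes:", s)
print("parallel efficiency:", s / 4)
print("sizeup for 4x data on 4 nodes:", sizeup(t_4nodes_base, t_4nodes_4xdata))
print("scaleup for 4 nodes and 4x data:", scaleup(t_1node_base, t_4nodes_4xdata))
```

Under these definitions, a speedup close to the number of nodes, a sizeup that grows no faster than the data, and a scaleup close to 1 would all support a claim of good parallel efficiency and scalability.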

Keywords/Search Tags:

clustering, semi-supervised, transfer learning, distributed computing, Spark

Related items

1	Research On Parallel Implementation Of Semi-Supervised Clustering
2	Research On Semi-supervised Clustering And Classification Algorithm
3	Research On Recommendation Algorithm Based On Semi-supervised AP Clustering And Adaptive Transfer Clustering
4	Multiple Kernel Learning Improved By Bi-objective Functions And Its Application To Semi-supervised Learning And Transfer Learning
5	Applications Research Of Supervised Intelligent Clustering And Classification Technologies
6	Semi Supervised Clustering Algorithm And Its Application And Research
7	Semi-supervised Learning On Text Data
8	Research On Transfer Learning Algorithm Based On Semi-supervised Tri-training
9	Research On A Semi-supervised Random Forest Classification Algorithm And Its Parallelization
10	Research On Two Clustering Algorithms Based On Semi-Supervised Learning