Research On Weighted Sampling Method In Large Scale Data Clustering

Posted on:2015-08-19

Degree:Master

Type:Thesis

Country:China

Candidate:X Li

Full Text:PDF

GTID:2298330422482033

Subject:Computer system architecture

Abstract/Summary:


We live in an information age in which data pervades every area of daily life. Clustering, an important area of data mining, is an unsupervised technique: data objects are grouped into clusters according to their properties, so that objects in the same cluster are similar and objects in different clusters are dissimilar. Clustering meets the needs of data processing and supports people's work.

However, with the development of information storage and processing technology over the past decade, the information explosion has created difficulties for traditional clustering algorithms. First, as data has diversified, it increasingly takes the form of text, audio, and video rather than numeric values, which makes the data hard to quantify. Meanwhile, datasets now often have hundreds or thousands of attributes and exhibit non-linear structure, which makes partitioning-based clustering methods no longer feasible. More importantly, clustering a large dataset requires more storage space and more time, which lowers the price-performance ratio of traditional clustering algorithms. This paper focuses on the running-time problem. It proposes a clustering algorithm based on weighted sampling that reduces running time while preserving the quality of the clustering result.

Parallel computing and dataset reduction are two solutions to the running-time problem. Clustering a smaller sampled subset clearly takes much less time than clustering the full dataset. At the same time, if the data objects in the subset are the more important ones, so that clustering them gives a result similar to that of the original dataset, the algorithm achieves its goal.

This paper defines these more important data objects, assigns a weight to each object in the original dataset, and designs several weight-update methods that increase the weights of the more important objects so that they have a higher probability of entering the sampled subset. Experiments show that clustering on this subset reduces running time while preserving the quality of the clustering result.
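
The pipeline described in the abstract (weight the objects, raise the weights of important ones, sample by weight, cluster the sample) might be sketched in Python as follows. This is only an illustration under stated assumptions, not the thesis's algorithm: the abstract does not say how "important" objects are identified or how weights are updated, so `boost_weights` takes the flagged indices as an input and uses a hypothetical multiplicative factor. The sampler uses the Efraimidis-Spirakis key method and the clustering step is plain Lloyd's k-means with farthest-first initialisation, both swapped in for illustration. All function names are hypothetical.

```python
import random

def boost_weights(weights, important, factor=3.0):
    """Hypothetical weight update: multiply the weight of each flagged index."""
    return [w * factor if i in important else w for i, w in enumerate(weights)]

def weighted_sample(points, weights, k, rng):
    """Draw k distinct points; larger positive weights are more likely kept.

    Uses Efraimidis-Spirakis keys u ** (1 / w) and keeps the k largest keys.
    """
    keyed = sorted(
        ((rng.random() ** (1.0 / w), i) for i, w in enumerate(weights)),
        reverse=True,
    )
    return [points[i] for _, i in keyed[:k]]

def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means with farthest-first initialisation."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(dist2(p, c) for c in centers))
        )
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: dist2(p, centers[j]))
            groups[nearest].append(p)
        # Recompute each center as the mean of its group; keep it if empty.
        centers = [
            tuple(sum(xs) / len(g) for xs in zip(*g)) if g else centers[j]
            for j, g in enumerate(groups)
        ]
    return centers

def make_blobs(rng, n_per_blob=100):
    """Toy data: two Gaussian blobs around (0, 0) and (10, 10)."""
    return [
        (rng.gauss(cx, 0.5), rng.gauss(cy, 0.5))
        for cx, cy in [(0.0, 0.0), (10.0, 10.0)]
        for _ in range(n_per_blob)
    ]

def cluster_on_sample(points, weights, sample_size, k, rng):
    """Cluster only a weighted sample instead of the full dataset."""
    subset = weighted_sample(points, weights, sample_size, rng)
    return sorted(kmeans(subset, k))

if __name__ == "__main__":
    rng = random.Random(0)
    data = make_blobs(rng)
    # Placeholder choice of "important" indices, for demonstration only.
    weights = boost_weights([1.0] * len(data), important={0, 150})
    print(cluster_on_sample(data, weights, sample_size=20, k=2, rng=rng))
```

On this toy data, clustering 20 sampled points instead of all 200 should still place the two centers near the blob centers, which is the trade-off the abstract describes: less work per clustering run, at the cost of depending on how well the weights steer the sample.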

Keywords/Search Tags:

Clustering algorithm, large-scale data, weighted sampling


Related items

1	The Research On Sampling For Data Mining
2	Study On Graph Sampling Algorithm For Graph Clustering Characteristic
3	Research On Spectral Clustering Algorithm Based On Weighted Ensemble Nyström Sampling
4	Application Of Clustering Based Sampling Algorithms In Unbalanced Data Learning
5	Research On Multi-view Multiple Clustering Algorithm Based On Sampling And Inverse Optimization
6	Research On Clustering Algorithms For Large-scale Complex Data
7	Research And Application Of Clustering Algorithm Based On Bigdata
8	Research On The Large Scale Clustering And Its Applications On Anomaly Detection
9	Application And Research On Clustering Algorithm In Large Scale Datasets
10	Application And Research On Clustering Algorithm In Large Scale High Dimensional Datasets