Research On Weighted Sampling Method In Large Scale Data Clustering

Posted on: 2015-08-19
Degree: Master
Type: Thesis
Country: China
Candidate: X Li
GTID: 2298330422482033
Subject: Computer system architecture
Abstract/Summary:
We live in an information age in which data pervades every area of daily life. Clustering, one of the important fields of data mining, is an unsupervised technique: data objects are grouped into clusters according to their properties, so that objects in the same cluster are similar and objects in different clusters are dissimilar. Clustering meets the needs of data processing and supports people's work.

However, with the development of information storage and processing technology over the past decade, the information explosion has created difficulties for traditional clustering algorithms. First, as data diversifies, it increasingly takes the form of text, audio, and video rather than numeric values, which makes the data hard to quantify. Second, because the number of attributes is now often in the hundreds or thousands, datasets exhibit non-linear structure, which makes partitioning-based clustering methods no longer feasible. More importantly, clustering a large dataset requires more storage space and more time, which lowers the price-performance ratio of traditional clustering algorithms. This thesis focuses on the problem of time consumption. It proposes a clustering algorithm based on weighted sampling that reduces running time while preserving the quality of the clustering result.

Parallel computing and dataset reduction are two solutions to the time-consumption problem. Clustering on a smaller sampled subset can significantly reduce running time. At the same time, if the data objects in the subset are guaranteed to be the more important ones, so that the clustering result on the subset is similar to the result on the original dataset, the algorithm achieves its goal.

This thesis defines these more important data objects and assigns a weight to each object in the original dataset. It then designs several weight-update methods that increase the weights of the more important objects so that they have a higher probability of entering the sampled subset. Experiments show that clustering on this subset reduces running time while preserving the quality of the clustering result.
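
The sample-then-cluster idea in the abstract can be illustrated with a minimal sketch. The abstract does not say how importance is defined or how the weights are updated, so everything here is an assumption: the weights are hypothetical inputs, the sampler uses Efraimidis-Spirakis keys (a standard way to draw a weighted sample without replacement, not necessarily the thesis's method), and a plain k-means stands in for whatever clustering algorithm the thesis uses. The names `weighted_sample` and `kmeans` are illustrative only.

```python
import math
import random


def weighted_sample(weights, sample_size, seed=0):
    """Return indices of sample_size distinct items, favouring large weights.

    Efraimidis-Spirakis keys: each item gets key u ** (1 / w) with u uniform
    in [0, 1), and the items with the largest keys are kept. Items with a
    non-positive weight are never selected.
    """
    rng = random.Random(seed)
    keyed = []
    for i, w in enumerate(weights):
        if w > 0:
            keyed.append((rng.random() ** (1.0 / w), i))
    if sample_size > len(keyed):
        raise ValueError("not enough positively weighted items")
    keyed.sort(reverse=True)
    return [i for _, i in keyed[:sample_size]]


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd-style k-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: attach each point to its nearest center.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        # Update step: move each non-empty center to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = [sum(c) / len(members) for c in zip(*members)]
    return centers, labels


# Toy data: two small groups; the weights are hypothetical, not from the thesis.
points = [(0.0, 0.0), (0.0, 1.0), (0.5, 0.5),
          (10.0, 10.0), (10.0, 11.0), (10.5, 10.5)]
weights = [1.0, 1.0, 3.0, 1.0, 1.0, 3.0]

subset_idx = weighted_sample(weights, sample_size=4)
subset = [points[i] for i in subset_idx]
centers, labels = kmeans(subset, k=2)  # cluster only the sampled subset
```

Because the clustering step touches only the sampled subset, its cost depends on the sample size and not on the size of the original dataset. How well the result matches clustering on the full data depends on how the weights are chosen, which is the part this sketch leaves open.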
Keywords/Search Tags: clustering algorithm, large-scale data, weighted sampling