
Fractal Analysis Of Datasets Using Distributed Computing

Posted on: 2017-08-16    Degree: Master    Type: Thesis
Country: China    Candidate: Iakovleva Tatiana    Full Text: PDF
GTID: 2348330566956137    Subject: Software Engineering
Abstract/Summary:
Clustering is a key task in advanced data analysis applications. Many clustering algorithms exist, but not all of them work well with large datasets. When clustering algorithms are applied to real databases, the number of dimensions can be very large; as the dimensionality grows, the amount of data to be processed grows rapidly, and so does the running time. This is the well-known curse of dimensionality, which makes dimensionality reduction very important for data mining.

First, I present a new algorithm for reducing the dimensionality of a big dataset, based on fractal theory and implemented with MapReduce, and I apply the algorithm to a big dataset to reduce its dimensionality.

As a second step, I consider a clustering algorithm to show how the results produced by my algorithm can be used. One well-known algorithm is Halite, a method for correlation clustering proposed by Robson L. F. Cordeiro, Caetano Traina and Christos Faloutsos. Halite is a fast and scalable density-based clustering algorithm for multi-dimensional data that is able to analyze large datasets. In this thesis I also propose two additions to the method: a) when detecting dense regions of space as candidates for cluster centers, apply Laplacian filters with special convolution-mask values to the cells located on the edges and corners of the space, as is done, for example, in photographic image processing; this eliminates false triggering of the filter at the borders and corners of the space. b) When analyzing the boundaries of a cluster and selecting its relevant axes with MDL (Minimum Description Length), do not use the suggested "average over all axes" model, which implies a relationship between the axes that may not exist, but instead use a discrete function with a threshold equal to the density probability of the analyzed region.

The main contribution of this work is an approach to reducing the dimensionality of a dataset using the concept of fractal dimension and distributed computing, together with suggestions for improving the grid-based clustering algorithm Halite.
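The abstract gives no pseudocode, so the sketch below only illustrates the fractal-dimension idea behind the proposed reduction step: a single-machine, Python/NumPy estimate of the box-counting dimension of a dataset. The function name box_counting_dimension, the choice of grid scales and the synthetic example are illustrative assumptions of mine, not the author's implementation; the thesis distributes the counting step with MapReduce, which is not shown here.

    import numpy as np

    def box_counting_dimension(points, scales=(2, 4, 8, 16, 32, 64)):
        """Estimate the box-counting (fractal) dimension of a point set.

        points is an (n_points, n_dims) array assumed to be normalised to the
        unit hypercube; scales are the grid resolutions (cells per axis) to try.
        """
        points = np.asarray(points, dtype=float)
        log_r, log_n = [], []
        for k in scales:
            # Assign every point to a grid cell of side 1/k and count the
            # distinct non-empty cells at this resolution.
            cells = np.clip(np.floor(points * k).astype(int), 0, k - 1)
            occupied = len({tuple(c) for c in cells})
            log_r.append(np.log(k))
            log_n.append(np.log(occupied))
        # The fractal dimension is the slope of log(occupied cells)
        # versus log(grid resolution).
        slope, _ = np.polyfit(log_r, log_n, 1)
        return slope

    if __name__ == "__main__":
        # A plane embedded in 3-D: the intrinsic (fractal) dimension should be
        # close to 2 even though the embedding dimension is 3 - the kind of gap
        # a fractal-based reduction step can exploit to discard attributes.
        rng = np.random.default_rng(0)
        xy = rng.random((20000, 2))
        data = np.column_stack([xy, 0.5 * xy[:, 0] + 0.5 * xy[:, 1]])
        print(box_counting_dimension(data))

One way such an estimate can support dimensionality reduction, consistent with the abstract's claim, is to drop attributes one at a time as long as the estimated fractal dimension of the remaining dataset stays roughly unchanged.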
Keywords/Search Tags: clustering, big datasets, dimensionality reduction, database, fractals, MDL, Laplacian mask