Research On Key Technology Of Data Stream Processing

Posted on:2018-11-17

Degree:Master

Type:Thesis

Country:China

Candidate:J L Liao

GTID:2348330536979709

Subject:Information networks

Abstract/Summary:

In recent years, a large number of applications have produced continuous flows of data, and the importance of these data has gradually been recognized. Stream data mining has become a new research area, and data stream clustering analysis is an essential part of it. How to cluster data streams efficiently, so as to handle high-dimensional data streams and scale to large data sets, has become both a hotspot and a difficult problem in current research. This thesis focuses on clustering algorithms based on K-Means and on online data stream clustering.

The traditional K-Means clustering algorithm randomly picks k data objects from the data set as the initial cluster centers, which makes the clustering results random as well. This thesis proposes an improved clustering algorithm that determines appropriate initial cluster centers according to the characteristics of the data set. The algorithm avoids the randomness and blindness of center selection in K-Means and therefore removes the randomness from the clustering results; it also improves the clustering process of traditional K-Means. When the improved K-Means algorithm is applied to the MapReduce framework, the whole data set is first divided into several smaller data sets, each of which is clustered in the map phase. All the intermediate clustering results from the map phase are then collected and clustered again in the reduce phase. Simulation experiments demonstrate that the proposed algorithm greatly improves the performance of this type of clustering algorithm.

This thesis also presents a fuzzy data stream clustering algorithm based on the TEDA (Typicality and Eccentricity Data Analytics) model. TEDA is often used to detect outlying data samples so as to obtain better clustering results. To meet the requirements of online fuzzy data stream clustering and real-time response, the algorithm inherits the concepts of eccentricity and typicality, together with their associated formulas, from TEDA to estimate whether a given data sample belongs to one or more specific clusters. It automatically creates, updates and merges data clusters, can handle high-dimensional data streams, and requires no parameters to be defined in advance. Compared with traditional clustering algorithms, it does not need to store the data samples it has already scanned, uses memory efficiently, and has a low computational cost; its recursive mechanism makes it better suited to online real-time applications. Simulation results show that the proposed algorithm outperforms traditional algorithms and produces satisfactory cluster analyses of real data.
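The two-stage K-Means scheme described in the abstract (cluster each subset in the map phase, then cluster the collected intermediate centers in the reduce phase) can be illustrated with a minimal single-process sketch. The abstract does not state how the thesis chooses its initial centers, so the farthest-point seeding below is only a deterministic stand-in; all function names are hypothetical, and the reduce step treats every intermediate center equally rather than weighting it by cluster size.

```python
def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_point_init(points, k):
    """Deterministic seeding (a stand-in, NOT the thesis's rule):
    start from the point nearest the data mean, then repeatedly add
    the point farthest from the centers chosen so far."""
    dim = len(points[0])
    mean = [sum(p[d] for p in points) / len(points) for d in range(dim)]
    centers = [min(points, key=lambda p: dist2(p, mean))]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(dist2(p, c) for c in centers))
        )
    return [list(c) for c in centers]


def kmeans(points, k, iters=20):
    """Plain Lloyd iterations starting from the deterministic seeds."""
    centers = farthest_point_init(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist2(p, centers[i]))
            groups[nearest].append(p)
        # An empty group keeps its previous center.
        new_centers = [
            [sum(coord) / len(g) for coord in zip(*g)] if g else centers[i]
            for i, g in enumerate(groups)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers


def mapreduce_kmeans(points, k, n_chunks):
    """Map: cluster each subset. Reduce: cluster the intermediate centers."""
    chunks = [points[i::n_chunks] for i in range(n_chunks)]
    intermediate = [c for chunk in chunks for c in kmeans(chunk, k)]
    return kmeans(intermediate, k)
```

Because the seeding is deterministic, repeated runs on the same input return the same centers, which is the property the abstract attributes to the improved algorithm. In a real MapReduce job the per-chunk calls would run in separate mappers.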
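The recursive mechanism mentioned in the abstract, which avoids storing scanned samples, can be illustrated with the eccentricity recursion commonly given for TEDA in the literature. This is a sketch under assumptions: the abstract does not reproduce the thesis's formulas, so the Euclidean-distance form, the class name `TedaState`, and the threshold multiplier `m` are illustrative choices, and the sketch tracks a single cluster rather than the full create, update and merge logic.

```python
class TedaState:
    """Running mean and variance of one cluster, updated per sample.

    Assumed recursion (standard TEDA form, not taken from the thesis):
      mean_k = (k-1)/k * mean_{k-1} + x_k / k
      var_k  = (k-1)/k * var_{k-1} + ||x_k - mean_k||^2 / (k-1)
      ecc_k  = 1/k + ||mean_k - x_k||^2 / (k * var_k)
    """

    def __init__(self):
        self.k = 0
        self.mean = None
        self.var = 0.0

    def update(self, x):
        """Absorb sample x and return its eccentricity (None for k == 1)."""
        self.k += 1
        k = self.k
        if k == 1:
            self.mean = [float(v) for v in x]
            return None  # eccentricity is undefined for a single sample
        self.mean = [(k - 1) / k * m + v / k for m, v in zip(self.mean, x)]
        d2 = sum((v - m) ** 2 for v, m in zip(x, self.mean))
        self.var = (k - 1) / k * self.var + d2 / (k - 1)
        if self.var == 0.0:
            return 1.0 / k  # all samples so far are identical
        return 1.0 / k + d2 / (k * self.var)

    def is_outlier(self, ecc, m=3.0):
        """Chebyshev-style test on the normalized eccentricity ecc / 2."""
        return ecc / 2.0 > (m * m + 1.0) / (2.0 * self.k)


def typicality(ecc):
    """Typicality is the complement of eccentricity."""
    return 1.0 - ecc
```

Only the sample count, the mean vector and one scalar variance are kept, so memory use is independent of the number of samples seen. Note that the outlier test is only informative once `k` exceeds `m * m + 1`, because the normalized eccentricity never exceeds 0.5.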

Keywords/Search Tags:

TEDA, eccentricity, typicality, cluster, K-Means, MapReduce framework

Related items

1	Research On Parallel Sampling K-Means Algorithm Based On MapReduce
2	The Design And Implementation Of A MapReduce Computing Framework Based On GPU Cluster
3	Improved K-means Clustering Algorithm Based On MapReduce Framework
4	Research On Calculation Method Of Vertex Eccentricity Over Large Graphs
5	Mapreduce and Heterogeneity: Power-Aware Bag-of-Tasks, Framework Parameter Sensitivity, and Dynamic Cluster Aware Framework Configuratio
6	Research On Parallelization Of Clustering Algorithm Based On MapReduce
7	Research And Implementation Of Expansibility Oriented Cluster Architecture
8	Design Of Mapreduce Task Scheduling Algorithms In Heterogeneous Hadoop Cluster
9	Research On K-Means Algorithm Based On MapReduce
10	Research On Accelerating Of K-means Clustering Algorithm Using FPGA Based On MapReduce