
Research On Key Technology Of Data Stream Processing

Posted on: 2018-11-17  Degree: Master  Type: Thesis
Country: China  Candidate: J L Liao  Full Text: PDF
GTID: 2348330536979709  Subject: Information networks
Abstract/Summary:
In recent years, a large number of applications have produced a steady flow of data, and people have gradually realized the importance of these data. Stream data mining has become a new research area, and data stream clustering analysis is an essential part of it. How to cluster data streams efficiently, so as to handle high-dimensional data streams and adapt to large data sets, has become both a hotspot and a difficult problem in current research. This thesis focuses on a clustering algorithm based on K-Means and on an online data stream clustering algorithm.

The traditional K-Means clustering algorithm randomly picks k data objects from the data set as the initial cluster centers, so its clustering results are also random. This thesis proposes an improved clustering algorithm that determines appropriate cluster centers according to the characteristics of the data set. The algorithm avoids the randomness and blindness of center selection in K-Means, and therefore solves the problem of randomness in the clustering results. It also improves the clustering process of the traditional K-Means algorithm. When the improved K-Means algorithm is applied to the MapReduce framework, the whole data set is first divided into several smaller data sets, each of which is clustered separately; this corresponds to the map phase of the MapReduce framework. All the intermediate clustering results from the map phase are then collected and clustered in the reduce phase. Simulation experiments demonstrate that the proposed algorithm greatly enhances the performance of this type of clustering algorithm.

This thesis also presents a fuzzy data stream clustering algorithm based on the TEDA (Typicality and Eccentricity Data Analytics) model. The TEDA model is often used to detect outlier data samples in order to obtain better clustering results. To meet the requirements of online fuzzy data stream clustering and real-time response, the algorithm inherits the concepts of eccentricity and typicality, and their associated formulas, from TEDA to estimate whether a given data sample belongs to one or more specific clusters, and the affected clusters are then updated automatically. It can automatically create, update and merge data clusters, can deal with high-dimensional data streams, and does not require parameters to be defined in advance. Compared with traditional clustering algorithms, it does not need to store the data samples it has already scanned, uses memory efficiently, has a low computational cost, and is better suited to online real-time applications because of its recursive mechanism. The simulation results show that the proposed algorithm is superior to traditional algorithms and produces satisfactory cluster analysis of real data.
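
The map and reduce steps described in the abstract can be illustrated with a minimal single-process sketch. It is not the thesis's implementation: the abstract does not say how the improved algorithm chooses its centers, so the sketch assumes a deterministic farthest-first seeding as a stand-in, and all function names (`farthest_first_centers`, `map_phase`, `reduce_phase`, `mapreduce_kmeans`) are hypothetical. The reduce step here also treats every intermediate center equally and ignores cluster sizes, which is a simplification.

```python
def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))

def farthest_first_centers(points, k):
    # Assumption: deterministic seeding used as a stand-in for the
    # thesis's (unspecified) data-driven choice of initial centers.
    centre = mean(points)
    centers = [min(points, key=lambda p: dist2(p, centre))]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(dist2(p, c) for c in centers)))
    return centers

def kmeans(points, k, iters=20):
    """Plain Lloyd iterations starting from the deterministic seeds."""
    centers = farthest_first_centers(points, k)
    for _ in range(iters):
        groups = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)), key=lambda j: dist2(p, centers[j]))
            groups[i].append(p)
        new = [mean(g) if g else c for g, c in zip(groups, centers)]
        if new == centers:
            break
        centers = new
    return centers

def map_phase(chunk, k):
    # Map: cluster one smaller data set, emit its k centers.
    return kmeans(chunk, k)

def reduce_phase(intermediate_centers, k):
    # Reduce: cluster the collected intermediate centers.
    return kmeans(intermediate_centers, k)

def mapreduce_kmeans(points, k, n_chunks):
    chunks = [points[i::n_chunks] for i in range(n_chunks)]
    intermediate = [c for chunk in chunks if chunk
                    for c in map_phase(chunk, k)]
    return reduce_phase(intermediate, k)
```

In a real MapReduce job each call to `map_phase` would run on a separate worker; the list comprehension above only imitates that flow in one process.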
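
The claim that the algorithm need not store scanned samples follows from TEDA's recursive updates. The sketch below shows the recursion as it is commonly stated in the TEDA literature, for a single cluster only: the running mean and variance are updated per sample, eccentricity is computed from them, and typicality is one minus eccentricity. The class and function names and the threshold parameter `m` are assumptions for illustration; the thesis's own fuzzy multi-cluster create, update and merge rules are not given in the abstract and are not reproduced here.

```python
class TedaState:
    """Recursive statistics for one cluster; no past samples are stored."""

    def __init__(self):
        self.k = 0        # number of samples seen so far
        self.mean = None  # running mean vector
        self.var = 0.0    # running scalar variance

    def update(self, x):
        """Absorb sample x and return its eccentricity."""
        self.k += 1
        k = self.k
        if k == 1:
            self.mean = list(x)
            self.var = 0.0
            return 1.0  # a single sample has eccentricity 1
        # Recursive mean: mu_k = (k-1)/k * mu_{k-1} + x_k / k
        self.mean = [(k - 1) / k * m + xi / k
                     for m, xi in zip(self.mean, x)]
        d2 = sum((xi - m) ** 2 for xi, m in zip(x, self.mean))
        # Recursive variance: var_k = (k-1)/k * var_{k-1} + d2 / (k-1)
        self.var = (k - 1) / k * self.var + d2 / (k - 1)
        if self.var == 0.0:
            return 1.0 / k  # all samples identical so far
        # Eccentricity: 1/k + ||mu_k - x||^2 / (k * var_k)
        return 1.0 / k + d2 / (k * self.var)

def typicality(ecc):
    """Typicality is the complement of eccentricity."""
    return 1.0 - ecc

def is_outlier(ecc, k, m=3.0):
    # Chebyshev-style test on normalised eccentricity (ecc / 2);
    # m = 3 is an assumed, commonly used setting.
    return ecc / 2 > (m * m + 1) / (2 * k)
```

Feeding the samples one at a time, for example `ecc = state.update([value])` followed by `is_outlier(ecc, state.k)`, keeps memory use constant regardless of how long the stream runs.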
Keywords/Search Tags:TEDA, eccentricity, typicality, cluster, K-Means, MapReduce framework