
The Research On Clustering Of Mixed Data Stream Based On DPC Algorithm

Posted on: 2019-01-06
Degree: Master
Type: Thesis
Country: China
Candidate: J H Zeng
Full Text: PDF
GTID: 2417330566986683
Subject: Statistics
Abstract/Summary:
Cluster analysis is an important research topic in data mining. With the arrival of the big data era, data streams are used in many fields, and data stream clustering has become a far-reaching and challenging technology. Compared with traditional static data, a data stream is high-speed, dynamic, and changeable, which makes it difficult to cluster. In addition, the high dimensionality, mixed attributes, and massive volume of data streams impose further requirements on clustering. This thesis addresses these issues and proposes a data stream clustering algorithm that adapts to the characteristics of data streams and effectively handles their high dimensionality, mixed attributes, and massive volume.

The thesis has four main parts. First, it discusses issues related to data stream clustering analysis, summarizes the characteristics of data streams, and introduces stream processing models and data stream clustering methods. Second, it introduces preprocessing methods for data streams, covering data stream standardization, data stream reduction, and the processing of mixed attributes. Third, it identifies three shortcomings of the DPC algorithm: it cannot handle mixed-attribute data, the choice of truncation distance affects the density calculation, and it cannot process large-scale data. To address these shortcomings, the thesis proposes an improved algorithm that combines a mixed data processing method based on information entropy, a KNN non-parametric kernel density estimation method, and sliding window technology to cluster mixed-attribute data streams. Fourth, the improved DPC algorithm is used to cluster the KDDCup99 dataset and is compared with the CluStream and DenStream algorithms to evaluate its clustering performance. To test the utility of the improved DPC algorithm, cluster analysis is also carried out on the Census Income dataset and the Bank Marketing dataset, and a control experiment is designed to verify the effectiveness of the density improvement.

The results on the KDDCup99 dataset show that the improved DPC algorithm can detect clusters of arbitrary shape and maintain high clustering accuracy. Compared with the CluStream and DenStream algorithms, the improved DPC algorithm achieves significantly higher clustering accuracy and better stability. In the utility test, the improved DPC algorithm maintains high clustering accuracy on the Census Income dataset and the Bank Marketing dataset, and the control experiment also verifies the effectiveness of the density improvement.

The main contributions of this thesis are as follows. First, it improves the DPC algorithm in three respects so that it is suitable for cluster analysis of high-dimensional mixed-attribute data streams. Second, it puts forward a feasible and effective clustering method for high-dimensional mixed-attribute data streams, which is applicable to data in fields such as network security, social science, and economics. Third, it implements the improved DPC algorithm in MATLAB, which promotes the use of MATLAB in data stream clustering.
Keywords/Search Tags: Density Peaks Clustering Algorithm, Data Stream Clustering, Mixed Data, K-Nearest Neighbors, Information Entropy
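
To make the density idea in the abstract concrete, the following is a minimal Python sketch of density peaks clustering in which the truncation distance is replaced by a density computed from each point's k nearest neighbors. It is an illustration only, not the thesis's MATLAB implementation: the function name `dpc_knn`, the exponential density formula, the Euclidean distance, and the rule of picking the points with the largest density-times-distance score as centers are all assumptions, since the abstract does not give the exact formulas. The entropy-based handling of mixed attributes and the sliding window are not shown.

```python
import math


def dpc_knn(points, k=3, n_clusters=2):
    """Illustrative density peaks clustering with a KNN-based density.

    Assumes len(points) >= 2, k >= 1, and purely numeric attributes.
    """
    n = len(points)
    # Pairwise Euclidean distances (assumed metric).
    dist = [[math.dist(p, q) for q in points] for p in points]

    # Local density rho: large when the k nearest neighbors are close.
    # This formula is one common KNN variant, assumed for illustration.
    rho = []
    for i in range(n):
        nearest = sorted(dist[i][j] for j in range(n) if j != i)[:k]
        rho.append(math.exp(-sum(d * d for d in nearest) / len(nearest)))

    # Visit points from highest to lowest density.
    order = sorted(range(n), key=lambda i: -rho[i])

    # delta: distance to the nearest point that ranks higher in density;
    # parent: index of that point. The densest point gets its largest distance.
    delta = [0.0] * n
    parent = [-1] * n
    delta[order[0]] = max(dist[order[0]])
    for rank in range(1, n):
        i = order[rank]
        j = min(order[:rank], key=lambda h: dist[i][h])
        delta[i], parent[i] = dist[i][j], j

    # Centers: largest rho * delta; the densest point is always a center.
    gamma = [rho[i] * delta[i] for i in range(n)]
    gamma[order[0]] = math.inf
    centers = sorted(range(n), key=lambda i: -gamma[i])[:n_clusters]

    # Each remaining point inherits the label of its denser parent.
    labels = [-1] * n
    for c, i in enumerate(centers):
        labels[i] = c
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[parent[i]]
    return labels, centers


# Two small, well-separated groups of 2-D points (made-up data).
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
          (10, 10), (10, 11), (11, 10), (11, 11), (10.5, 10.5)]
labels, centers = dpc_knn(points, k=3, n_clusters=2)
print(labels, centers)
```

In this sketch the only tuning parameter for the density is the neighbor count `k`, which is the motivation the abstract gives for moving away from a truncation distance. A streaming version would presumably apply such a routine to the points inside each window, but the abstract does not say how the windows are formed or merged.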