
Data Stream Processing Algorithm Based On Cluster Analysis

Posted on: 2014-09-07 | Degree: Master | Type: Thesis
Country: China | Candidate: R H Wang | Full Text: PDF
GTID: 2268330401987281 | Subject: Computer application technology
Abstract/Summary:
With the development of information technology, more and more real-world projects must process large volumes of data that are transient, real-time, and unbounded; such data is called a data stream. Data streams first appeared in traditional banking and stock trading, and now also appear in many research areas such as geological surveying, meteorology, and astronomical observation. They are especially common in networking, for example network traffic monitoring and click streams, and in wireless communication networks, for example call records. Most of this data is high-dimensional, and these sources produce large amounts of detailed data continuously and automatically. In these fields, the data can usually be processed in depth offline in a data warehouse, with analyses that include trend analysis and forecasting. However, some newer applications are highly time-sensitive and require online analysis, especially in network security and national security, for example Internet fraud detection, intrusion and anomaly detection, complex crowd control, trend monitoring, and exploratory analysis. Such data therefore usually requires complex, near-real-time analysis.

Data stream analysis mainly covers three areas: frequent pattern mining, classification, and clustering, which draw on newer methods and techniques such as the sliding window. This paper presents HpFitStream, a new data stream processing algorithm based on cluster analysis that uses projection and fitting. Building on an introduction to data streams and data stream clustering algorithms, the algorithm is applied to the data streams of bridge health monitoring. It combines the sliding window technique with a fitting algorithm to preprocess the data, and stores the statistical characteristics of the clustered data stream in a summary data structure.
Correlation analysis from statistical theory is used to analyze the data points of the stream and to understand its characteristics and trends, so that the status of the monitored objects can be assessed effectively; when a monitored object shows a serious anomaly, an early warning can be issued and maintenance, repair, and management can be carried out.

This paper mainly focuses on the following aspects:

① Outline the concept of the data stream and, from the perspective of data stream processing functions, introduce the existing data stream processing models, including the sliding window model, the landmark model, and the snapshot model, and summarize the advantages and disadvantages of each. Taking the purpose and content of the research as the starting point, the sliding-window-based data stream processing model is chosen, in order to ensure reliable and stable data processing and the feasibility of the analysis.

② Introduce the concept of data stream cluster analysis and the classical data stream clustering algorithms. Most current clustering algorithms target low-dimensional data streams, but most of the data that needs to be processed is high-dimensional, so this paper also introduces clustering algorithms for high-dimensional data streams. Given the multidimensional nature of bridge health monitoring data, and building on the classical data stream clustering algorithms, this paper proposes an improved data stream clustering algorithm, a high-dimensional data clustering algorithm based on feature projection and fitting (HpFitStream), which clusters large, dynamic, high-dimensional data streams.
The algorithm is based on the sliding window technique; it uses eigenvector projection to reduce the dimensionality of the high-dimensional data stream and a polynomial fitting algorithm to preprocess abnormal data in the original data.

③ Based on the clustering results, propose a new method for analyzing data stream trends: sliding-window-based trend analysis. The method uses the sliding window algorithm to segment the data stream in real time, then applies nonlinear polynomial fitting by the least squares method to the data in each window and performs predictive analysis. In this least-squares trend analysis, if the data stream shows no anomalies, the fitted curve can be used to predict data for different periods, so that the development trend of the data can be observed in more detail.

The experimental results show that the clustering algorithm not only compresses data and saves memory space, but also greatly shortens data processing time and improves clustering quality. The segmented trend analysis method also greatly improves the speed of data processing.
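
The sliding-window trend analysis in ③ can be illustrated with a minimal sketch. This is not the thesis's HpFitStream implementation: the class name `SlidingWindowTrend`, the helper functions, the window size of 8, and the polynomial degree of 2 are all assumptions chosen for illustration. The sketch keeps the most recent readings in a fixed-size window, fits a polynomial to them by least squares (via the normal equations), and extrapolates one step ahead.

```python
from collections import deque


def polyfit_least_squares(xs, ys, degree):
    """Return coefficients c[0..degree] minimizing sum((p(x) - y)**2)."""
    n = degree + 1
    # Normal equations (V^T V) c = V^T y for the Vandermonde matrix V.
    a = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    # Forward elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for k in range(col, n):
                a[row][k] -= factor * a[col][k]
            b[row] -= factor * b[col]
    # Back substitution.
    coeffs = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(a[i][j] * coeffs[j] for j in range(i + 1, n))
        coeffs[i] = (b[i] - tail) / a[i][i]
    return coeffs


def evaluate(coeffs, x):
    """Evaluate the polynomial with coefficients c[k] for x**k at x."""
    return sum(c * x ** k for k, c in enumerate(coeffs))


class SlidingWindowTrend:
    """Hypothetical helper: fit the latest window and predict the next value."""

    def __init__(self, window_size=8, degree=2):
        if window_size <= degree:
            raise ValueError("window_size must exceed the polynomial degree")
        self.window = deque(maxlen=window_size)  # oldest readings drop out
        self.degree = degree

    def push(self, value):
        self.window.append(value)

    def predict_next(self):
        """Return the one-step-ahead prediction, or None if too few points."""
        if len(self.window) <= self.degree:
            return None
        xs = list(range(len(self.window)))  # local time index in the window
        coeffs = polyfit_least_squares(xs, list(self.window), self.degree)
        return evaluate(coeffs, len(self.window))


if __name__ == "__main__":
    trend = SlidingWindowTrend(window_size=8, degree=2)
    for t in range(10):
        trend.push(float(t * t))  # synthetic quadratic signal
    print(trend.predict_next())
```

In a monitoring setting, one plausible use (again an assumption, not a claim about the thesis) would be to compare each new reading against `predict_next()` and treat a large deviation as a candidate anomaly before the reading enters the window.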
Keywords/Search Tags: data stream, data stream clustering, feature vector, polynomial fitting, trend analysis, least squares method, sliding window technology