Font Size: a A A

Key Technologies Research On Data Stream Mining

Posted on:2010-07-07Degree:DoctorType:Dissertation
Country:ChinaCandidate:F WuFull Text:PDF
GTID:1118360305973619Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
With the rapid development of computer hardware, network communication and distributed computing technique, data stream, a new kind of data type, has been coming, which extensively exists in many research areas, such as network monitoring, financial analysis, sensor networks, weather or environmental monitoring and so on. On the other hand, the traditional data processing technology is suitable for dealing with static and stable datasets but not yet easy to be extended to data stream scenes in which the data stream is infinite, rapid, various and continuous. So, among the new kind of computation, it is one of the new difficulties in theory and application how to manage and analyze the data streams, especially, to detect in time the network anomaly through data stream mining.Based on the conclusion and analysis of the existing researches at home and abroad, this paper deeply and profoundly carries out studies which focus on the four key techniques in data stream mining, that is to say, similarity search, mining frequent patterns, classifying and arbitrary shape clustering over data streams. The main contributions are concluded as follows:1. In the aspect of the data stream similarity search, we research the no index structure lower bound function of Dynamic Time Warping (DTW). Using the concepts of subsection, fill-ins, and row & column constrains in mathematics, we construct a group of data stream similarity measure functions suitable for different scenes and the corresponding refining functions of the upper and lower bounds, and then we propose the similarity search algorithm on data streams. The theory analysis and statistical experiments confirm that both the constructed functions and the search algorithm have low computational complexity and high approximate degree, which has a good application perspective in the data stream similarity search.2. In the aspect of the data stream frequent pattern mining, considering the timeliness of data stream and the shift of the stream center, we propose an algorithm of data stream frequent pattern mining, named LWFPM, combining with both the landmark window and the time decaying model. By dynamically constructuring the global pattern tree, the method uses time exponential decay function to account the support of each pattern, within the landmark window, whose frequent degree is described. Moreover, to reduce the space cost effectively, the constructed pruning threshold function is used to cut in time the pattern that is not able to turn into the frequent pattern from the global tree. This paper deeply analyzes the important parameters and thresholds in the algorithm. The experimental results show that, comparing with the algorithm MSW, LWFPM has higher mining precision (over 90% in average) and lower storage cost, and it can reach the request of processing high-speed data streams in the execution speed. Our method can be applied to the streams with different number of transactions, average transaction sizes and maximal potentially frequent pattern sizes.3. In the aspect of the data stream classification, the traditional Back Propagation (BP) classification algorithm is difficult to meet the need for real-time response. Based on the kernel principal component analysis algorithm, through the study of incremental solving methods, we construct two different dimensionality-reduction algorithms to decrease the amount of calculation of classification. Furthermore, combined with the BP neural network, the corresponding data stream classification algorithm is raised, namely IKOC. The theory analysis and statistical experiments indicate the reduction algorithms have low space & time complexity and stable convergence performance, and IKOC can real-time process data streams and has high classification accuracy.4. In the aspect of the data stream arbitrary shape clustering, considering the timeliness and the concept shift of data stream, we propose an algorithm of arbitrary shape clustering called SWASC, combining with both the sliding window and the time decaying model. This algorithm applies the time decaying model to decay the density of the historic datum at the speed of exponential order so that the density of datum without the current sliding window is almost reduced to zero. By constructing a clustering feature structure with six elements, we get the decay density of micro-clusters within the landmark window, which is used to describe the degree of density of micro-cluster within the current sliding window. The pruning strategy is designed to timely cut down the sparse micro-cluster and the obsolete micro-cluster within and without the current window, respectively, which can decrease the storage cost and maintenance cost. Extensive experimental results indicate that, comparing with the algorithm DenStream, SWASC has higher clustering speed and lower memory cost, and it is fit for arbitrary shape clustering over data streams with different sizes, dimensions and numbers of natural clusters.
Keywords/Search Tags:data stream, similarity search, mining frequent patterns, classifying, arbitrary shape clustering, time decaying model
PDF Full Text Request
Related items