In recent years, with the rapid development of information technology, stream data in different ways appear in many applications, including network-traffic monitoring, financial applications, communication data management, network security monitoring and sensor networks. In these applications, it is obviously much important for making research on this new kind of stream data and related technology. Therefore, data mining on stream data becomes the research hotspot in the area of data mining.Researches on data stream mining mainly focus on modeling query, regression, clustering and classification. This paper analyses various existing algorithms on such tasks and presents more efficient algorithms on frequent itmes mining and data stream clustering. The main contributions of the paper are as follows:(1) Most of the exsiting alogorithms on mining requent stream data items do not emphasis the importance of the recent data items. The sliding window is an accepted and effective approach to mine frequent data itmes in the recent period of time. In this paper, we propose an algorithm for mining frequent items in a stream based on sliding window. This paper adopts linked queue strategy which greatly improves the efficiency of the algorithm. Given a threshold S, an error boundεand the length of the sliding window n, our algorithm can deterministly detect the data items within the current window whose frwquncy exceeds Sn with an error less thanεn using O(ε-1) memory space, and the processing time for each data item and the query time are both O(1). Based on this algorithm, we have proposed a general framework for mining frequent items in data stream based on sliding window. Under this framework, different algorithms can be constrcted by changing the parameters which could adjust the time complexity and space cost of the algorithm. Through extensive experiments, we show that our algorithm outperforms other methods in terms of the accuracy, memory requirement, and processing speed.(2)We introduce a time fading model which emphasises the recent data items and decreases the importance of the old ones. Based on this time fading model, FC1 and FC2 algorithms are proposed to detect theε-approxiate frequent items. The improved algorithm FC2 can detectε-approximate frequent items of data stream using O(ε-1) memory space and the processing time for each data item is O(1). Experimental results on several synthesis and real data sets show that FC2 has higher precision, require less memory and consume less computation time than other similar methods. Then, we propose a simpler but more effective algorithmλ-Count for computing frequency counts over a user specified threshold on data streams.λ-Count can detectε-approximate frequent items of a data stream using O( logλε) memory space and the processing time for each data item is O(1). Through extensive experiments, we show thatλ-Count outperforms other similar methods in terms of the accuracy, memory requirement, and processing speed.(3) Most of the existing stream data clustering methods such as CluStream are based on the k-means algorithm. A fundamental drawback of those k-means based methods is that it aims at identifying spherical clusters but is incapable of revealing clusters of arbitrary shapes. Also, those methods are unable to detect noise and outliers. Furthermore, they require the knowledge of k and a userspecified length of the time window. To address these issues, this paper proposes D-Stream, a framework for clustering stream data using a density-based approach. Then, DS0 and DS1 algorithms which are based on D-Stream are proposed. Exploiting the intricate relationships between the decay factor, attraction, data density and cluster structure, our algorithms can efficiently and effectively generate and adjust the clusters in real time. Further, a theoretically sound technique is developed to detect and remove sporadic grids mapped by outliers in order to significantly improve the space and time efficiency of the system. The experimental results show that our algorithm has superior quality and efficiency, and can accurately recognize the evolving behaviors of real-time data streams.(4) Previous works on clustering multiple data streams use Euclidean distances to measure the similarity between the streams. A fundamental limitation of Euclidean distance is that it only measures the distance but ignores the temporal trends and sequential patterns of data streams, which are often more important for users. A multiple streams clustering algorithm based on Kendall correlation analysis is proposed. We advance a novel compression scheme based on the AU-statistics, which quickly compresses the raw data from multiple streams into a compressed synopsis. The synopsis allows us to incrementally reconstruct the correlation coefficients without accessing the raw data. A modified k-means algorithm is developed to generate the clustering results and dynamically adjust the number of clusters in real time so as to detect the evolving changes in the data streams. Finally, we extend the framework to support clustering on demand (COD), where a user can query for clustering results over an arbitrary time horizon. A theoretically sound time-segment partitioning scheme is developed so that any demand time horizon can be fulfilled by a combination of those time-segments. Experimental results show that our algorithm has higher clustering quality, speed and stability than other methods, and can detect the evolving changes of the data streams in real time. |