Font Size: a A A

Research On Frequent Items Mining And Clustering Algorithms Of Data Stream

Posted on:2009-11-06Degree:DoctorType:Dissertation
Country:ChinaCandidate:S Y WangFull Text:PDF
GTID:1118360272958842Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
A growing number of applications appeared with the rapid development of scientific technology and the wide application of information technology, where data takes form of "continuous data streams " , which means data arrives continuously, orderly and in real time, the application areas include network-traffic monitoring, computer-network security, financial applications, environmental monitoring and log records analysis, and many more. This sequence of elements arriving continuously and orderly is called data stream. Different from traditional database, data stream has the following distinguished characteristics: unbounded volume of data; being processed only once, unless being reserved; rapid arriving rate of records; uncontrollability of records' arriving order. Analysis of and mining of data stream have become hot subjects of research.Data stream mining is to extract useful information and knowledge which embed in the streaming data and previously unknown to users. The main function of streaming data mining is to find association rules, classification rules, cluster structure, abnormity ,etc through mining the frequent items(itemsets), classifying, clustering, abnormal detection, etc. For example, we can perform accounting based on usage, network-traffic monitoring, computer-network security monitoring by applying the technology of mining frequent items from data stream; We can perform computer-network intrusion attack monitoring, topic detection and tracking of newswire, monitoring the traffic and environment, dividing the customer of big companies into different groups, detecting the financial cheating, etc by applying the technology of clustering data streams.It is impossible to store all the data in the data stream due to the limited memory and unbounded streaming data. therefore, the algorithms dealing with the data stream only store the synopsis of data stream which is updated with the data arriving in the stream. In the data stream computation model, the algorithms can only give the approximate results upon the users' request by using the stream synopsis stored.Our main work focus on mining frequent items and the clustering of data stream, since they have great importance in network flow analysis. The data of network flow and many other application fields contains different types of attributes besides numeric attributes, there are also categorical and other types of attributes. In addition, the data of many application can be of high dimensionality which may be as high as several dozen or even several hundred. Therefore, it is of great importance both in theoretical and in practical to perform research on the clustering algorithms of high dimensional data stream and the clustering algorithms of data stream where data contains mixed attributes. On the other hand, it has also been a hot subject of research to develop algorithms and systems based on the biological system, and the research has made distinguished progress, as one of them, artificial immune system(AIS) has the combined advantages of classifier system, neural network and machine reasoning etc, and it can provide novel way of solving problems. There have been a few researches in the applying AIS to data stream clustering, but there are problems, therefore, we also focus on providing new AIS-based clustering algorithm aiming at solving the problems in the currently AlS-based stream clustering algorithm. Our primary works are as follows:(1) Research on Algorithms of Mining Frequent Items in Data StreamWe introduce a memory efficient data structure based on Bloom Filter, which we call ESBF(Extensible and Scalable Bloom Filter), ESBF can be used for store information of large dataset and can provide efficient query. Based on ESBF, the algorithm for miningfrequent items in data stream based on landmark window model——FI-ESBFL is proposed,and it is proved that FI-ESBFL can achieve the same precision and probability as other Bloom Filter based algorithm dealing with stream frequent item mining by using less amount of counters. The memory used by FI-ESBFL can be adjusted dynamically according to the different data distribution and the number of distinct items in the data streams, which contribute to memory saving. Experimental results demonstrate the efficiency and effectiveness of FI-ESBFL. We also propose the new algorithms of mining frequent items in data stream based on damped window model(FI-ESBFD) and sliding window model(FIS-EBFS) respectively, FI-ESBFD and FIS-EBFS are all based on FI-ESBFL. FIS-EBFSD is more efficient in time and memory comsumption than the compared algorithms based on Bloom Filter and damped window model in most cases. FIS-EBFS is much more efficient in time consuming than the compared algorithm based on sliding window model.(2) Research on Clustering Algorithms of Data Stream with Mixed AttributesTwo different similarity measures of data objects with mixed attributes based on entropy are introduced, then, we propose two different algorithms of clustering the data stream with mixed attributes based on the two similarity measures respectively, they are CNCE-Stream and CNCDE-Stream. CNCDE-Stream define the similarity between the data objects using both Euclidean Distance and entropy, while CNCE-Stream only use entropy as similarity measure. Before proposing algorithm CNCE-Stream, we introduce a new way of estimate the pdf(probability density function) in the stream scenario, and then the way of calculating the expect entropy of a cluster with mixed attributes is introduced, the incremental entropy and merging entropy are introduced to make CNCE-Stream work. Experimental results demonstrate that both CNCDE-Stream and CNCE-Stream can work with high precision, and CNCDE-Stream is very efficient in time consuming. (3) Research on Subspace Clustering Algorithm of High Dimensional Data StreamWe introduce a new subspace clustering algorithm of high dimensional datastream(SOStream) to address the issue of that most clustering algorithm of data stream can work well only when data is of low dimensionality and of the disadvantages of the existing subspace stream clustering algorithms. SOStream is based on grid and density and it maintains a superset of all dense units in an online way, we introduce the delayed insertion of potential dense units and the pruning of sparse units to make the algorithm work more efficiently. SOStream will generate the final cluster using the maintained dense units when required. The experiments conducted show the effectiveness of the proposed algorithm.(4) Research on Clustering Algorithm of Data Stream Based on Artificial Immune Network(AIN)The immune principle based learning can adapt to the dynamic environment easily, So a new AIS based data stream clustering algorithm(AIN-Stream) is proposed. During the process of clustering, the simulation level of a B-cell is determined by the antigen(the data in the stream). By creating and maintenance of the B-Cell Feature vectors and by means of statistical analysis, AIN-Stream is capable of automatically determining themost important parameter of AIS based clustering algorithm——Recognization Zone ofB-Cells. B-Cell Feature vectors also make it possible to eliminate the redundant B-cells more effectively, which also contribute to the improved efficiency of the algorithm. AIN-Stream can determine the number of final clusters automatically, which makes the real unsupervised clustering. Experimental results demonstrate that AIN-Stream can track the evolving of data steam and has high clustering quality, experiments conducted also show that AIN-Stream is superior in memory and processing time consumption than other immune principle based clustering algorithms under the circumstance of similar clustering results.In conclusion, our methods are great complementarity and improvement to existing ones which deal with data stream frequent items mining and data stream clustering. Theoritical analysis and experimental results show that our methods can solve the corresponding problems effectively and efficiently.
Keywords/Search Tags:Data Stream, Data Mining, Window Model, Frequent Items, Clustering, Mixed Attributes, Subspace, AIS
PDF Full Text Request
Related items