Analysis On Evolving Clustering For Categorical Data Stream

Posted on: 2016-04-04 | Degree: Doctor | Type: Dissertation
Country: China | Candidate: Y H Li | Full Text: PDF
GTID: 1108330482950509 | Subject: Systems Engineering
Abstract/Summary:
Clustering is an important research direction in machine learning. Because it can detect cluster structure automatically, it is widely applied not only in research domains such as image processing, bioinformatics, Chinese information processing, social networks, and intelligent medicine, but also in data mining for production practice and social management across all trades and professions. Clustering models and algorithms for static data have already been studied in the literature. However, fields such as stock trading, real-time monitoring, electronic commerce, and social media produce data streams that are ordered, fast, large-scale, and potentially infinite, so many practical applications need to mine the cluster structure of a data stream, capture changes in cluster patterns, and detect abnormal data. Given these characteristics, how to quickly and accurately find the cluster patterns, concept drift, and pattern evolution hidden in a data stream using limited storage space has become an important research topic in clustering. Clustering of categorical data streams is a newer research field than clustering of numerical data streams, and one of the main difficulties in categorical data analysis is measuring the similarity or dissimilarity between data in an appropriate way. Because the corresponding clustering models and algorithms differ from those for numerical data streams, cluster analysis of categorical data streams has attracted extensive attention from researchers in recent years. This thesis concentrates on categorical data streams and aims to build a clustering analysis framework suited to the characteristics of data streams.
We systematically study models and algorithms for data labeling, concept drift detection, data stream evolution, and outlier detection. The main contributions are summarized as follows:

(1) We propose a categorical data labeling method based on incremental entropy. The method uses incremental entropy to measure the degree of change in cluster structure caused by putting an unlabeled data point into different clusters, and thereby characterizes the similarity between a data point and a cluster. The proposed "point-cluster" similarity measure overcomes a shortcoming of the traditional "point-cluster" similarity measure, which is based on the attribute value distribution and cannot capture the degree of change in cluster structure caused by putting a data point into different clusters. The method can also adjust the outlier threshold adaptively and dynamically during data labeling. Comparative experiments on categorical data streams and incremental categorical data show that the proposed method improves the accuracy of data labeling and lays a foundation for improving the accuracy of categorical data stream clustering.

(2) We propose a concept drift detection method for categorical data streams based on a cluster distribution similarity measure. We first define a similarity measure between two cluster distributions based on the sample standard deviation and present an approximate solution method for the density function of this similarity measure. We then put forward a method for determining the threshold of cluster distribution change based on the confidence level. The proposed concept drift detection algorithm can detect concept drift caused by too many outliers in the new window or by an obvious change in cluster distribution between the new window and the old window.
Experimental results show that the proposed method can effectively detect concept drift during categorical data stream clustering.

(3) We present a "cluster-cluster" dissimilarity measure based on incremental entropy, which describes the similarity of two clusters by measuring the change in information entropy caused by merging one cluster into different clusters. The proposed measure overcomes a shortcoming of existing "cluster-cluster" similarity measures, which are based on the attribute value distribution and cannot dynamically capture the change in cluster structure caused by mixing data from different clusters. In addition, we propose a definition of cluster representatives that takes into account the distribution of a cluster's attribute values both within that cluster and in other clusters, and present a categorical data stream evolution analysis algorithm that intuitively shows how cluster patterns evolve in the data stream.

(4) We present a network intrusion detection method based on evolving clustering of categorical data streams, combining a misuse detection model with an anomaly detection model. A knowledge base is established from an initial clustering made up of normal patterns and abnormal patterns. When the network access data stream evolves, the knowledge base is re-clustered to reflect the state of network access. The similarity between network access data and the normal and abnormal patterns is measured using incremental entropy, and the legitimacy of the network access data is determined accordingly. The whole process of learning and detection scans the network access data only once. Experiments show that the presented method has advantages in real-time performance and adaptability.

The research in this thesis further enriches the results in the field of categorical data stream analysis and provides technical support for stream data mining and knowledge discovery.
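
To make the incremental-entropy idea behind the data labeling contribution more concrete, the following is a minimal illustrative sketch in Python. It is not the thesis's actual formulation, which the abstract does not give. It assumes that a cluster's entropy is the sum of the Shannon entropies of its per-attribute value distributions, that the increment is the change in size-weighted entropy when a point is added, and that the outlier threshold is fixed (the thesis adjusts it adaptively). The names `cluster_entropy`, `incremental_entropy`, and `label_point` are hypothetical.

```python
import math
from collections import Counter


def cluster_entropy(points):
    """Sum, over attributes, of the Shannon entropy (in bits) of the
    attribute's value distribution within the cluster (assumed definition)."""
    n = len(points)
    if n == 0:
        return 0.0
    total = 0.0
    for column in zip(*points):  # one column per categorical attribute
        for count in Counter(column).values():
            p = count / n
            total -= p * math.log2(p)
    return total


def incremental_entropy(cluster, point):
    """Change in size-weighted entropy caused by adding `point` to `cluster`.
    This weighting is an assumption made for illustration."""
    grown = cluster + [point]
    return len(grown) * cluster_entropy(grown) - len(cluster) * cluster_entropy(cluster)


def label_point(clusters, point, outlier_threshold):
    """Return the index of the cluster whose structure changes least when
    `point` is added, or None if even the best increment exceeds the
    (here fixed, hypothetical) outlier threshold."""
    increments = [incremental_entropy(c, point) for c in clusters]
    best = min(range(len(clusters)), key=increments.__getitem__)
    if increments[best] > outlier_threshold:
        return None
    return best


# Toy data: two clusters of (colour, size) records.
clusters = [
    [("red", "small"), ("red", "small"), ("red", "large")],
    [("blue", "large"), ("blue", "large"), ("green", "large")],
]

print(label_point(clusters, ("red", "small"), outlier_threshold=4.0))
print(label_point(clusters, ("blue", "large"), outlier_threshold=4.0))
print(label_point(clusters, ("yellow", "medium"), outlier_threshold=4.0))
```

In this sketch a point is assigned to the cluster it disturbs least, and a point that would noticeably disturb every cluster is reported as an outlier. This mirrors the "point-cluster" similarity described in contribution (1) only in spirit.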
Keywords/Search Tags:categorical data stream, clustering analysis, data labeling, concept drift detection, data stream evolving analysis, outlier detection