Font Size: a A A

Research On Key Techniques For Evolving Data Stream Mining

Posted on:2015-12-13Degree:DoctorType:Dissertation
Country:ChinaCandidate:Muhammad Zia-ur-RehmanFull Text:PDF
GTID:1228330461974355Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
Continuous growth in the data size, generated by real world computer applications, impos-es new challenges for online data handling and extraction of useful knowledge. Applications including homeland security applications, pervasive computational applications, industrial qual-ity monitoring systems, telecommunication and computer networks require online monitoring and analysis. In these applications, tremendous amount of data is produced continuously at an unprecedented rate by single or multiple sources referred as a data stream. Because of the un-interrupted, continuously changing data distribution, fluctuating data arrival rate and ceaseless nature of the data stream, the amount of data is likely to be unbounded and volatile. A timely detection of interesting patterns over incoming data is critical for in-time decision making and response to emerging changes produced within the data.First, a detailed survey of the existing scientific literature in the stream mining field is per-formed which suggests that most commonly studied problem in this domain is related to clas-sification. Decision tree algorithms are the one of the first classification algorithms that were adopted to the data stream. One of the first decision tree learners for data stream relies on the Hoeffding bound as a splitting criterion for the splitting of the nodes into two leaves. Hoeffding bound is attractive due to its property of being independent of the probability distribution gener-ating the observation and converges with a larger value of the denominator (number of objects seen so far). In literature, there are concentration inequalities other than Hoeffding inequali-ty which are stricter as compared to it. Bernstein’s and Benett’s inequalities are distribution dependent because they involve the variance term. In this dissertation, the Bernstein’s bound instead of Hoeffding bound is used to design a distribution aware decision tree learner (EBT) and achieve results with better accuracy and robustness with less computational cost.Most of the existing classification techniques believe that there are enough labeled data available for training of a classifier, which is not always true, especially for emerging scientific and engineering applications which generate evolving data at a huge speed. Since data streams are inherently boundless in length and new concepts evolve all the time. Therefore, they expe-rience from lack of true labels. It is impossible for domain experts to label all of the data stream manually. A wrapper technique is proposed, based on grid clustering to perform labeling of the incoming training data. The experimental evaluation suggests that the proposed algorith- m (LLSC) can attain the same classification accuracy as VFDT for 50% labels. However, its computational time is very high. LLSC uses the intersection of uni-dimensional gridCells to identify a particular cell. This step is very expensive. To improve its efficiency, a modified version LLSC+1 is proposed, which can identify the gridCell using a hash table in a single step. Furthermore, a semi-supervised clustering technique to partition the data into distinct classes is used. LLSC+1 produces similar classification accuracy but at very low cost than LLSC. In fact, for higher dimensions, it may reduce the computational cost more than half as compared to LLSC.At last, an unsupervised classification technique is proposed. The proposed technique is a clustering technique to represent clusters in hyper ellipsoidal clustering shapes. Most of the existing clustering algorithms rely on Euclidian distance metric which results in spherical clus-ters. An ellipsoid is more flexible shape as compared to sphere. A novel clustering technique called Hyper-Ellipsoidal Clustering for Evolving data Stream (HECES) based on the recently proposed HyCARCE algorithm is proposed. HECES consists of two phases, online and offline phase. In online phase, the algorithm reads incoming multi-dimensional streaming data and arranges it in a discretized density grid space. After the user defined time span, ellipsoids are fitted over each grid cell. These initial ellipsoids are merged together to form the final clus-ters. Mahalanobis distance metric is used to decide the merger of these ellipsoids. Finally, the ellipsoids having a lower cardinality are pruned out. The results of HECES are better as compared with DenStream and Clustream. Moreover, the time complexity of HECES is O(N) which makes it attractive for applying for streaming data.
Keywords/Search Tags:Data stream, Classification, Clustering, Evolving concepts, Bernstein bound
PDF Full Text Request
Related items