Font Size: a A A

Selected Problems On Mining Massive Strfaming Data

Posted on:2016-12-09Degree:DoctorType:Dissertation
Country:ChinaCandidate:Z Q YuFull Text:PDF
GTID:1108330461485407Subject:Computer software and theory
Abstract/Summary:Request the full-text of this thesis
With the rapid development of mobile devices, sensor technique, and Internet, data streams, as a typical kind of big data, have widely appeared in various fields. Nowadays, data streams contain many types of data, such as spatial-temporal data, sensor data, and social network data. The problem of data stream mining is very important because data streams include much valuable information. Compared with the static dataset, data streams own some obvious characteristics:the volume of data streams is always huge, the arrival rate of data streams is varying, the data streams usually need to be processed in a real-time fashion, and they hardly can be obtained again unless they are stored after being processed. Because of these characteristics, most of existing data mining algorithms hardly can be directly applied together data stream mining. Therefore, studying the problems about data stream mining is valuable and imperative.This thesis discusses two problems about data stream mining. The first problem is discovering frequent co-occurrence patterns across multiple data streams. The frequent co-occurrence pattern refers to a set of objects occurs in one data stream within a short time span and this set of objects appears in multiple data streams in the same fashion within another user-specified time span. Our goal is finding all frequent co-occurrence patterns from multiple data streams. In reality, many applications can be abstracted as the problem of mining frequent co-occurrence patterns on multiple data streams, such as discovering groups of cars that travel together using the city surveillance system, mining the goods that are sold together from the electronic commerce data, finding the people that are hanging out together based on their check-in data, and mining the hot topics by discovering groups of frequent co-occurrence keywords from social network data. To address this problem, we first partition the data stream into multiple segments, and then propose DIMine and CooMine approaches. Two methods first index all valid segments, and then design the pruning strategies based on the segments index structure. At last, they can mine frequent co-occurrence patterns efficiently by pruning the search scope.DIMine and CooMine approaches can obtain the good performance with respects to the mining efficiency, memory consumption, and maintenance cost. However, these two approaches are designed for a centralized setting where they are suitable for running on one single machine and they hardly can be adopted on the distributed environment to deal with the huge volumes of data streams. In order to mining frequent co-occurrence patterns from the large scale of data streams, this thesis proposes a distributed mining approach. This approach first generates all candidate patterns that potentially can form the frequent co-occurrence patterns from all data steams, and then sends the same candidate pattern from different data streams to the same computing node by a hash function. In this way, it can determine which candidate patterns are frequent co-occurrence patterns. Because each candidate pattern is independent, and different candidate patterns can be processed by multiple computing nodes concurrently, thus this approach can compute the frequent co-occurrence patterns in a parallel fashion to achieve the good scalability.The second problem is processing k nearest neighbor queries over huge volumes of moving objects. The spatial-temporal data is a typical kind of data steams. The k nearest neighbor query is a fundamental operation of many data stream mining problems. For a given set of moving objects, a k nearest neighbor query requires to search k objects that are nearest to this query point. The problem of processing k nearest neighbor queries has been studied extensively but most of existing algorithms assume that the volume of spatial-temporal dataset is limited and the number of queries is also small. Most of them are designed for the centralized setting where query processing takes place on a single server, it is difficult, if not impossible, for them to scale to a distributed setting to handle the vast volume of data and concurrent queries that are increasingly common in those applications. To address this problem, we propose a suite of solutions that can support scalable distributed processing of &-NN queries. We first present a new index structure called Dynamic Strip Index (DSI), which can better adapt to different data distributions than exiting grid indexes. Moreover, it can be naturally distributed across the cluster, therefore lending itself well to distributed processing. We further propose a distributed k-NN search (DKNN) algorithm based on DSL DKNN avoids having an uncertain number of potentially expensive iterations, and is thus more efficient and more predictable than existing approaches. DSI and DKNN are implemented on Apache S4, an open-source platform for distributed stream processing. The experimental results show that our proposal scales well and significantly outperforms the alternative methods.This thesis studies several problems of data stream mining and proposes the special approach for the special problem. The contributions can be summarized as follows.We first propose the problem of mining frequent co-occurrence patterns across multiple data streams, and design DIMine and CooMine two algorithms.We propose a distributed method for discovering the frequent co-occurrence patterns from the large scale of data streams.We design a distributed search algorithm for processing concurrent k nearest neighbor queries over huge volumes of moving objects.
Keywords/Search Tags:streaming data, distributed processing, k-NN query, frequent co-occurrence pattern
Request the full-text of this thesis
Related items