
Efficient Algorithms For Processing Data Streams And Massive Text

Posted on: 2012-09-14
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X J Wang
Full Text: PDF
GTID: 1118330335962387
Subject: Computer software and theory
Abstract/Summary:
With the rapid development of network technology, data streams have become common in many real-world applications. For example, a base station receives sensor data streams from sensor nodes in a wireless sensor network, and click streams are generated when people surf the Internet. Data streams are usually characterized by large data volumes, rapid generation speed, and evolving data distributions. Traditional data processing algorithms are only suitable for static, finite data that can be stored on disk; they usually require several scans over the stored data and therefore cannot be applied to data streams. How to process data streams efficiently with limited storage and computation resources thus becomes an important task. Typical data stream processing problems include clustering, change detection, and duplicate detection.

With the rapid development of the Internet, efficiently processing massive text data (e.g., a large number of web pages) becomes an important task for many applications such as network information management. Traditional text processing algorithms are usually only suitable for small-scale text data. Typical text processing problems include text classification and text clustering.

This thesis focuses on the problems of efficiently processing data streams and massive text data. Its main contributions are:

1. A change detection method for high-dimensional data streams. We propose a change detection method for high-dimensional data streams. The method divides the data space of a stream into cells of equal volume and then detects distribution changes in combinations of cells, called grids. Based on the Hoeffding bound, we set an upper bound ε on the variation of the empirical distribution over the grids, under the hypothesis that the underlying distribution does not change.
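The Hoeffding-bound thresholding step can be illustrated with a small sketch. This is a generic illustration of Hoeffding-based change detection, not the thesis's exact procedure; the function names, the two-window comparison, and the confidence parameter δ are illustrative assumptions.

```python
import math

def hoeffding_epsilon(n, delta=0.05):
    """Two-sided Hoeffding deviation bound for n samples of a
    [0, 1]-valued statistic: with probability >= 1 - delta, the
    empirical mean deviates from its expectation by less than eps."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))

def grid_changed(freq_ref, freq_cur, n_ref, n_cur, delta=0.05):
    """Flag a distribution change in one grid when the gap between its
    reference-window and current-window empirical frequencies exceeds
    the combined Hoeffding thresholds of the two windows."""
    eps = hoeffding_epsilon(n_ref, delta) + hoeffding_epsilon(n_cur, delta)
    return abs(freq_ref - freq_cur) > eps

# Example: a grid whose relative frequency jumps from 10% to 25%
# across two 1000-point windows is reported as changed.
print(grid_changed(0.10, 0.25, 1000, 1000))  # True
```

Under the no-change hypothesis, any grid whose empirical frequency drifts by more than ε is unlikely to be explained by sampling noise alone, which is what justifies reporting it as a change.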
Then, inspired by the FP-growth algorithm [Han et al., 2000], a modified data structure called the VT-Tree is defined to record the change of the empirical distribution in every cell, and all grids with variation greater than ε are found. Simulation results on empirical data show that the proposed method is quite effective.

2. Clustering high-dimensional data streams with representative points. The fixed-size interval partitioning adopted in traditional grid-based clustering methods cannot capture the clusters in each dimension well when applied to evolving high-dimensional data streams. In this thesis, we quantize each dimension (attribute) of the data points separately and use the generated representative points of each dimension in place of fixed-size intervals. These representative points are updated continuously as data points arrive, so they capture the cluster trends in each dimension more accurately than fixed-size intervals do. Experimental results on synthetic and real data sets demonstrate the high effectiveness and accuracy of the proposed method.

3. Improved online detection of duplicates in streaming data over sliding windows. We propose a new data structure, the Flag Bloom Filter (FBF), which improves on the Decaying Bloom Filter (DBF), the most efficient known variant of the Bloom filter, together with an efficient FBF-based algorithm for online approximate duplicate detection over sliding windows. Given the same sliding window size W and M counters, with M bits more memory our FBF reduces the false positive rate (FPR) of DBF by a factor of 2k/(k+1), where k = ⌊ln(2)·M/W⌋ ≥ 2 is the number of hash functions. When using the same memory size G and sliding window size W, our FBF uses a factor of 1/(log W + 1) fewer counters than DBF and has an FPR bounded by (0.25)^(k(1 − 1/log W)) · (1 + k(1 − 1/log W)), which is smaller than the FPR of DBF when W ≥ 32.

4. Approximate duplicate detection for probabilistic data streams over sliding windows. Duplicate detection methods for deterministic data streams cannot maintain the existential probabilities of the elements of a probabilistic data stream. In this thesis, we present a novel data structure, the Floating Counter Bloom Filter (FCBF), an extension of the Counting Bloom Filter that maintains these existential probabilities effectively. Based on FCBF, we present an efficient algorithm for approximate duplicate detection in probabilistic data streams over sliding windows. Given a sliding window size W and N floating counters, for any element t that occurs in the past sliding window, our method outputs the accurate existential probability of t with probability 1 − (1/2)^(ln(2)·N/W). Experimental results on synthetic data verify the effectiveness of our approach.

5. An improved KNN method for massive text classification. As a simple, non-parametric, and effective method, KNN has been widely applied to text classification. When handling massive text, however, both the classification accuracy and the classification time become unacceptable. To solve this problem, based on minimizing the increment of learning errors and combining LVQ and GNG, we propose a new growing LVQ method and apply it to text classification. Our method can generate a representative sample set after one phase of selective training on a sample set, and hence acquires a strong learning ability. Experiments show that this method reduces the testing time of KNN while maintaining or even improving classification accuracy.
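To make the sliding-window duplicate detection setting of contributions 3 and 4 concrete, here is a minimal generic sketch of Bloom filtering over a sliding window. It is not the thesis's FBF or FCBF: the class name, the last-touch-time counters, and all parameters are illustrative assumptions; it only shows the general idea that elements outside the window must stop counting as duplicates.

```python
import hashlib

class SlidingWindowBloom:
    """Approximate duplicate detection over the last `window` arrivals.
    Each of the m slots stores the arrival time at which it was last
    touched; an element is reported as a duplicate only if all k of its
    slots were touched within the current window. Like any Bloom
    filter, it can report false positives but no false negatives."""

    def __init__(self, m=1024, k=4, window=100):
        self.m, self.k, self.window = m, k, window
        self.slots = [-1] * m          # last-touch time per slot (-1 = never)
        self.t = 0                     # arrival counter

    def _indexes(self, item):
        # Derive k slot indexes from one strong hash of the item.
        digest = hashlib.sha256(item.encode()).digest()
        return [int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.m
                for i in range(self.k)]

    def seen_then_add(self, item):
        """Return True if item (approximately) appeared within the last
        `window` arrivals, then record the current arrival."""
        idx = self._indexes(item)
        dup = all(self.slots[i] >= 0 and self.t - self.slots[i] <= self.window
                  for i in idx)
        for i in idx:
            self.slots[i] = self.t
        self.t += 1
        return dup
```

For example, with `window=3`, an element repeated immediately is reported as a duplicate, but after four other arrivals its slots have aged out of the window and it is treated as new again. A decaying or floating-counter design, as in DBF/FBF/FCBF, refines this idea to lower the false positive rate or to carry existential probabilities.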
Keywords/Search Tags: text classification for massive text, data stream, change detection, clustering, duplicate detection