Font Size: a A A

Research And Implementation On An Improved Algorithm On Data Stream Processing

Posted on:2011-03-18Degree:MasterType:Thesis
Country:ChinaCandidate:Y S LiFull Text:PDF
GTID:2178360305455077Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
In the 21st century, people are becoming more and more dependent on computers. It can be said that a new birth of computer technology will significantly affect our lives. Now, with the development of computer technology, we need to extract more and more of the information from the Internet, so that the technologies about data stream processing are greatly developed. At present, the research on data streams includes the following aspects: 1. The analysis and modeling of data streams; 2. The data stream processing; 3. The storage of the data stream; 4. Data mining of the data stream.The data stream changes over the time. They are rapid and in disorder with a large number. So, it is a great challenge to create a data stream processing algorithm with fast speed and low storage. In this paper, my ultimate goal is to make research and implement on the Containment-encoded Intervals (CEI) algorithm on data stream processing, to improve and finally implement it. We name the improved algorithm ICEI. In the following, I will describe the main parts in this paper:In the first part, we give the preparation knowledge for the whole paper. The first chapter describes the current development and applications of the data stream technology. And we also describe the current development and applications in data stream processing. Couple of data stream processing algorithms are summarized, and pointed out its challenges. The problems need solving in this paper are analyzed. Finally this chapter describes the basic organizational structure of this paper. The second chapter is to further understand the data stream. It describes the concept of data stream, introducing the characteristics of the data stream and its data structure types. And, the data stream model and the basic data stream processing method are introduced.The second part is a theoretical research and implementation of the algorithm. Chapter three is the research and implementation on the data stream processing algorithm: containment-encoded intervals (CEI). CEI is an indirect index structure, which greatly reduces the traditional index's query processing time. It changes from o (log N ) to o (1). Section one describes CEI's basic data structure. It is based on VC structure. The basic idea is: the data stream breaks into ?? r /L?? VC unit intervals, where L is the width of VC unit interval, and r is the maximum value of the input data. The number of VC in each unit interval VC is 2 * L-1. The structure of the form likes a complete binary tree. And I have proposed a mapping rule. The rule is to use the least VCs to contain the whole interval. Then we can create the index. In Section two, I describe the CEI and explain its main function. At last, I program the CEI algorithm.Chapter four is the most important part of this paper. An improved containment-encoded intervals (ICEI) algorithm is presented. The main idea is to reduce the number of VC. The method is to get a variable in order to reduce the depth of the complete binary tree. ICEI improves the performance of the algorithm. The ICEI not only quickens the queried processing speed, but also improves the storage. ICEI is more adapted to deal with large-scale data stream processing information.The ICEI origin from that the CEI's structure has a large problem. The problem is that CEI creates a great number of VCs which is only related to r. So, in some conditions, a lot of VCs may be residual. In this case, much of storage will be wasted.For this reason, we come to think that: how can we save these storages? After months of research, I find two methods to improve this situation. One is to remove the deep levels except the bottom level. The other is to do special processing for the bottom level. We do not allocate storage space to the unit VCs except it has established an index.Next, we introduce the structure of the algorithm. We describe the whole design of ICEI, and the whole structure diagram is given. And in accordance with the effect of different functions, this system is divided into three modules: data preprocessing module, index creating module and data manipulation processing module.At last, I make research on the three modules of the ICEI in detail. Program all the functions. And then experiments are made.The third part is trial and concluded, pointing out our contribution and the prospects for future work. Chapter five is the experimental part. We do a large number of experiments for the two kinds of algorithms, and two kinds of algorithms are the same input data in every experiment,and then we analyze the results of the experimental. Three groups of representative experiments are made.In the first section, we introduced the experimental environment. In the experiments, we adopt the integer analog data. According to demand, I create a procedure to generate random numbers.In the second section, after the analysis in the experiment result, we get the conclusion: ICEI is a better algorithm than CEI. ICEI is based on fast query speed. And also, in many cases, the storage space of the ICEI is much better than that of CEI.So we can say that the ICEI is feasible and effective. It is more suitable to deal with high-speed, large, and real-time data stream information.Chapter six is the last chapter in my dissertation. Firstly it carried out a summary of this article. Secondly, it gives two thoughts which may be considered to improve the data stream processing algorithm.One is Margin Blurring Containment-encoded Interval (MBCEI). This approach is designed mainly to reduce the large number of VCs. Maybe we can use edge processing blurry strategy in order to eliminate the bottom of VC layer in ICEI.And then the storages will be saved because the numbers of VCs are reduced. However, the edge blur leads the information of the bottom index structure to become uncertain. This will increase the algorithm's time complexity. Therefore, we need to build a new algorithm with fast processing speed and less storage space.The other one is a new kind of indirect space indexing algorithm which combines the traditional spatial indexing technology with CES.This algorithm is a combination of CES/CEI with the Quad Tree. We must improve the traditional algorithm's query time, but also greatly reduce the storage of CES.
Keywords/Search Tags:Data Stream, CEI, ICEI, Index, Storage
PDF Full Text Request
Related items