Font Size: a A A

Research On Some Key Techniques Of Distributed Data Stream For Query Processing

Posted on:2007-05-07Degree:DoctorType:Dissertation
Country:ChinaCandidate:Y YangFull Text:PDF
GTID:1118360215977611Subject:Control theory and control engineering
Abstract/Summary:PDF Full Text Request
With the development of large network and Web application, a new kind of data—Data stream come into being in the application areas of monitor and sensor networks in-break detecting, communication data management stock analysis and so on. These data sequence are relational tuples,sensor values,network parameters,phone records and share data etc. Different from traditional database application model, data stream model have characteristics as below: (1) Continuous, real time arrival; (2)Vast, unrestrainted and hard estimated; (3) Unless storing especially, they can't be taken out to handle again, namely one-pass processing. Otherwise, withdrawing data again is very expensive. How to store,query and mine these data streans has become the hot issues in the field of international database currently.In many practical application, such as decision support system, query optimization, the exact values are not necessary to obtain but approximate value. Therefore, the core issue of data stream management and analysis is designing one-pass algorithm, namely the feature structure of data streamsynopsis data structure are unceasingly updated in the least memory than data scale in order to quickly achieve approximation answer in time at all hours. If the length of data stream is N, the size of synopsis structure are not excess to O(polylog(N)), and the processing time of each group of stream are not excess to O(polylog(N)).The primary query in traditional database is one-time queries, namely the system give answer according to the snapshot of data set. But in data streams, the query is continuous and long-run because the query answers are returned incessantly along with new data. The data streams are not static but ceaseless insertion and update. Users need dynamic monitoring along with the changed data streams but not static result of some time.The exist research of data stream are centralized stream system. But the essence of data stream is distributed. More and more practical data in sensor network,communication,Internet flow analysis and Web log etc, are transported from teledata sources. Therefore, we need to construct the middleware of distributed query processing for data streams to support above application.P2P technique is used to build a huge distributed network for the processing of large number of information by internet terminals. These computers (Peers) are equal in independent computing and processing to solve the bottleneck issue of single server. P2P can build the good platform of middleware, which validly provide real-time query answer quickly, by means of its good performance of scalability,load balance,high efficiency and dynamic adaptability based on content-indexing in order to reduce the overhead of network bandwidth and links.Focusing on distributed data streams, this thesis analyzes the domestic and international present research. From the current existent problem and shortage, we study the data stream's characteristic based on time variety, monitor the real time streams, investigate on the presentation and modeling methods for stream's variety, mine the data evolution with the trend of the variety, and forecast the future trend of data streams. In large-scale distributed environment, we study the distributed approximate algorithms for query processing and mining with minimum complexity of time and space. On one hand, we utilize the wavele coefficients by DWT to construct histogram with good precision, and extent to multi-dimension histogram to solve the traditional difficulty, and build a composite indexing structure with data stream synopsis, at the same time, we build wavelet-neural network model with the wavelet resolution to solve the tradition difficulty of hidden number of nodes, and elementaryly establish the forecast model of time sequences. On the other hand, we solve the difficulty of aggregate query in data streams by utilizing sketch tenique. We study the algorithm of frequent items in distributed data streams and set the precision gradient to reduce the load of communication. Meanwhile, with the platform of Chord network and protocol in P2P system, we study the middleware of distributed query processing and mining data streams, investigate on key functions and techniques such as query routing,lookup and location,indexing and mapping of data stream synopsis in the peer computing system. That is concrete to say, this thesis has four central innovations as follow:(1) Studying the query processing algorithm of distributed data streams based on wavelet.The first step is according to the theory of discrete wavelet transition (DWT), we decompose Haar wavelets in order to obtain the important wavelet coefficients.Then, we analyze the computing model of data streams, formalize the stream's query model. Based on above research, we provide a kind of new method to construct synopsis data structure of data streams, establishing a compound indexing construction to handle the inner product queries and similarity queries. In addition, combining the good performance of time-frequent localization in wavelet neural network (WNN) and self-study in neural network (NN), we establish the forecast model of time sequence.(2) Studying the query processing algorithm of distributed data streams based on sketching technique. Firstly, through analyzing the approximate sketch-based processing methods, and utilizing the random techniques, we computing the pseudo sketching synopisis to the on-time arrival streams. Based on above research, we provide a novel technique of sketching partition through intelligent division for the value range of attribute and reduction of the self-join scale to fairly allocate memory space. In result, we ensure the good estimated quantity of approximate query.(3) Studying the mining algorithm of frequent items in large scale distributed data streams. Firstly, we formally define the mining algorithm of frequent items base on time point. Then, we take forward to the Distributed Merging Algorithm (DMA), a hierarchical structure to discover the stream synopsies from all leaves to root node, through setting the precision gradient to optimize the communication load and overhead of distributed system.(4) Studying the query processing middleware of distributed data stream based on P2P system. The first step is utilizing the P2P characteristic to improve the procedure of location searching and stabilization.Then, the data stream synopisis are mapping to the improved Chord node and content-based routing are extended to distributed stream index. Based on above research, the continuous approximate query are providing to improve the efficiency of query and mining processing for data stream. In meanwhile, the optimization method of Minimum Bounded Rectangle (MBR) is introduced to reduce center data and minimize the computing resource of network link. Through adjusting adaptively the high and low bounding of MBR, The precision of system are improving to achieve the query answer in time.By means of the sub-project "Web Services-based E-commerce" in 863 national project "Web Services-based database new technology", the research work in this thesis utilize large number of domestic and international references,theory analysis and experiment test. Estimated experiment and analysis illustrate that above algorithms and the middleware have good performance, which can be maked use of in running and development environment of data stream applications.
Keywords/Search Tags:Distributed Data Stream, Synopsis Data Structure, DWT Semantic Query, Consistent Hash, Middleware
PDF Full Text Request
Related items