With the rapid development of science and technology, many new applications, such as sensor monitoring, networking, and cloud computing, are emerging. These applications can produce time-varying, infinite, high-speed, unpredictable, and long-running data streams, which need to be processed online and without blocking. However, traditional data processing algorithms mainly handle persistent, finite, deterministic data and cannot adequately process high-speed data streams. This paper focuses on several algorithms for processing high-speed data streams, mainly related to window joins and correlation analysis over data streams. These algorithms have been applied to a traffic information management system. The research work consists of the following three aspects.

1. Several algorithms are proposed to handle simultaneous window joins over large-scale uncertain data streams. First, a distribution thread is developed to pre-process and distribute data, and several processing threads complete the data insertion, comparison, and join operations. The algorithms also use a hash index and a sorted list to accelerate the processing threads. To eliminate expired data, two strategies called FTD (fixed-time delete) and RTD (real-time delete) are proposed. FTD deletes expired data at fixed time intervals, while RTD is a piggyback method that deletes expired data while searching for new data. Compared with naive algorithms using the Oracle TimesTen in-memory database, the proposed algorithms are about 2-8 times faster.

2. Once the processing speed cannot keep up with the arrival of new data streams, the new data accumulates in memory and gradually leads to memory overflow. Two algorithms are proposed to solve this problem. The basic idea is to flush data to disk when memory overflows and to fetch data from disk for processing when memory is free.
The algorithms always keep recently arrived, high-probability data in memory and save low-probability data to disk, which guarantees that stream data arriving in real time can be joined and that high-probability results are output first.

3. This paper presents a new online correlation analysis algorithm called Base_win_CCA, which analyzes correlation over multidimensional data streams based on base windows. For every sliding window of a single stream, the algorithm keeps statistics of the large volume of raw data as base statistics and deletes the raw data. Furthermore, statistics over multiple sliding-window scopes can be calculated incrementally. Base_win_CCA overcomes the shortcomings of earlier work, which mainly concentrated on correlation analysis over a single sliding window and stored large amounts of raw data, and it is quite flexible and accurate. Theoretical analysis and experimental results show that the algorithm is particularly advantageous in scenarios with larger base windows, larger query windows, more streams, and more users.
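
The base-window idea in the third aspect can be illustrated with a minimal sketch. It is not the Base_win_CCA algorithm itself: as a simplifying assumption it computes the plain Pearson correlation between two streams rather than a multidimensional analysis, and the names `BaseStats` and `BaseWindowCorrelation` are hypothetical. It shows only how raw data can be discarded once per-base-window sums are kept, and how query windows of different sizes can then be answered by adding those sums.

```python
import math
from collections import deque
from dataclasses import dataclass


@dataclass
class BaseStats:
    """Sufficient statistics of one base window for two streams x and y."""
    n: int = 0
    sx: float = 0.0
    sy: float = 0.0
    sxx: float = 0.0
    syy: float = 0.0
    sxy: float = 0.0

    def add(self, x, y):
        # Fold one pair of values into the sums; the raw pair is not stored.
        self.n += 1
        self.sx += x
        self.sy += y
        self.sxx += x * x
        self.syy += y * y
        self.sxy += x * y

    def merge(self, other):
        # Statistics of adjacent base windows combine by simple addition.
        return BaseStats(self.n + other.n, self.sx + other.sx,
                         self.sy + other.sy, self.sxx + other.sxx,
                         self.syy + other.syy, self.sxy + other.sxy)


class BaseWindowCorrelation:
    """Hypothetical helper: keeps only per-base-window statistics."""

    def __init__(self, base_size, max_base_windows):
        self.base_size = base_size
        # Expired base windows fall off the left end automatically.
        self.closed = deque(maxlen=max_base_windows)
        self.current = BaseStats()

    def insert(self, x, y):
        self.current.add(x, y)
        if self.current.n == self.base_size:
            self.closed.append(self.current)
            self.current = BaseStats()

    def correlation(self, k):
        """Pearson correlation over the most recent k closed base windows."""
        if k <= 0:
            raise ValueError("k must be positive")
        total = BaseStats()
        for window in list(self.closed)[-k:]:
            total = total.merge(window)
        if total.n < 2:
            return None
        cov = total.sxy - total.sx * total.sy / total.n
        var_x = total.sxx - total.sx ** 2 / total.n
        var_y = total.syy - total.sy ** 2 / total.n
        if var_x <= 0 or var_y <= 0:
            return None  # correlation is undefined for a constant stream
        return cov / math.sqrt(var_x * var_y)
```

In this sketch, different users can query different window sizes through the `k` argument against the same stored statistics, and memory use depends on the number of base windows rather than on the number of raw tuples.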