
Study On Duplication Detection Of Data Streams

Posted on: 2012-09-13
Degree: Master
Type: Thesis
Country: China
Candidate: S Xu
Full Text: PDF
GTID: 2178330338953835
Subject: Computer software and theory
Abstract/Summary:
Data streams are a widespread form of data: examples include stock information in financial markets, data transmitted over networks, call records in telecommunications, and web click data. Because data streams are unbounded, real-time, and high-speed, they pose a great challenge to stream analysis and data mining. In particular, repetitive data in a stream, i.e. incorrect duplicates caused by software and hardware failures and by the topology structure, strongly affects association analysis, correlation analysis, and statistical analysis of data streams. This thesis therefore focuses on techniques for detecting duplicate data in data streams.

First, the thesis reviews related work, including data streams and their models, summary techniques, and existing duplicate data detection techniques. It then points out problems with SBF, namely that it may fail to reduce the misjudgment rate and may increase the waste of system resources.

Second, the duplicate detection studied here targets data streams that are high-speed, real-time, massive, and changing, so the detection method must support online processing and real-time response. The thesis therefore proposes ABF (Adaptive Bloom Filter), an adaptive duplicate data detection technique based on the Bloom Filter. The research mainly includes:

(1) A duplicate data detection method based on error constraints in the Bloom Filter. The method uses a summary structure over the sliding-window data and splits the window into data blocks in order to adapt to changes in the window. The thesis also gives a theory for determining the Bloom Filter length of each data block under a user-specified misjudgment rate. The method guarantees the misjudgment rate specified by the user while simplifying the update of the data summary and speeding up duplicate detection.

(2) A self-adaptive window sliding strategy that reflects changes in the duplicate data. It automatically adjusts the future window size and sliding step length according to the intervals at which duplicates are detected, improving detection accuracy and efficiency. Analysis shows that the method produces only false positives and no false negatives.

(3) A duplicate data detection technique based on ABF for distributed data stream environments. The technique keeps a copy of each BF on the other machines and transfers the bits mapped by non-duplicate data to those copies. Duplicates are then detected by comparing incoming data against the BF copies held on each machine. The technique guarantees the same misjudgment rate as a centralized system, with higher space utilization and lower network communication cost.

Finally, theoretical analysis and experimental results show that the algorithm achieves higher precision and lower time and space complexity, which makes it more suitable for data stream applications.
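The idea in item (1), a sliding window split into blocks with one Bloom Filter per block sized from a user-specified misjudgment rate, can be illustrated with a minimal Python sketch. It is not the thesis's ABF algorithm: it assumes the textbook sizing formula m = -n ln p / (ln 2)^2 rather than the length theory derived in the thesis, it uses a fixed window instead of the adaptive sliding strategy, and it divides the target rate evenly across blocks (a union bound). The names `bloom_size`, `BloomFilter`, and `BlockedWindowDetector` are hypothetical.

```python
import hashlib
import math
from collections import deque


def bloom_size(n_items, fp_rate):
    """Textbook sizing: bit count m and hash count k for n items at rate p."""
    m = math.ceil(-n_items * math.log(fp_rate) / (math.log(2) ** 2))
    k = max(1, round(m / n_items * math.log(2)))
    return m, k


class BloomFilter:
    """Plain Bloom filter; positions come from double hashing of SHA-256."""

    def __init__(self, n_items, fp_rate):
        self.m, self.k = bloom_size(n_items, fp_rate)
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item):
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # keep the step odd
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))


class BlockedWindowDetector:
    """Fixed-size sliding window of blocks, one Bloom filter per block.

    Assumption (not from the thesis): the per-block rate is the target
    rate divided by the number of blocks, so the union bound keeps the
    overall false-positive rate at or below the target.
    """

    def __init__(self, block_size, num_blocks, fp_rate):
        self.block_size = block_size
        self.per_block_rate = fp_rate / num_blocks
        self.blocks = deque(maxlen=num_blocks)  # oldest block drops off
        self._new_block()

    def _new_block(self):
        self.blocks.append(BloomFilter(self.block_size, self.per_block_rate))
        self.count_in_block = 0

    def seen(self, item):
        """Return True if item looks like a duplicate within the window."""
        duplicate = any(item in bf for bf in self.blocks)
        if not duplicate:
            if self.count_in_block == self.block_size:
                self._new_block()
            self.blocks[-1].add(item)
            self.count_in_block += 1
        return duplicate
```

Expiring a whole block at a time means the summary is updated by discarding one filter rather than clearing individual bits, which is one plausible reading of how splitting the window simplifies updates. Because bits are never cleared inside a live block, an item still in the window is always reported as seen, so within the window this sketch can produce false positives but no false negatives.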
Keywords/Search Tags:Data streams, Duplicate data detection, Bloom Filter, Sliding window