Font Size: a A A

Study Of Mining Data Streams Based On Semi-Structured Data

Posted on:2012-06-24Degree:DoctorType:Dissertation
Country:ChinaCandidate:B FengFull Text:PDF
GTID:1488303356972679Subject:Circuits and Systems
Abstract/Summary:PDF Full Text Request
Data stream appears as one of the most important data types, and has been widely used in areas such as network traffic monitoring, click-sequence analysis. Comparing with the traditional database models, data stream usually has characters of large-amount, fast-speed, and time-sensitive. Firstly, mining algorithm in data streams should be as efficient as possible to meet the demand of large-amount and fast-speed. Secondly, time-sensitive feature requires that the algorithm reflects the latest trends, and eliminate the historical data's influence to the result.Meanwhile, semi-structured data appears widely in areas such as semantic web, chemical compounds analysis and social structure mining. Frequent pattern mining in such data can discover useful information, and also, can be a basic step on other data mining tasks. However, semi-structured data contains not only data content, but also relationship between the data. It would be hard to mine such kind of data using traditional methods.Semi-structured data is largely used in data streams. Because of the complexity in both kinds of mining, there are few algorithms mining semi-structured data in streams. We unify the two aspects of difficulty, and propose a way to mine the complex data in stream environment. In my research, a high-efficient algorithm of mining tree-shape semi-structured data in streams was firstly proposed. The algorithm solves the low-efficiency problem in stream mining, and is useful in the research promoting of this areas. In this thesis, we study the mining problem in four aspects, which are detailed shown as follows:1. Decay strategy to time-sensitive data streams Time-sensitive is the basic feature of stream mining. We propose a decay strategy to reflect this feature. The strategy contains two aspects: attenuation and compensative amplification. Attenuation is used to eliminate the influence of historical data. Compensative amplification is used to compensate the information loss caused by over-attenuation; Amplification can also strengthen the influence of the new arrival data. A parameter constraint between attenuation and amplification is proposed to guarantee the output of frequent patterns. Our experiments show that the decay strategy has good effect on time-sensitive gain.2. High-efficient batch model for mining data streamsUsually, data streams come in batches. The transaction-by-transaction adding and repeated mining way of traditional methods cause huge system overhead which can not meet the demand of high-efficient mining. We propose a batch-mining way. Firstly, we preprocess the new arrival data in a batch mode to get the intermediate result. And then, we add the intermediate result to the existing result with set operations to get the final result. This method largely reduces the computation exponentially and meets the nature of data streams which is much faster than traditional ones. This method can be applied to data mining tasks such as frequent pattern mining, classification and clustering.3. Connection-based algorithm mining static tree-shape dataA new frequent tree-shape data mining algorithm to static data CFTMiner is proposed based on the state-of-art algorithm DryadeParent. The algorithm uses a connection-way to generate candidates of sub-trees. To solve the information loss problem in DryadeParent, we propose a new initialization method for candidates. This method eliminates the information loss when initializing, and avoids repeated database scanning in later iterations. Our experiments show that CFTMiner is more efficient than DryadeParent both in time and space.4. Frequent sub-trees mining algorithms in data streamsThere are few algorithms mining semi-structured data in data streams. We make an integration of the researches talked above, and propose an algorithm called SFCLTreeMiner to mine closed frequent sub-trees in data streams efficiently. SFCLTreeMiner is based on sliding windows and decay strategy proposed by us. When new data stream arrives, we use CFTMiner for batch preprocess and get the intermediate result. For frequent tree-shape pattern mining, an Add-Remove algorithm for set operation computation is introduced, which incrementally mine frequent sub-trees in data streams. In our experiments, real-world data sets are used to test the efficiency of the proposed algorithm SFCLTreeMiner.
Keywords/Search Tags:Data stream, Semi-structured data, Frequent pattern, Decay strategy, Batch mode, Tree-shape data
PDF Full Text Request
Related items