Font Size: a A A

Research Of Algorithms On Frequent Pattern Mining Over Data Streams

Posted on:2007-09-25Degree:MasterType:Thesis
Country:ChinaCandidate:L F JiaFull Text:PDF
GTID:2178360182496280Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
Data mining, also called knowledge discovery, is to find potential, un-known but useful information from huge databases or datawarehouse. Datamining integrates databases, AI, machine learning and statistics etc. It is anew and valuable field in database research. Data mining tools could ana-lyzes data deeply and predict the future trend and actions.Recently,database technologies are developed quickly and appliedwidely. Data models are various, and the size is growing. The progress oftechnology makes the system generate continuous and fine data. Therefore,a new model of database is generated and applied in many industrial andbusiness application, which is called streaming data.In this paper, we systematically study the problem that how to find fre-quent itemsets in streaming data. Data mining in streaming data is a noveland hot research field. Traditional data mining algorithms are unsuitable forthe infinity and continuity of streaming data, so devising algorithms formining frequent itemsets in streaming data is challenging.To understand the problem of data mining in streaming data better, wefirst introduce it from two aspects: analyzing the streaming data and man-aging the streaming data. After presenting the task of data mining instreaming data, we define the streaming data model more precisely and thengive out some common streaming models. Through carefully differentiatingthe streaming data and traditional databases, we summarize four character-istics of streaming data: (1) data arrives real-timely;(2) the sequence of datais independent and hard to be influenced by systems;(3) the size is hugeand unpredictable;(4) the streaming data can only be scanned once. Basedon the discussion so far, whatever operations are taken on streaming data,the requirement of single scan conflicts with that of accurate query result.To balance this conflict, algorithms devised for streaming data are requiredto summarize the information by a single scan.Ming frequent itemsets in streaming data face the similar problem. Itsgoal is to find the complete set of frequent itemsets. In the traditional data-bases, many algorithms succeed in mining the whole set, and many scan thedatabase twice or even more times. The price of milt-scan is tolerable forthe traditional databases, but unacceptable for streaming data.Based on the introduction of streaming data, we could draw the conclu-sion that the nature of streaming data makes the algorithm which only re-quires scanning the whole dataset once be devised to support aggregationqueries on demand. In addition, this kind of algorithms usually owns a datastructure far smaller than the size of whole streaming data. In VLDB2002,Manku et al proposed the well-known algorithm called Lossy Counting al-gorithm. Lossy Counting algorithm defines the Estimation Mechanism tosolve the problem of mining frequent itemsets in streaming data. Althoughthe algorithm scans the data only once, it could give out the approximatefrequency by Estimation Mechanism. Based on the approximate informa-tion, Lossy Counting guarantees that all frequent itemsets in streaming datais discovered but also a few infrequent ones are included.Han et al proposed the FP-stream algorithm based on the well-knownFP-growth. FP-stream not only utilizes the batch-proceed and EstimationMechanism ideas, but also presents a new time-titled technology to differ-entiate the transactions with various time-sensitivity, which depends on thehypothesis that people are more interested in recent transactions than obso-lete ones.A more applicable problem concerning frequent itemsets attracts peo-ple's attention, which is mining recently frequent itemsets in streaming data.The core problem of mining recently frequent itemsets in the data stream ishow to differentiate the valuable information contained by recent and newtransactions from the useless information in outdated transactions. InVLDB2003, Teng et al proposed a algorithm that mines recently frequentitemsets by the sliding window model, which demonstrates both the neces-saryness and feasibility of this problem. Chang et al proposed the estDecalgorithm which took a novel regression strategy that all transaction instreaming data are allocate with weights which are lessened with lapse oftime. Although these two regression strategies succeeded in mining recentlyfrequent itemsets in streaming data, they still show the drawbacks in somecases.To avoid the drawbacks above, we propose a regression-based stream-ing data model where all transactions in streams are reflected into manymultiple time granularities according to the time-sensitivity. Generally, newtransactions are more time-sensitive and contain more valuable information,and thus should be attached more importance than old transactions. In themodel, new transactions are usually assigned to a fine and small timegranularity, and old ones are assigned to a coarse and big time granularity.When transactions transfers from a fine time granularity to a coarse one,which means transactions are in the progress of turning old and obsolete,their influence is weakened by fading factors to give prominence to the in-fluence of new transactions. That fading factor actually defines a desiredlife-time of the information of transactions in the data stream.In this paper, RFIMiner, a suffix-tree-based algorithm for discoveringrecently frequent itemsets over streams, is developed by us. By using a par-ticular property between suffix-trees, the connectivity of two nodes in a suf-fix-tree is indirectly but fast confirmed by judging whether correspondingidentical nodes synchronously occur in two specific suffix-trees. Frequentitemsets are discovered by the growth from their frequent subitemsets, socandidate itemsets are avoid. If we employ the traditional bottom-up ortop-down strategies to make itemsets grow, a huge amount of redundantcomputation is generated. To cater to special properties of suffix-trees, anovel itemset growth method is also presented to release time-consumed butuseless redundant computation.In this paper, we conduct some research on data mining in stream-ing data, especially frequent itemsets. We proposed a novel algorithm forrecently frequent itemsets in stream. We conduct detailed experiments toevaluate the performance of algorithm in several aspects. Results confirmthat the new method has an excellent scalability and the performance meetsthe condition which requires better quality and efficiency of mining recentlyfrequent itemsets in the data stream.
Keywords/Search Tags:Algorithms
PDF Full Text Request
Related items