Font Size: a A A

Research On Frequent Traversal Sequence Mining Based On Web Click Streams

Posted on:2007-10-20Degree:MasterType:Thesis
Country:ChinaCandidate:X J ZhangFull Text:PDF
GTID:2178360212495463Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
By analyzing web usage mining situation of foreign and domain, we found that the previous algorithms exist many problems for mining frequent traversal sequence(FTS) in static and dynamic web clickstreams environment. These algorithms may be inefficient in dealing with the following four problems. Firstly, Web association rules mining often ignores the temporal character of sessions. Secondly, simple algorithms of frequent sequences mining may generate very huge candidate sequences, since those algorithms do not adopt the constraint methods. Thirdly, most incremental mining algorithms just consider inserting the transactions into the original traversal sequences. However, few algorithms refer to the situation that when the new traversal sequences are inserted into and the old traversal sequences are deleted from session database. Fourthly, there are some difficulties to extend the mining technique to dynamic web clickstreams. The main difficulty is that the existing False-positive based algorithms can not overcome the dilemma between the recall of sequence and the precision of mining result.Firstly, two efficient algorithms are proposed to discover maximum frequent traversal sequences. The two algorithms all use the bidirectional constraint strategy to constrain each page and sequence in session database. The first algorithm successfully limits the appearing frequency of the meaningless page, and the second algorithm solves the problems caused by the longer sequence. Two kinds of algorithms outperform GENmax, FPmax, SPADE, MSPS and GSP for mining maximum frequent traversal sequences.Next, according to sequence model, this paper proposes a novel algorithm, to account for the incremental mining problems. The algorithm utilizes the constraint strategy and web site structure to find the new FTS from the inserted and deleted part of the database such that the size of candidate sequence sets andsearch space of FTS can be reduced. The performance of the algorithm is better than IncSpan and MFTP when the previous mining results and constraint are used.Finally, based on False-Negative approach and time-sensitive sliding window, this paper proposes an efficient algorithm to balance the conflict between recall and precision in dynamic web click streams. The algorithm adopts the weighted harmonic count function with constrain to calculate the support count of each traversal sequence, and adjusts the value of relaxation ratioρby changing the regulatory factorξ. The algorithm can simultaneously meet recall and precision, and outperforms Lossy Counting and MineSW.We implement the above four algorithms with Java and C++. All of our experiments are performed on the four datasets, M0606A, M0607B, BMS-WebView-1 and BMS-WebView-2 to mine maximum frequent traversal sequences and FTS. The experimental results show the feasibility and effectiveness of our algorithms.
Keywords/Search Tags:Web click streams, Frequent traversal sequence, Dwell time, Session, Bidirectional constraint, Time-sensitive sliding windows
PDF Full Text Request
Related items