The Research Of Mining Access Sequential Pattern In WebLog

Posted on: 2008-05-04 | Degree: Master | Type: Thesis
Country: China | Candidate: L Zhu | Full Text: PDF
GTID: 2178360212991292 | Subject: Computer applications
Abstract/Summary:
Data mining is devoted to analyzing and understanding data and to discovering the potential knowledge hidden in it, so that useful information can be extracted from data volumes that are growing explosively. In recent years, Web applications have become active in all aspects of social life, and the World Wide Web has become the world's largest information center. However, much useful information is buried in this mass of data, so applying data mining technology to the analysis of Web data has become urgent. Web mining emerged in response and has become one of the most important applications of data mining.

Depending on the kind of Web data of interest, Web mining is generally divided into three categories: Web content mining, Web structure mining, and Web usage mining. Web usage mining includes association rules, sequential patterns, and so on. A sequential pattern is a pattern that occurs with relatively high frequency with respect to time or some other ordering.

The main work of this thesis is the study of Web log mining, a branch of Web usage mining. Although general sequential pattern mining algorithms can be used to mine sequential patterns from Web server log files, the Web log sequence database obtained after pretreatment differs from a general sequence database in sequence structure and length. Therefore, to accommodate these particular characteristics and to improve mining efficiency, Web log sequential pattern mining algorithms need to be adapted and enhanced on the basis of the general algorithms.

At present, the main challenge in mining access sequential patterns from Web logs is the high processing cost caused by the large amount of data. In this thesis, by combining two relatively efficient algorithms, SPAM and PrefixSpan, we propose a new algorithm, SPAM-FPT.
Efficient support counting and candidate sequence generation are achieved with three techniques: (1) A new storage structure, the FPT representation, is devised as an improvement on the vertical bitmap of the SPAM algorithm; it compresses and records every sequence, so support can be calculated efficiently by enumerating the nonzero members of a sequence's FPT. (2) A new sequence is extended from the FPT representations of two sequences, which avoids the costly generation of a large number of candidate sequences. (3) The prefix idea of the PrefixSpan algorithm is adopted: tracking the prefix narrows the scanning scope of the sequence database. To accomplish this, no real projected database needs to be built; only the FPT representations of the frequent sequences of length 1 are modified. Finally, FPT-Miner, an implementation of a system based on the SPAM-FPT algorithm on the Windows 2000 Server platform, is presented, together with a performance evaluation of three algorithms: SPAM-FPT, SPAM, and PrefixSpan.
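
The abstract does not specify the FPT structure, so the following is only a minimal Python sketch of the general idea behind the three techniques, under stated assumptions. It is not the thesis's SPAM-FPT algorithm. Each page gets a vertical position index (a stand-in for SPAM's bitmap), support is the number of sequences with a non-empty entry, and a pattern is grown by combining its own representation with that of a single page, without materializing a projected database. All names (`build_index`, `extend`, `mine`) are hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict


def build_index(db):
    """Vertical index: page -> {sequence id -> sorted positions of the page}."""
    index = defaultdict(lambda: defaultdict(list))
    for sid, seq in enumerate(db):
        for pos, page in enumerate(seq):
            index[page][sid].append(pos)
    return index


def extend(pattern_ends, page_positions):
    """Combine a pattern's representation with a page's representation.

    pattern_ends maps sequence id -> earliest position at which the pattern
    can end. The result keeps only sequences where the page occurs strictly
    after that position, i.e. sequences that contain the extended pattern.
    """
    result = {}
    for sid, end in pattern_ends.items():
        positions = page_positions.get(sid)
        if not positions:
            continue
        i = bisect_right(positions, end)  # first occurrence after `end`
        if i < len(positions):
            result[sid] = positions[i]
    return result


def mine(db, min_support):
    """Return {pattern tuple: support} for all frequent access patterns."""
    index = build_index(db)
    # Only frequent length-1 pages are ever used to extend a prefix.
    frequent = {p: pos for p, pos in index.items() if len(pos) >= min_support}
    results = {}

    def grow(pattern, ends):
        results[pattern] = len(ends)  # support = number of non-empty entries
        for page, positions in frequent.items():
            new_ends = extend(ends, positions)
            if len(new_ends) >= min_support:
                grow(pattern + (page,), new_ends)

    for page, positions in frequent.items():
        grow((page,), {sid: pos[0] for sid, pos in positions.items()})
    return results
```

Calling `mine(db, min_support)` on a list of page-visit sequences returns each frequent pattern with its support. Because the search only ever extends a prefix that is already frequent, infrequent candidates are never generated, which mirrors the prefix-tracking idea described above.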
Keywords/Search Tags:Data Mining, Sequential Pattern Mining, Web Log, Frequent Sequence, SPAM-FPT, Pretreatment