
Frequent Sequence Pattern Mining In Web Log

Posted on: 2008-07-12
Degree: Master
Type: Thesis
Country: China
Candidate: Y F Zhou
Full Text: PDF
GTID: 2178360215990922
Subject: Computer system architecture
Abstract/Summary:
With the rapid development and application of the Internet, Web log data is becoming more and more abundant, and the prominent problem is how to analyze and use this large amount of data. Web log mining is a new technique for Web information processing and an important application of data mining in the Internet domain. Frequent sequence mining is an important part of Web log mining: its results can be used to improve the structure and performance of Web sites. Sequence pattern mining was put forward by R. Agrawal and R. Srikant in 1995. Its goal is to find all frequent sub-sequences, that is, all sub-sequences whose frequency is no less than the given min_sup. The process of Web log mining includes three phases: data preprocessing, pattern finding and pattern analyzing. This thesis focuses on data preprocessing and pattern finding. Data preprocessing plays a key role in Web log mining, because it determines the performance of the pattern finding and pattern analyzing algorithms.
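The min_sup condition above can be illustrated with a minimal Python sketch. It assumes that the frequency of a sub-sequence is counted as the number of sessions containing it in order (not necessarily contiguously) and that min_sup is an absolute count; the function names and the sample sessions are hypothetical and are not taken from the thesis.

```python
from typing import List, Sequence


def is_subsequence(pattern: Sequence[str], session: Sequence[str]) -> bool:
    """Return True if the pages of `pattern` occur in `session` in the same order."""
    it = iter(session)
    # `page in it` advances the iterator, so order is enforced.
    return all(page in it for page in pattern)


def support(pattern: Sequence[str], sessions: List[Sequence[str]]) -> int:
    """Count the sessions that contain `pattern` as a sub-sequence."""
    return sum(1 for s in sessions if is_subsequence(pattern, s))


def is_frequent(pattern: Sequence[str], sessions: List[Sequence[str]], min_sup: int) -> bool:
    """A pattern is frequent if its support is no less than min_sup (assumed absolute)."""
    return support(pattern, sessions) >= min_sup


# Hypothetical session database, one list of visited pages per session.
sessions = [
    ["/home", "/products", "/cart"],
    ["/home", "/about", "/products"],
    ["/products", "/cart"],
]
```

With these sample sessions, `["/home", "/products"]` is contained in two of the three sessions, so it is frequent for a min_sup of 2 but not for a min_sup of 3.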
Data preprocessing for Web log mining comprises five phases: data cleaning, user identification, session identification, path completion and transaction identification. This thesis studies all of these phases and presents solutions to some special problems that arise in the process. Based on an analysis of the Web server log format, it gives a formal description of the concept of a session and, after analyzing the current session construction methods, proposes a time-referrer-based heuristic method for constructing sessions.
Sequence patterns and association rules are similar in some respects but also differ in others; this thesis compares the two, which clarifies the concept of a sequence pattern. The current frequent sequence mining algorithms are based on Apriori: to generate the frequent k-itemsets, they must scan the whole session database, so the cost is very high. This thesis instead adopts a suffix-tree-based method, which efficiently overcomes this deficiency of Apriori. A suffix tree is a compact tree data structure that stores the suffixes of a given string. The cost of constructing a suffix tree is linear in the length of the string, and the maximal frequent patterns are then found by depth-first search. Because both steps run in linear time, this method greatly improves efficiency. Experiments show that the time-referrer-based heuristic method improves the validity of user sessions, and that the suffix-tree-based method is more efficient and convenient.
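The abstract does not spell out the time-referrer-based heuristic, so the following Python sketch shows only one plausible reading and is not the author's method. It assumes that a request continues the current session when it arrives within a time threshold of the previous request and its referrer is a page already in that session; otherwise a new session starts. The `Request` type, the `build_sessions` function and the 30-minute threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Request:
    timestamp: float          # seconds since some reference point (assumed unit)
    url: str                  # requested page
    referrer: Optional[str]   # referring page from the log, or None if absent


def build_sessions(requests: List[Request], max_gap: float = 1800.0) -> List[List[Request]]:
    """Split one user's requests into sessions using time and referrer.

    max_gap is an assumed threshold (30 minutes), not a value from the thesis.
    """
    sessions: List[List[Request]] = []
    current: List[Request] = []
    for req in sorted(requests, key=lambda r: r.timestamp):
        if current:
            gap = req.timestamp - current[-1].timestamp
            seen = {r.url for r in current}
            # Continue the session only if both the time and referrer tests pass.
            if gap <= max_gap and req.referrer in seen:
                current.append(req)
                continue
            sessions.append(current)
        current = [req]
    if current:
        sessions.append(current)
    return sessions


# Hypothetical requests from a single, already identified user.
log = [
    Request(0, "/home", None),
    Request(60, "/products", "/home"),
    Request(120, "/cart", "/products"),
    Request(130, "/news", None),         # no referrer: starts a new session
    Request(5000, "/sports", "/news"),   # gap too large: starts a new session
]
result = [[r.url for r in s] for s in build_sessions(log)]
```

In this sample, the request for `/news` opens a new session despite the short gap because its referrer is not in the current session, and `/sports` opens another because the time threshold is exceeded. A purely time-based method would have merged the first four requests into one session.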
Keywords/Search Tags:Web Log Mining, Data Preprocessing, Session Identification, Maximal Frequent Sequence, Suffix Tree