Research On Technique Of Web Log Mining

Posted on: 2009-07-08
Degree: Master
Type: Thesis
Country: China
Candidate: Y J Guo
GTID: 2178360245989325
Subject: Computer application technology
Abstract/Summary:
With the rapid development and growing popularity of the Internet, ever more Web log data is being collected. How to analyze and use this huge volume of data has become a pressing problem. Web log mining is a new technique for network information processing and an important application of data mining on the Internet: it applies data mining to Web server logs to discover users' access patterns and behavior. This helps to improve a Web site's structure, access quality, and performance.

Data preprocessing is an important step of Web log mining, since it determines the performance of the pattern recognition and pattern analysis algorithms. Web log preprocessing consists of data cleaning, user identification, session identification, path completion, and transaction identification. This thesis studies each step of Web log preprocessing and introduces the relevant methods for each. Based on an analysis of current session construction algorithms, a method that establishes sessions by combining two time windows is presented.

Frequent sequential pattern mining is an important research field within Web log mining. Because Apriori-like sequential pattern mining algorithms must scan the sequence database multiple times and generate enormous candidate sets, this thesis stores transaction sequences in a WAP-Tree, whose construction requires only two scans of the database. The WAP-Mine algorithm, however, builds conditional sub-trees recursively, which consumes memory. To address this deficiency of WAP-Mine, a new WAP-Tree-based algorithm, NWAP-Mine, is proposed, and its validity is confirmed by experiments. Because existing sequential pattern mining algorithms do not weight Web pages, a definition of interest level based on average dwell time is proposed.
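The two-scan WAP-Tree construction mentioned above can be illustrated with a minimal Python sketch. It is not the thesis's implementation and does not cover WAP-Mine or NWAP-Mine. The names `WapNode` and `build_wap_tree`, the use of an absolute `min_support` count, and the sample sessions are assumptions made for illustration; the header table is kept as plain lists rather than linked node queues.

```python
from collections import Counter


class WapNode:
    """One tree node: a page label, a visit count, and child nodes."""

    def __init__(self, label):
        self.label = label
        self.count = 0
        self.children = {}


def build_wap_tree(sequences, min_support):
    """Build a WAP-Tree from access sequences in two passes over the data."""
    # Pass 1: count the number of sequences in which each page occurs.
    support = Counter()
    for seq in sequences:
        support.update(set(seq))
    frequent = {page for page, n in support.items() if n >= min_support}

    # Pass 2: insert each sequence as a path, skipping infrequent pages.
    root = WapNode(None)
    header = {page: [] for page in frequent}  # page -> nodes with that label
    for seq in sequences:
        node = root
        for page in seq:
            if page not in frequent:
                continue
            child = node.children.get(page)
            if child is None:
                child = WapNode(page)
                node.children[page] = child
                header[page].append(child)
            child.count += 1
            node = child
    return root, header


# Hypothetical access sequences; each letter stands for one page.
sessions = [
    ["a", "b", "d", "a", "c"],
    ["e", "a", "e", "b", "c", "a", "c"],
    ["b", "a", "b", "f", "a", "e", "c"],
    ["a", "f", "b", "a", "c", "f", "c"],
]
root, header = build_wap_tree(sessions, min_support=3)
```

Sequences that share a prefix share a path in the tree, so the shared prefix is stored once with a count. A mining algorithm can then work on the tree and the header table without rescanning the original database.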
To address the deficiencies of existing Web page interest measures, an improved Web page interest level is proposed. This interest level is used as the weight in a weighted sequential pattern mining algorithm to find the access paths that interest users. Experiments demonstrate that using the improved interest measure in sequential pattern mining produces access patterns that better reflect users' access behavior.
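The abstract does not give the formula for the interest level, so the following Python sketch shows only one plausible reading of a dwell-time-based measure. Everything in it is an assumption: dwell time is taken as the gap between consecutive requests in a session, the interest level is the page's average dwell time divided by the largest average, and a path's weight is the mean interest of its pages. The function names and sample data are hypothetical.

```python
from collections import defaultdict


def average_dwell_times(sessions):
    """Average dwell time per page; a session is a list of (page, timestamp)."""
    totals = defaultdict(float)
    visits = defaultdict(int)
    for session in sessions:
        # The last request has no successor, so its dwell time is unknown
        # and is skipped (an assumption of this sketch).
        for (page, start), (_, following) in zip(session, session[1:]):
            totals[page] += following - start
            visits[page] += 1
    return {page: totals[page] / visits[page] for page in totals}


def interest_levels(sessions):
    """Assumed measure: average dwell time scaled into the range [0, 1]."""
    avg = average_dwell_times(sessions)
    longest = max(avg.values(), default=0.0)
    if longest == 0:
        return {page: 0.0 for page in avg}
    return {page: t / longest for page, t in avg.items()}


def path_weight(path, interest):
    """Assumed path weight: mean interest of its pages (unseen pages count 0)."""
    if not path:
        return 0.0
    return sum(interest.get(page, 0.0) for page in path) / len(path)


# Hypothetical sessions; timestamps are seconds since the session started.
sessions = [
    [("index", 0), ("news", 10), ("article", 40), ("index", 160)],
    [("index", 0), ("article", 20), ("news", 80)],
]
interest = interest_levels(sessions)
weight = path_weight(["index", "article"], interest)
```

In a weighted sequential pattern miner, a weight of this kind could scale a path's support so that paths through pages where users linger rank above paths through pages they merely pass through.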
Keywords/Search Tags: web log mining, data preprocessing, sequential pattern, interest measure, WAP-Tree