Font Size: a A A

Research On The Algorithm For Mining Web Sequential Patterns From Web Log

Posted on:2016-05-22Degree:MasterType:Thesis
Country:ChinaCandidate:J P XiangFull Text:PDF
GTID:2308330479984839Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
In recent years, with the rapid development of the Internet and the increment of e-commerce sites, a lot of web log data has accumulated. In the web log mining area, it has become a research hotspot that how to find the habits and hobbies of behaviors which happen when users are accessing web sites, from such a huge number of log files.Web frequent sequential pattern mining, which mines the frequent pages and paths produced when users access Web sites, is an important research branch of Web log mining. The mining of these interesting patterns is important for the website to adjust their organization structure to adapt to habits of user’s access. In this competitive e-commerce environment, optimizing the organization structure of the website is also an important mean to attract and retain users. And the mining of these interesting patterns is beneficial to provide users with personalized service and recommend goods they may like. In addition, advertisements, which are put in mining frequent path, can make themselves accessed by more users. Therefore, the research of Web frequent sequential patterns mining is very important for the development of e-commerce sites. The main research contents and results of this paper are as follows:1 The RLDWAP, which has significant improvement in space complexity, is proposed based on the PLWAP in this paper. The PLWAP uses the position code to determine the position relationship of two nodes in the tree in order to avoid the defect of the WAP which needs to recursively construct a large number of condition trees. The position code of the current node is coded by appending 1 to the end of position code of each father node in the tree, while the position node of its right brother node is coded by appending 0 to the end of the position code of the adjacent left brother node. Thus,when the height or width of the tree is large, the corresponding position code becomes very long. In addition, the storage of the each node’s position code needs to consume more space, and reading the code needs to traverse the pointers. Aiming at this problem,an improved algorithm is proposed based on the PLWAP, which adopts the RLD traversal(traversing the Right subtree firstly, then the Left subtree and the Root node lastly) WAP- Tree, and records the last descendant child of the current node during the traversal. With the RLD traversal sequence number and the traversal sequence number of the last descendant node of the current node, the position relationship of the two nodes can be determined. Thus, the storage space of each node and the time todetermine the position relationship between the two nodes can been reduced.2 In order to adjust to different requirements, the BCWAP, which has significant improvement in time complexity, is proposed based on the PLWAP in this paper. The analysis of the construction procedure of the PLWAP’s header table shows that it remains the same in each recursive mining procedure. And the header table connects each item set and all of the nodes with the same identity in the tree to a queue. Each time searching the first node of the item set in the suffix tree needs to traverse the queue from the first node, and it is not necessary to judge above the nodes of suffix tree.Aiming at this problem, combining with the position relationship that identified by the RLD traversal sequence number, the improved algorithm, BCWAP, reconstructs the suffix tree header table in each recursive mining procedure, and only connects each item set in the header table with its corresponding first node in the tree to a queue, which reduces the time to search first node in the suffix tree of the current frequent sequence.Because of the construction of header table in each recursive procedure, performance of the BCWAP in space is between the PLWAP and the WAP.3 Performance analysises and experimental demonstrations on the improved algorithms. The contrast experiments and result analysises between the improved algorithms and the PLWAP, the WAP and the NGCWAP, separately in the aspects of time complexity and space complexity, demonstrate the accuracy and effectiveness of two improved algorithms.
Keywords/Search Tags:Web log, frequent sequence, web sequence pattern mining, WAP, PLWAP
PDF Full Text Request
Related items