Research On Parallel Analysis Technology Of Large Scale Web Log

Posted on: 2017-11-28
Degree: Master
Type: Thesis
Country: China
Candidate: M L Shao
GTID: 2348330491464317
Subject: Computer Science and Technology
Abstract/Summary:
Web log analysis mines users' behavior patterns and access intentions, and its results are widely used in web page recommendation and link structure optimization. As the scale of log data grows, scalable log analysis becomes the main research direction. Frequent pattern mining is the basis of log analysis, so this thesis focuses on scalable techniques for mining set frequent patterns and sequential frequent patterns. It implements parallel frequent pattern mining over massive logs using disk-based MapReduce and memory-based Spark, addressing the partitioning of log data, load balancing, and support counting of candidates in a distributed environment. The specific contributions are:
(1) A candidate-based transaction recognition algorithm is proposed for transaction identification, the key stage of Web log preprocessing. Its main idea is to trade space for time: compared with an algorithm that builds a user access tree, it saves the time cost of traversing the tree.
(2) An approximately load-balanced parallel FP-Growth algorithm is proposed for mining set frequent patterns in Web logs. The upper bound on the maximum prefix path length of an item is used to measure the workload of mining that item's conditional pattern tree. This approximate workload is used for load grouping, and all nodes partition the database in parallel according to the grouping results. Compared with the fully load-balanced parallel FP-Growth algorithm, it does not need to construct a global FP-Tree, which removes the single-point limitation in data partitioning, and it takes load balance into account in both the data partitioning and the overall computation.
(3) A parallel AprioriAll algorithm based on Spark is proposed for mining sequential frequent patterns in Web logs. First, the data scans in each iteration run directly on the in-memory RDD without scanning the hard disk. Second, the intermediate results of the computation can also be persisted as RDDs, so the next step reads its data from memory. Finally, a data partitioning scheme based on reduce-side join is proposed for support counting of candidates in a distributed environment. Compared with the parallel AprioriAll algorithm based on MapReduce, the whole computation saves a large amount of disk I/O and data shuffling.
(4) Finally, experiments verify that the candidate-based transaction identification method can effectively handle transaction processing of large-scale logs, that the approximately load-balanced parallel FP-Growth algorithm has better performance and stability, and that the Spark-based parallel AprioriAll algorithm achieves better speedup and scalability.
Keywords/Search Tags: Web log, Transaction identification, Frequent pattern, Parallelization
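
The load grouping step described in item (2) of the abstract can be illustrated with a small sketch. The abstract does not say how items are assigned to groups once their workloads are estimated, so the greedy "heaviest item to the currently lightest group" heuristic below is an assumption, not the thesis's method. The function name `group_items_by_load`, the page names, and the workload numbers are all hypothetical; in the thesis the workload estimate would come from the upper bound on each item's maximum prefix path length.

```python
import heapq


def group_items_by_load(workloads, num_groups):
    """Assign items to groups so the estimated load per group is roughly even.

    workloads: dict mapping item -> estimated mining workload (hypothetical values).
    Returns (groups, loads): the items in each group and each group's total load.
    """
    # Min-heap of (current_load, group_index); the lightest group is always on top.
    heap = [(0, g) for g in range(num_groups)]
    heapq.heapify(heap)
    groups = [[] for _ in range(num_groups)]

    # Place the heaviest items first; ties are broken by item name for determinism.
    for item, load in sorted(workloads.items(), key=lambda kv: (-kv[1], kv[0])):
        current, g = heapq.heappop(heap)
        groups[g].append(item)
        heapq.heappush(heap, (current + load, g))

    loads = [sum(workloads[i] for i in grp) for grp in groups]
    return groups, loads


# Hypothetical workload estimates for six pages, split across three workers.
estimated = {"/index": 9, "/cart": 7, "/search": 6, "/login": 5, "/help": 3, "/about": 2}
groups, loads = group_items_by_load(estimated, 3)
print(groups)
print(loads)
```

Because the grouping needs only one number per item, every node can compute the same grouping locally and partition its share of the database without a global FP-Tree, which is the property the abstract emphasizes.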