Research On Parallel Analysis Technology Of Large Scale Web Log

Posted on:2017-11-28

Degree:Master

Type:Thesis

Country:China

Candidate:M L Shao

Full Text:PDF

GTID:2348330491464317

Subject:Computer Science and Technology

Abstract/Summary:

Web log analysis mines users' behavior patterns and access intentions, and its results are widely used in web page recommendation and link structure optimization. As log data grows in scale, scalable log analysis techniques have become a key research direction. Frequent pattern mining is the basis of log analysis, so this thesis focuses on scalable techniques for mining set frequent patterns (frequent itemsets) and sequential frequent patterns. It implements parallel frequent pattern mining over massive logs with disk-based MapReduce and memory-based Spark, addressing the partitioning of log data, load balancing, and the support counting of candidates in a distributed environment. The specific contributions are:
(1) A candidate-based transaction identification algorithm is proposed for transaction identification, the key stage of Web log preprocessing. The main idea of the algorithm is to trade space for time. Compared with algorithms that build a user access tree, it saves the time cost of traversing the tree.
(2) An approximately load-balanced parallel FP-Growth algorithm is proposed for mining set frequent patterns in Web logs. The upper bound of the maximum prefix path length of an item is used to measure the workload of mining that item's conditional pattern tree. This approximate workload is used to group items by load, and all nodes then partition the database in parallel according to the grouping result. Compared with the fully load-balanced parallel FP-Growth algorithm, it does not need to construct a global FP-Tree, which removes the single-point bottleneck in data partitioning, and it takes both load-balanced data partitioning and the overall computation process into account.
(3) A parallel AprioriAll algorithm based on Spark is proposed for mining sequential frequent patterns in Web logs. First, the data scans in each iteration are carried out directly on RDDs in memory, without scanning the hard disk. Second, intermediate results can also be persisted as RDDs, so the next step reads its data from memory. Finally, a data partitioning scheme based on a reduce-side join is proposed for the support counting of candidates in a distributed environment. Compared with the parallel AprioriAll algorithm based on MapReduce, the whole computation saves a large amount of disk I/O and data shuffling.
(4) Experiments show that the candidate-based transaction identification method can effectively handle transaction identification for large-scale logs, that the approximately load-balanced parallel FP-Growth algorithm achieves better performance and stability, and that the Spark-based parallel AprioriAll algorithm achieves better speedup and scalability.
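The load grouping described in item (2) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the thesis's implementation. The abstract does not give the exact workload formula or grouping strategy, so the sketch assumes that an item's rank in the frequency-descending item list serves as the upper bound on its prefix path length, and it swaps in a simple greedy "heaviest item to the least-loaded group" assignment. The function names (estimate_workloads, group_items, partition_transaction) are hypothetical.

```python
from collections import Counter
import heapq


def estimate_workloads(transactions, min_support):
    """Map each frequent item to an approximate mining workload.

    Assumption (not from the thesis): the workload of mining an item's
    conditional pattern tree is approximated by the item's rank in the
    frequency-descending list, since a prefix path of that item can
    contain at most the items ranked before it.
    """
    counts = Counter(item for t in transactions for item in set(t))
    frequent = [item for item, c in counts.items() if c >= min_support]
    # Descending support; ties broken by item name for determinism.
    frequent.sort(key=lambda item: (-counts[item], item))
    return {item: rank for rank, item in enumerate(frequent)}


def group_items(workloads, num_groups):
    """Greedy grouping (an assumed strategy): heaviest item first,
    always placed in the group with the smallest total workload."""
    heap = [(0, g) for g in range(num_groups)]
    heapq.heapify(heap)
    assignment = {}
    for item, load in sorted(workloads.items(), key=lambda kv: (-kv[1], kv[0])):
        total, group = heapq.heappop(heap)
        assignment[item] = group
        heapq.heappush(heap, (total + load, group))
    return assignment


def partition_transaction(transaction, workloads, assignment):
    """Split one transaction into per-group prefixes.

    Each node can run this independently on its own share of the log,
    so no global FP-Tree is needed to partition the database.
    """
    # Keep frequent items only, ordered from most to least frequent.
    ordered = sorted(
        (i for i in set(transaction) if i in workloads),
        key=lambda i: workloads[i],
    )
    seen_groups = set()
    output = []
    # Scan from the least frequent item; emit one prefix per group.
    for pos in range(len(ordered) - 1, -1, -1):
        group = assignment[ordered[pos]]
        if group not in seen_groups:
            seen_groups.add(group)
            output.append((group, ordered[: pos + 1]))
    return output


# Toy page-visit transactions (made-up data for illustration).
transactions = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["a"], ["b", "d"]]
workloads = estimate_workloads(transactions, min_support=2)
assignment = group_items(workloads, num_groups=2)
shards = [partition_transaction(t, workloads, assignment) for t in transactions]
```

Because the grouping depends only on item frequencies, it can be computed once from a counting pass and then shared with every node, which is the property item (2) relies on to avoid a single point of construction.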

Keywords/Search Tags:

Web log, Transaction identification, Frequent pattern, Parallelization

Related items

1	Research And Realization Of Frequent Travel Pattern Discovery Algorithm For Mass Travel Data
2	A Study On Algorithms Of Weighted Frequent Pattern Mining
3	Research On Mining Closed Frequent Pattern In Data Streams
4	Researches On Algorithms For Mining Top-K Frequent Patterns
5	The Research On The Related Problems Of Association Rule Mining
6	The Research And Realization Of Mining Frequent Patterns On Business Data Streams
7	Constraint-Based Frequent Pattern Mining:Novel Applications And New Techniques
8	Study And Design On The Algorithms Of Mining Association Rules
9	Study On Bit Stream Oriented Unknown Frame Head Identification
10	The Research Of Association Rules Algorithm Based On Frequent Pattern Tree