Research On Web Log Data Analysis System Based On Hadoop

Posted on:2019-07-30

Degree:Master

Type:Thesis

Country:China

Candidate:Z S Shi

Full Text:PDF

GTID:2428330572950212

Subject:Computer system architecture

Abstract/Summary:

Internet users' visits generate huge volumes of Web logs. These logs record every user visit and hold a great deal of commercially valuable information, so an efficient technology for mining and analyzing Web logs is needed. This thesis studies in depth the storage, preprocessing, and data mining of massive Web logs. It proposes algorithms that access data quickly and efficiently, a preprocessing mechanism that improves data validity, and a reliable and efficient data mining method, and it implements a Web log analysis system based on them.

For the storage of Web logs, this thesis proposes a load balancing optimization algorithm based on a combined ranking of load factors. The approach takes full advantage of the Hadoop distributed platform, prevents access "hot spots" when reading data from HBase, and remedies the defects of HBase's own load balancing algorithm. The algorithm considers each node's CPU usage, number of read requests, and read request response time, and adjusts the load according to the node's rank. By moving some Regions from heavily loaded nodes to lightly loaded nodes, it evens out how often each node is accessed, which keeps the cluster's load balanced to some extent.

For preprocessing, this thesis studies the data cleaning, user identification, and session identification stages in detail and designs the corresponding Map and Reduce steps to ensure the integrity and validity of the Web logs to be mined. The preprocessing method is then parallelized for the Hadoop platform, which improves the efficiency of Web log preprocessing.

For cluster analysis of Web log data, this thesis uses a K-means algorithm with modified selection of the initial center points. When the initial cluster centers are selected, the algorithm increases the probability that a point far from the current cluster centers becomes a new center, so that the centers are dispersed as much as possible and the selection is no longer completely random. Following the triangle inequality principle of Elkan's K-means algorithm, unnecessary distance calculations during clustering are skipped, which improves the convergence speed and clustering accuracy of the algorithm. Finally, the algorithm is parallelized to make it suitable for the Hadoop platform.

Experiments on hot spot scenarios verify the validity of the proposed load balancing algorithm. A comparison between the standard K-means algorithm and the improved K-means algorithm presented in this thesis verifies the efficiency of the improved algorithm. A further experiment compares log processing speed in a single-machine environment and in the Hadoop environment, and verifies the superiority of the Hadoop platform for processing Web log data.
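
The initial-center selection described in the abstract can be illustrated with a minimal single-machine sketch. It is not the thesis's implementation: the abstract does not give the exact weighting, so the sketch assumes the k-means++ rule, in which a point is drawn with probability proportional to its squared distance to the nearest center chosen so far. The names `squared_distance` and `choose_initial_centers` are hypothetical, and the Elkan pruning and the Hadoop parallelization are not shown.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def choose_initial_centers(points, k, seed=None):
    """Pick k initial centers, favoring points far from those already chosen.

    Assumption: the sampling weight of a point is its squared distance to
    the nearest chosen center (k-means++ weighting). The thesis may use a
    different weighting.
    """
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    rng = random.Random(seed)
    centers = [rng.choice(points)]  # first center: uniform at random
    # nearest[i] = squared distance from points[i] to its closest center so far
    nearest = [squared_distance(p, centers[0]) for p in points]
    while len(centers) < k:
        if sum(nearest) == 0:
            # Every point coincides with a chosen center; fall back to uniform.
            centers.append(rng.choice(points))
            continue
        # Weighted draw: far-away points are more likely, chosen ones have weight 0.
        index = rng.choices(range(len(points)), weights=nearest, k=1)[0]
        new_center = points[index]
        centers.append(new_center)
        nearest = [min(d, squared_distance(p, new_center))
                   for p, d in zip(points, nearest)]
    return centers


if __name__ == "__main__":
    sample = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
    print(choose_initial_centers(sample, 2, seed=42))
```

Because a point that is already a center has weight zero, it cannot be drawn again while any other point remains at a positive distance, which is what spreads the centers apart.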

Keywords/Search Tags:

Hadoop, Web logs, Data mining, Load balancing, Improved K-means algorithm

Related items

1	Research On The Vertical FP-growth Mining Algorithm Based On Hadoop With Load Balancing
2	Research On Parallel Association Rule Mining Algorithm Based On Hadoop Platform
3	Research On Load Balancing Algorithm For Scheduling Based On Hadoop
4	Research And Improvement Of Load Balancing Optimization Under The Hadoop Platform
5	Research And Implementation Of Load Balancing Algorithm For Offline Data Migration
6	Research On Energy-aware Load Balancing In Heterogeneous Hadoop Cluster
7	Design And Implementation Of Video Logs Analysis System Based On Hadoop
8	Based On Hadoop Data Mining Algorithm Analysis And Research
9	Research On Algorithm Of Data Mining Based On Hadoop
10	Research And Application Of Hadoop Distributed Clustering Mining Method Based On Virtual Machine