Research On Web Log Data Analysis System Based On Hadoop

Posted on: 2019-07-30
Degree: Master
Type: Thesis
Country: China
Candidate: Z S Shi
GTID: 2428330572950212
Subject: Computer system architecture
Abstract/Summary:
Visits by Internet users generate huge volumes of Web logs. These logs record every user visit and contain considerable hidden commercial value, so an efficient Web log mining and analysis technology is needed. This thesis studies in depth the storage and preprocessing of massive Web logs and the data mining algorithms suited to them. It proposes a set of methods that access data quickly and efficiently, a preprocessing mechanism that improves data validity, and reliable, efficient data mining, and it implements a Web log analysis system based on these methods.

For storage, the thesis proposes a load balancing optimization algorithm based on a combined ranking of load factors. The approach takes full advantage of the Hadoop distributed platform, prevents access "hot spots" when reading data from HBase, and addresses the shortcomings of HBase's built-in load balancing algorithm. The algorithm considers each node's CPU usage, number of read requests, and read request response time, and adjusts the load according to the node's rank. By moving some Regions from heavily loaded nodes to lightly loaded ones, it evens out how often each node is accessed, so the load balance of the cluster is guaranteed to some extent.

For preprocessing, the thesis applies different data processing methods to the Web logs. Based on a detailed study of data cleaning, user identification, and session identification, it presents detailed Map and Reduce designs that ensure the integrity and validity of the logs to be mined. The preprocessing method is then parallelized for the Hadoop platform, which effectively improves the efficiency of Web log preprocessing.

For mining, the thesis uses a modified K-means algorithm with improved initial center selection to perform cluster analysis of Web log data. When selecting the initial cluster centers, the algorithm increases the probability that a point far from the current cluster centers becomes a new center, so that the centers are spread out as much as possible and selection is not completely random. Following the triangle inequality principle of Elkan's K-means algorithm, unnecessary distance calculations during clustering are skipped, which improves the convergence speed and clustering accuracy of the algorithm. Finally, the algorithm is parallelized to suit the Hadoop platform.

The proposed load balancing algorithm is validated by comparative tests on hot spot scenarios. The efficiency of the improved K-means algorithm is verified by comparing it with the standard K-means algorithm. Finally, an experiment compares log processing speed in a single-machine environment and in the Hadoop environment, verifying the advantage of the Hadoop platform for processing Web log data.
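
The initial-center selection described in the abstract can be illustrated with a small single-machine sketch. The abstract does not give the exact weighting, so this example assumes the well-known k-means++ rule, in which each point is chosen with probability proportional to its squared distance from the nearest center already picked. The function names `squared_distance` and `pick_initial_centers` are hypothetical and are not taken from the thesis, and the sketch covers neither the Elkan pruning step nor the Hadoop parallelization.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def pick_initial_centers(points, k, seed=None):
    """Pick k initial centers, favouring points far from those already chosen.

    Assumption: weights follow the k-means++ rule (squared distance to the
    nearest chosen center); the thesis may use a different weighting.
    """
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    rng = random.Random(seed)

    # The first center is drawn uniformly at random.
    centers = [rng.choice(points)]
    # nearest[i] is the squared distance from points[i] to its closest center.
    nearest = [squared_distance(p, centers[0]) for p in points]

    while len(centers) < k:
        if sum(nearest) == 0:
            # Every point coincides with a chosen center; fall back to uniform.
            candidate = rng.choice(points)
        else:
            # Far-away points get proportionally larger selection weights.
            candidate = rng.choices(points, weights=nearest, k=1)[0]
        centers.append(candidate)
        nearest = [
            min(d, squared_distance(p, candidate))
            for p, d in zip(points, nearest)
        ]
    return centers


if __name__ == "__main__":
    sample = [(0.0, 0.0), (0.1, 0.0), (10.0, 10.0), (10.1, 10.0), (-10.0, 5.0)]
    print(pick_initial_centers(sample, 3, seed=1))
```

Because a point that is already a center has weight zero, it cannot be drawn again while any other point remains at a nonzero distance, which is what keeps the chosen centers dispersed.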
Keywords/Search Tags: Hadoop, Web logs, Data mining, Load balancing, Improved K-means algorithm