
Research And Implementation Of Distributed Web Log Analysis System Based On Hadoop Platform

Posted on: 2018-09-26  Degree: Master  Type: Thesis
Country: China  Candidate: X L Zeng  Full Text: PDF
GTID: 2358330515453960  Subject: Engineering
Abstract/Summary:
With the rapid development of science and technology and of the Internet, the network has become ever more closely tied to people's daily lives. Websites generate large volumes of log records every day, and users' access behavior is preserved in these web logs. Analyzing log data has therefore become an important means of understanding how a site operates and how users access it, and mining this valuable information helps enterprises provide better and more convenient services to users. At present, most log analysis systems run on a single machine and, when faced with massive web log data, are inadequate in both performance and storage capacity. To meet the needs of big data analysis, many data processing frameworks have emerged; Hadoop in particular, as a representative cloud computing technology, offers distributed storage and powerful computing capability, providing a good platform for storing and analyzing massive web logs.

This thesis first introduces the development of distributed technology and the background of web log mining. It then studies the Hadoop core components HDFS and MapReduce as well as the Hive data warehouse, covering the data storage principles of the HDFS distributed file system, its data access modes, its fault-tolerance mechanism, and the programming model of the MapReduce parallel computing framework. On this basis, a business data processing model suitable for web log analysis is established, and an efficient web log analysis system is designed on the Hadoop platform. The system comprises five modules: log storage, log collection, log preprocessing, key indicator statistics, and log mining. Log storage combines HDFS and MySQL, with HDFS holding both the raw logs and the cleaned logs. Log preprocessing uses MapReduce parallelization to standardize and clean the noisy data. The key indicators are analyzed with HQL scripts in the Hive data warehouse. Log mining applies an improved K-means algorithm on the MapReduce platform to analyze registered users, improving the algorithm's efficiency on massive data.

Finally, system testing shows that, compared with traditional stand-alone web log analysis, the system is greatly improved in collection, processing, storage, and mining; it not only reduces the workload of developers but also enhances efficiency.
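To make the preprocessing module more concrete, the following is a minimal sketch of a Hadoop MapReduce log-cleaning mapper. The Apache combined log format, the regular expression, the class name LogCleanMapper, and the filter rules for static resources are illustrative assumptions and are not taken from the thesis itself.

    // Minimal sketch of a log-cleaning mapper, assuming Apache combined-format
    // input; the regex and filter rules below are illustrative only.
    import java.io.IOException;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class LogCleanMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

        // Hypothetical pattern capturing ip, timestamp, method, url, status, bytes
        private static final Pattern LINE = Pattern.compile(
                "^(\\S+) \\S+ \\S+ \\[([^\\]]+)\\] \"(\\S+) (\\S+)[^\"]*\" (\\d{3}) (\\S+).*$");

        private final Text out = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            Matcher m = LINE.matcher(value.toString());
            if (!m.matches()) {
                return;                     // drop malformed (noisy) records
            }
            String url = m.group(4);
            String status = m.group(5);
            // Keep only successful page requests; drop static resources
            // that add noise to later statistics and mining.
            if (url.endsWith(".css") || url.endsWith(".js")
                    || url.endsWith(".png") || url.endsWith(".gif")
                    || !status.startsWith("2")) {
                return;
            }
            // Emit a tab-separated cleaned record: ip, time, method, url, status, bytes
            out.set(String.join("\t", m.group(1), m.group(2), m.group(3),
                    url, status, m.group(6)));
            context.write(out, NullWritable.get());
        }
    }

A mapper of this kind would feed the cleaned records into HDFS, where Hive HQL scripts and the MapReduce-based K-means step described above can consume them.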
Keywords/Search Tags: Hadoop, web log, log analysis, data mining, K-means