Research And Implementation Of Web Log Analysis System Based On Hadoop

Posted on:2019-11-23

Degree:Master

Type:Thesis

Country:China

Candidate:F Y Chu

Full Text:PDF

GTID:2438330602458231

Subject:Engineering

Abstract/Summary:

Fast-developing Internet technology has dramatically changed people's lives. While people enjoy online services, they also continually contribute information to the network, and the log data generated by users online keeps growing. This data holds great potential value, so mining valuable information quickly from huge volumes of log data is of great significance. Log data is characterized by large volume, wide distribution, and low value density, which makes it hard to process and mine. Most traditional enterprise log analysis systems are still stand-alone and cannot meet the storage and computation demands of massive log data, and improving mining efficiency while reducing mining cost remains an urgent problem. To address these problems, this thesis proposes and designs a Web log analysis system based on Hadoop. The main research contents are as follows:

1. The thesis introduces the background and significance of the topic and reviews the current status of distributed computing and log mining. Hadoop technology, including the HDFS file system and the MapReduce parallel computing framework, is studied in depth. The Sqoop data migration tool and the Hive data warehouse are analyzed.

2. Web log mining theory and clustering algorithms are studied. The thesis analyzes the traditional K-means algorithm, proposes an improved parallel K-means algorithm, and applies the improved algorithm to the distributed Web log system to perform log clustering analysis.

3. Web log data is preprocessed on the Hadoop platform. Preprocessing includes data cleaning, user identification, session identification, and path completion, and the Map and Reduce designs of the preprocessing functions are given.

4. The thesis focuses on the implementation of the log analysis system. The function modules include log storage, log preprocessing, key indicator statistics, data display, and log mining. The log preprocessing module is particularly important and is implemented separately in Chapter 3. Log storage combines HDFS and MySQL, with the original data and the cleaned data stored in HDFS. Indicator statistics are computed with Hive SQL, and the results are loaded into MySQL by Sqoop, which is convenient for visualization. Log mining uses the improved parallel K-means algorithm to cluster registered users.

5. The system is set up and the experimental results are analyzed. Experiments show that the Hadoop-based Web log analysis system achieves its intended functions and completes the indicator statistics and visualization. The improved parallel K-means algorithm can cluster registered users, improves clustering efficiency, and can handle large-scale log data mining and analysis on a distributed system.
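
The abstract does not describe how the improved parallel K-means algorithm works, so the following is only a minimal sketch of the standard way K-means is usually split into map and reduce steps, not the thesis's improved version. It runs in plain Python rather than on Hadoop; the function names (`map_assign`, `reduce_mean`, `kmeans_iteration`, `kmeans`) and the sample points are hypothetical, and the in-memory dictionary only stands in for the MapReduce shuffle phase.

```python
from collections import defaultdict
from math import dist


def map_assign(point, centroids):
    """Map step: emit (index of the nearest centroid, (point, 1))."""
    nearest = min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))
    return nearest, (point, 1)


def reduce_mean(pairs):
    """Reduce step: sum the points and counts of one cluster, return its mean."""
    total = None
    count = 0
    for point, n in pairs:
        if total is None:
            total = list(point)
        else:
            total = [t + p for t, p in zip(total, point)]
        count += n
    return tuple(t / count for t in total)


def kmeans_iteration(points, centroids):
    """One iteration: map every point, group by key, reduce each group."""
    groups = defaultdict(list)  # stands in for the shuffle phase
    for point in points:
        key, value = map_assign(point, centroids)
        groups[key].append(value)
    # A cluster that received no points keeps its previous centroid.
    return [
        reduce_mean(groups[i]) if i in groups else centroids[i]
        for i in range(len(centroids))
    ]


def kmeans(points, centroids, max_iter=20, tol=1e-9):
    """Repeat iterations until no centroid moves by more than tol."""
    for _ in range(max_iter):
        updated = kmeans_iteration(points, centroids)
        if all(dist(a, b) <= tol for a, b in zip(updated, centroids)):
            return updated
        centroids = updated
    return centroids


if __name__ == "__main__":
    # Hypothetical two-dimensional feature vectors for four users.
    users = [(0, 0), (0, 2), (10, 10), (10, 12)]
    print(kmeans(users, [(0, 0), (10, 10)]))
```

In a real Hadoop job each iteration would typically be one MapReduce job, with the current centroids distributed to every mapper and the reducer output used as the centroids for the next iteration.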

Keywords/Search Tags:

Hadoop, Web log, Data mining, Log analysis, K-means

Related items

1	Based On Hadoop Data Mining Algorithm Analysis And Research
2	Research And Implementation Of Distributed Web Log Analysis System Based On Hadoop Platform
3	Research On Web Log Data Analysis System Based On Hadoop
4	Research And Implementation Of Web Log Analysis System Based On Hadoop
5	Research On Algorithm Of Data Mining Based On Hadoop
6	Research And Application Of Hadoop Distributed Clustering Mining Method Based On Virtual Machine
7	Research On Spatial Data Mining Based On Hadoop
8	Study On Key Techniques Of Distributed Data Mining Based On Hadoop
9	The Research And Design Of Distributed Data Mining System Based On Hadoop
10	Research And Application Of Data Mining And Visualization Based On Hadoop