
Design And Implementation Of Hadoop Cluster Web Log Analysis System Based On Eucalyptus

Posted on: 2017-12-05
Degree: Master
Type: Thesis
Country: China
Candidate: Y P Wang
Full Text: PDF
GTID: 2348330518494780
Subject: Computer technology
Abstract/Summary:
With the rapid development of the Internet, the volume of Web logs keeps growing, and these logs contain a great deal of information from which valuable insights can be extracted through analysis. As the amount of Web log data increases, traditional single-machine analysis has reached a bottleneck: once the data exceeds a certain size, the computing power of a single node can no longer meet demand. This thesis designs and implements a Hadoop-based Web log analysis system deployed on a Eucalyptus cluster. The system uses cloud computing and distributed technology to analyze and process large-scale Web logs. The results show that this approach greatly improves computing capacity and running speed.

First, a Eucalyptus private cloud platform is built. A Hadoop cluster is then deployed on it, combining the rapid virtual machine creation offered by Eucalyptus with the distributed processing advantages of Hadoop.

Second, MapReduce programs are used to analyze the Web logs of an online education website. The analysis yields indicators such as the number of visitors, the number of IP addresses, the bounce rate, the average visit duration, traffic sources, and the number of pages, and the results are presented to the user through visualization. In addition, an improved parallel Apriori algorithm is used to mine association rules from the Web logs, revealing the correlations between the pages of the website. With these results, website management and operations staff can better understand the site and can, for example, adjust the site structure, implement effective marketing strategies, and make personalized recommendations to users.

Finally, the distributed environment and the single-machine environment are tested and compared. The results show that, when processing large volumes of Web log data, the distributed environment performs much better than the single-machine environment. The improved parallel Apriori algorithm is also compared with single-machine Apriori, and the results show that it performs better in running time and in CPU and memory utilization.
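To make the MapReduce-style log analysis described above more concrete, the following is a minimal single-process sketch in Python, not the thesis's actual Hadoop implementation. The record layout `(ip, session_id, page)`, the sample data, and the function names `map_record`, `shuffle`, `reduce_session`, and `summarize` are all assumptions made for illustration. It shows how a map step, a grouping step, and a reduce step could yield the number of visits, the number of IP addresses, and the bounce rate.

```python
from collections import defaultdict

# Hypothetical parsed log records: (ip, session_id, page).
# The field layout and the values are illustrative assumptions.
LOGS = [
    ("10.0.0.1", "s1", "/index"),
    ("10.0.0.1", "s1", "/course/1"),
    ("10.0.0.2", "s2", "/index"),
    ("10.0.0.3", "s3", "/course/1"),
    ("10.0.0.3", "s3", "/course/2"),
    ("10.0.0.3", "s4", "/index"),
]


def map_record(record):
    """Map step: emit one (session_id, (ip, page)) pair per log record."""
    ip, session_id, page = record
    yield session_id, (ip, page)


def shuffle(pairs):
    """Group mapped values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_session(session_id, values):
    """Reduce step: summarize one session (one visit)."""
    return {
        "session": session_id,
        "ips": {ip for ip, _ in values},
        "page_views": len(values),
    }


def summarize(records):
    """Run map, shuffle, and reduce, then aggregate site-level indicators."""
    pairs = (kv for record in records for kv in map_record(record))
    sessions = [reduce_session(k, v) for k, v in shuffle(pairs).items()]
    visits = len(sessions)
    # A "bounce" is taken here to mean a visit with exactly one page view.
    bounces = sum(1 for s in sessions if s["page_views"] == 1)
    return {
        "page_views": sum(s["page_views"] for s in sessions),
        "visits": visits,
        "unique_ips": len(set().union(*(s["ips"] for s in sessions))),
        "bounce_rate": bounces / visits if visits else 0.0,
    }


if __name__ == "__main__":
    print(summarize(LOGS))
```

In a real Hadoop job the grouping step is performed by the framework across nodes. Keying on the session identifier is one possible design choice, picked here because it keeps the bounce-rate computation local to a single reducer.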
Keywords/Search Tags:cloud computing, Eucalyptus, Hadoop, Log analysis, MapReduce