
The Design And Implementation Of A Realtime And Distributed Web Log Analysis System

Posted on: 2016-06-02
Degree: Master
Type: Thesis
Country: China
Candidate: Q T Zhou
Full Text: PDF
GTID: 2308330473954393
Subject: Communication and Information System
Abstract/Summary:
Web logs are the data records generated by a Web server, and they contain important information about a website's operation. Through log analysis we can obtain page views, analyse user behaviour, and compute the ranking of users' search keywords, supporting the digital operation of the enterprise. A small or medium-sized website with about one million page views a day usually produces more than 1 GB of log files per day, and a larger website may produce more than 10 GB per hour. When log data grows by 10 GB or even 100 GB a day, a single host can no longer meet the computing and storage demands of the data. Using distributed computing and storage technology to carry out Web log analysis has therefore become an inevitable trend.

Hadoop is currently a very popular distributed computing framework and has been widely applied in log analysis, data mining, and other fields. Its core is the MapReduce parallel computing model and the Hadoop Distributed File System (HDFS), which together manage programs, memory, and storage resources. This store-first, compute-later flow is adequate for massive data, but it still has defects in terms of performance:
(1) The log data are first stored in HDFS and then read back during computation, which inevitably introduces calculation delay and makes timeliness hard to guarantee.
(2) HDFS keeps large amounts of the original log data with redundant replicas, which is a striking waste of machine resources.
(3) It does not scale well for users, who must implement complicated MapReduce programs, so the system is hard to reuse and maintain.

Therefore, this thesis proposes a new computing flow scheme that avoids these defects of Hadoop, and designs and implements a standardized Web log analysis system. The concrete content includes:

Firstly, define the system according to the log-analysis scenario. The user configures a log model through the system, the system then generates computing tasks, and the user reads the results every minute in the form of reports.

Secondly, design the architecture of the system and optimize its computing and storage performance. The new computing flow is based on MapReduce; it reduces computation delay and improves the system's recovery ability. A new storage model is designed for the results, which improves the efficiency of data retrieval.

Thirdly, implement the functions of the system along this process, including log model management, data computation, data storage, and the Web UI. The data formats and communication flows between the modules are described, as well as the details of scheduling and task execution.

Fourthly, design two experiments to verify the function and performance of the system. The system's performance is analysed through page-view statistics for an e-commerce site; a sketch of the kind of hand-written MapReduce job such a statistic would otherwise require follows below.
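To make defect (3) and the page-view experiment concrete, here is a minimal sketch of the kind of batch MapReduce job that counts page views per URL from raw access logs: the hand-written boilerplate the proposed system replaces with configuration. This is not the thesis's code; it assumes Apache Hadoop's standard org.apache.hadoop.mapreduce API, and the class names (PageViewCount, PageViewMapper, SumReducer) and the combined-log field layout are illustrative assumptions.

    // Hedged sketch: a classic Hadoop batch job counting page views per URL.
    // Class names and log-field layout are assumptions for illustration.
    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class PageViewCount {

      // Mapper: emit (requested URL, 1) for each log line.
      public static class PageViewMapper
          extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        private final Text url = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          // Combined log format: the request line is the quoted field
          // "GET /path HTTP/1.1"; split on '"' and take the middle token.
          String[] quoted = value.toString().split("\"");
          if (quoted.length < 2) return;            // skip malformed lines
          String[] request = quoted[1].split(" ");
          if (request.length < 2) return;
          url.set(request[1]);                      // the requested path
          context.write(url, ONE);
        }
      }

      // Reducer: sum the counts for each URL.
      public static class SumReducer
          extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
          long sum = 0;
          for (LongWritable v : values) sum += v.get();
          context.write(key, new LongWritable(sum));
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "page view count");
        job.setJarByClass(PageViewCount.class);
        job.setMapperClass(PageViewMapper.class);
        job.setCombinerClass(SumReducer.class);     // local aggregation cuts shuffle volume
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // raw logs in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // results directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Because a job like this reads logs that have already been persisted to HDFS and only reports after the whole batch completes, it also illustrates defects (1) and (2) above.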
Then a comparative analysis of single-host processing and parallel processing is carried out; the test results show that parallel processing achieves better real-time performance than single-host processing (a minimal single-host baseline of the kind compared here is also sketched below).

Through its distributed computing and storage scheme, this thesis solves the real-time and reliability problems in massive log analysis, and through user-facing configuration it completes the standardized management of log analysis, effectively improving the efficiency of log analysis. The work has strong practical significance and application value in the era of big data.
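For contrast with the parallel job above, here is a hedged sketch of the single-host baseline that the comparison experiment implies: a sequential page-view tally over one log file on one machine. The class name and field layout are again illustrative assumptions, not the thesis's code.

    // Hedged sketch: single-host baseline, a sequential page-view tally.
    import java.io.BufferedReader;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.HashMap;
    import java.util.Map;

    public class SingleHostPageViews {
      public static void main(String[] args) throws IOException {
        Map<String, Long> counts = new HashMap<>();
        try (BufferedReader in = Files.newBufferedReader(Paths.get(args[0]))) {
          String line;
          while ((line = in.readLine()) != null) {
            String[] quoted = line.split("\"");
            if (quoted.length < 2) continue;          // skip malformed lines
            String[] request = quoted[1].split(" ");
            if (request.length < 2) continue;
            counts.merge(request[1], 1L, Long::sum);  // tally by requested path
          }
        }
        counts.forEach((url, n) -> System.out.println(url + "\t" + n));
      }
    }

A single reader like this is bounded by one disk and one CPU core, which is why its real-time performance falls behind as the logs approach the 10 GB-a-day scale described above.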
Keywords/Search Tags: Distributed computing, Log analysis, Hadoop, MapReduce