The Design And Implementation Of Massive Search Logs Analysis Platform Based On Hadoop

Posted on: 2014-02-05
Degree: Master
Type: Thesis
Country: China
Candidate: L Zhao
GTID: 2248330395999997
Subject: Computer application technology
Abstract/Summary:
Since the late 20th century, as the Internet industry has grown faster and human activities have become increasingly information-driven, the exchange of information has become more frequent. At the same time, the spread of the Internet and of technology platforms has made human information activities far more convenient, while users face a growing variety of web pages every day. Search engine technology emerged to lead people out of this maze of information, helping millions of Internet users retrieve important information conveniently, and it has greatly changed the way people work and live.

Recently, as more attention has been paid to the behavior of network users, search engine research is no longer limited to the engine itself. A systematic study of network user behavior not only helps to capture the explicit needs of users; it also helps to discover their hidden needs.

Another great challenge brought by the information explosion of the Internet age is massive data processing, which is not only a serious problem for the traditional database server storage model but also a severe test of server CPU and I/O performance. Hadoop and Hive are well suited to solving such problems.

Based on the above, and after reviewing the relevant literature and analyzing search logs and common log models, this paper designs an analysis platform for processing massive search engine logs. The platform has four parts: a data acquisition and pre-processing module, a data storage module, a data mining analysis module, and a cluster management module. In the data mining analysis module, we put forward a user-behavior-based pattern mining algorithm to process and analyze search logs; the cluster management module handles the monitoring and management of the cluster.
The platform follows the data mining process as its road map, uses the massive data analysis tool Hadoop as its platform and MapReduce as its programming model, and handles the large volume of logs with Hive, a simple and practical SQL-like tool, and the HBase massive database. At the same time, mining patterns are decomposed and matched on each distributed server, and the partial results are combined into the overall mining result. This helps to relieve network and server performance bottlenecks and shows the advantage of reduction when mining asynchronously over asynchronous data.

Finally, the platform is tested in an experimental environment. The test data are three samples of search logs (Sample Data, Daily Data, Monthly Data) provided by Sogou Lab. Based on these data, user search behavior is analyzed in detail from the following aspects: user query topics, user hits, URL sorting, and user session analysis. This paper also optimizes the performance of the platform and compares the system run time before and after optimization. The experimental data show that the log platform designed in this paper is stable and effective.
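As a rough illustration of the map/reduce programming model applied to the query-topic analysis described above, the following is a minimal single-machine sketch in plain Python. It is not the thesis's implementation and does not use Hadoop. The tab-separated log layout, the sample records, and the function names (`map_query`, `shuffle`, `reduce_count`, `run_job`) are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical log layout (an assumption, not taken from the thesis):
# timestamp <TAB> user_id <TAB> query <TAB> clicked_url
SAMPLE_LOG = [
    "00:00:01\tu1\thadoop tutorial\thttp://example.com/a",
    "00:00:05\tu2\thive sql\thttp://example.com/b",
    "00:01:10\tu1\thadoop tutorial\thttp://example.com/c",
    "malformed line",
    "00:02:00\tu3\thive sql\thttp://example.com/b",
    "00:03:30\tu2\thadoop tutorial\thttp://example.com/a",
]


def map_query(line):
    """Map step: emit (query, 1) for each well-formed log line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 4:
        return  # pre-processing: silently drop malformed records
    yield fields[2], 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_count(key, values):
    """Reduce step: sum the counts collected for one query."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle, and reduce in sequence on one machine."""
    mapped = (pair for line in lines for pair in map_query(line))
    grouped = shuffle(mapped)
    return dict(reduce_count(key, values) for key, values in grouped.items())


query_counts = run_job(SAMPLE_LOG)
# Sort by descending frequency, breaking ties alphabetically.
top_queries = sorted(query_counts.items(), key=lambda kv: (-kv[1], kv[0]))
print(top_queries)
```

On a real cluster the map and reduce functions would run in parallel on separate nodes and the framework would perform the shuffle; the sketch only shows how the three phases divide the work of counting queries.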
Keywords/Search Tags: Massive data, MapReduce, Log analysis, User behavior