Design And Implementation Of A User Behavior System For Query Logs Based On Spark

Posted on: 2021-03-04
Degree: Master
Type: Thesis
Country: China
Candidate: Y F Zhang
Full Text: PDF
GTID: 2438330605963799
Subject: Engineering
Abstract/Summary:
As the Internet develops rapidly, the number of search engine users keeps growing, and the volume of generated log data is growing explosively with it. The value contained in massive search logs has attracted wide attention from search and Internet enterprises. To take a leading position in the future market and capture the value of this data, enterprises have begun to study and analyze user logs in order to identify users' search intentions and interest preferences, mine their behavior characteristics, and thereby provide accurate, personalized services.

However, enterprises face two challenges when dealing with massive user logs. The first is computation. In-depth analysis of user behavior relies on a large number of data mining algorithms, and real-time analysis of user behavior requires high computing speed, low latency, and high fault tolerance. Traditional single-machine processing is far from meeting the requirements of massive data processing, and running many iterative computations and structured data stream processing on a MapReduce cluster introduces substantial latency, which also fails to meet the system's requirements. The second is storage. Because traditional relational databases have limited scalability, they cannot keep up with the continuously growing demand for data storage.

To address these problems, and after a review of the relevant literature and a detailed analysis of user requirements, this thesis designs a Spark-based user behavior system for query logs. The system is divided into four modules: log collection, log storage, log analysis, and log visualization. The log collection module uses the distributed log collection system Flume to collect query logs from each server. The log storage module stores the logs collected by Flume in HBase and Kafka. The log analysis module is the most important one and is divided into real-time statistical analysis, offline statistical analysis, and offline data mining analysis. Real-time statistical analysis uses Structured Streaming to process the log data stored in the Kafka cluster and compute statistics on real-time hot topics and total topics. Offline statistical analysis uses Spark SQL to process offline data in the Hive data warehouse and stores the results in a MySQL database; it covers user keyword statistics, user query log metrics, statistics on result rank and click count, and URL click ranking. Offline data mining analysis uses the naive Bayes and K-Means algorithms in the MLlib library to classify and cluster users' query topics. The log visualization module uses ECharts charts and the Spring Boot framework to visualize the results of the log analysis module, so that business personnel can clearly grasp the results of user behavior analysis.

With this user behavior analysis system, enterprises can gather statistics on user behavior more efficiently and mine user intent, which improves their market competitiveness.
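
The hot-topic and URL click ranking statistics described in the abstract can be illustrated with a minimal sketch. This is plain Python standing in for the aggregation logic only; it is not the Structured Streaming or Spark SQL code of the thesis. The record layout (QueryRecord and its fields) and the function names hot_topics and url_click_ranking are assumptions made for illustration.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple, Tuple


class QueryRecord(NamedTuple):
    """One query-log entry. The field layout is assumed, not taken from the thesis."""
    timestamp: int  # seconds since some reference point
    user_id: str
    keyword: str    # the query text
    rank: int       # position of the clicked result
    url: str        # clicked URL


def hot_topics(records: Iterable[QueryRecord], window_start: int,
               window_seconds: int, top_n: int) -> List[Tuple[str, int]]:
    """Count keywords in [window_start, window_start + window_seconds)
    and return the top_n most frequent (keyword, count) pairs."""
    window_end = window_start + window_seconds
    counts = Counter(
        r.keyword for r in records
        if window_start <= r.timestamp < window_end
    )
    return counts.most_common(top_n)


def url_click_ranking(records: Iterable[QueryRecord],
                      top_n: int) -> List[Tuple[str, int]]:
    """Rank URLs by how many log entries clicked them."""
    counts = Counter(r.url for r in records)
    return counts.most_common(top_n)
```

In a Spark deployment the same grouping and counting would be expressed as a windowed aggregation over the Kafka stream (real-time case) or as a GROUP BY query over the Hive tables (offline case); the sketch only shows what is being counted.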
Keywords/Search Tags: Query logs, User behavior analysis, Spark, MLlib