The Design And Implementation Of User Behavior Analysis System Based On Spark

Posted on: 2019-04-25 | Degree: Master | Type: Thesis
Country: China | Candidate: L Yin | Full Text: PDF
GTID: 2428330566470923 | Subject: Computer technology
Abstract/Summary:
While the Internet brings opportunities and convenience to government offices, it also brings risks and challenges. Enterprises need to learn users' access behavior and interest preferences by analyzing network logs. However, single-machine log processing systems have long been unable to keep up with ever-growing log data. The large-scale data processing framework Hadoop emerged in 2006 to address this, and Spark has since gradually replaced MapReduce as the preferred execution engine on Hadoop. MLlib (Machine Learning Library), an important component of Spark, covers a variety of commonly used machine learning algorithms and data tools. Applying the mining algorithms in MLlib to sequential pattern mining reveals users' Internet behavior habits, which helps administrators optimize web services and improve users' online experience.

The data in raw web logs, however, is usually incomplete, redundant, or even wrong. Analyzing and mining such data directly gives poor results, so log collection and preprocessing must meet higher requirements. To address these problems, and after studying the requirements of user behavior analysis, this thesis designs and develops a user behavior analysis system based on Spark and its machine learning library MLlib. The main research contents are:

1. An extensible user behavior analysis system framework based on Spark/MLlib is built. The framework combines log collection, log preprocessing, data mining, visual display, and system configuration management. The raw log data gathered by the log collection module is pre-analyzed and processed into intermediate data in the required format, from which the data mining module mines sequential patterns. Users are then classified by comparing and analyzing their behavior sequence patterns, and the results are presented through the visualization module.

2. A log collection and storage strategy based on hash map partitioning is designed. The strategy is keyed on the user's IP address or user ID. Partition information is obtained by hashing the key and taking the result modulo the number of partitions; the log data is then stored separately according to the partition information, and the corresponding Hive partition table is created. Experimental results show that this strategy can effectively improve the retrieval efficiency of large-scale log data.

3. A data preprocessing algorithm based on mixed-threshold session recognition is proposed. The algorithm preprocesses data through data cleaning, user identification, session recognition, and behavior interpretation. Session recognition combines a dynamic session length with an access interval threshold, which improves the accuracy of user session division and provides effective data support for the subsequent data mining module.
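The hash-and-modulo partitioning described for the log storage strategy can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function names, the record layout (a dict with an `ip` field), the partition count, and the choice of CRC32 as the hash are all assumptions made for illustration. In a real deployment the resulting index would presumably become the value of a Hive partition column.

```python
import zlib

NUM_PARTITIONS = 16  # hypothetical partition count, not taken from the thesis


def partition_for(user_key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a user key (IP address or user ID) to a partition index.

    CRC32 is used here only because it is stable across runs, unlike
    Python's built-in hash() for strings; the thesis does not name a hash.
    """
    return zlib.crc32(user_key.encode("utf-8")) % num_partitions


def partition_logs(records, num_partitions: int = NUM_PARTITIONS):
    """Group log records into buckets keyed by partition index.

    Each record is assumed to be a dict with an "ip" field.
    """
    buckets = {}
    for rec in records:
        index = partition_for(rec["ip"], num_partitions)
        buckets.setdefault(index, []).append(rec)
    return buckets
```

Because all records for one user key land in the same bucket, a lookup for that user only needs to scan one partition rather than the whole log store, which is the retrieval benefit the abstract reports.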
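The mixed-threshold idea behind session recognition can be sketched as follows, assuming one user's request timestamps in seconds. The abstract does not say how the dynamic session length is computed, so this sketch substitutes a fixed cap `max_length`; that cap, the `max_gap` default, and the function name are hypothetical. A new session starts when either the gap since the previous request or the total span of the current session exceeds its threshold.

```python
def split_sessions(timestamps, max_gap=600, max_length=1800):
    """Split one user's request timestamps (seconds) into sessions.

    A session is closed when the gap to the previous request exceeds
    max_gap, or when the time since the session's first request exceeds
    max_length. Both defaults are illustrative assumptions.
    """
    sessions = []
    current = []
    for ts in sorted(timestamps):
        if current and (ts - current[-1] > max_gap or ts - current[0] > max_length):
            sessions.append(current)
            current = []
        current.append(ts)
    if current:
        sessions.append(current)
    return sessions


# Gap rule: 200 -> 1000 and 1100 -> 3000 both exceed the 600 s gap.
print(split_sessions([0, 100, 200, 1000, 1100, 3000]))
# Length rule: every gap is 500 s, but 2000 s exceeds the 1800 s cap.
print(split_sessions([0, 500, 1000, 1500, 2000]))
```

Using both thresholds catches two failure modes of a single-threshold split: long idle periods inside one "session", and steady browsing that would otherwise be treated as one unrealistically long session.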
Keywords/Search Tags:Spark, MLlib, Log Collection, Preprocessing, Data Mining, User Behavior Analysis