
Research And Implementation Of Web Log Storage And Analysis System Based On Hadoop

Posted on: 2019-02-11
Degree: Master
Type: Thesis
Country: China
Candidate: Q C Ban
GTID: 2348330545455585
Subject: Cryptography
Abstract/Summary:
With the rapid development of Internet technology, information services on the Web have become increasingly abundant. Mining the latent information about user access behavior in Web logs is of great significance for Web site optimization, business expansion, personalized user services and so on. However, as the volume of Web log data grows, both the storage of small files and the mining of association rules leave room for optimization when massive numbers of small log files must be handled. Specifically, when large numbers of small files are stored, merging them without asynchronous processing blocks the merge step. In addition, the efficiency of association rule mining suffers from the dispersion of the data.

Building on the Hadoop Distributed File System (HDFS), this thesis designs a small-file storage scheme tailored to the characteristics of Web log data, proposes an improved Web log mining algorithm based on clustering for association rules, and implements an efficient log mining system. The main research results are as follows.

First, because storing small Web files in HDFS consumes a large amount of memory and makes reads slow, this thesis proposes a strategy that asynchronously monitors a task queue of small files, together with a merge scheme that decouples the file upload module from the file merge module. The scheme cuts the time for uploading and downloading small files by more than 60%, and reduces memory consumption on the master node by more than 40% compared with existing solutions.

Second, the thesis proposes an FP-Growth algorithm based on clustering. It avoids the repeated I/O requests of the Apriori algorithm and mitigates the large memory consumption that FP-Growth incurs when building the FP-tree. In log association rule mining, the execution time is reduced by more than 50%, and the number of association rules found increases by more than 60%.

Finally, in order to observe the relationships between pages and to configure mining parameters dynamically, a Web log mining system is implemented. During log upload, the system combines symmetric encryption with a digital authentication algorithm, which protects the logs in transit. Combined with the improved HDFS storage structure and the improved mining algorithm, the system lets users set the support and confidence thresholds and observe task execution and mining results in real time.
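To make the support and confidence thresholds mentioned above concrete, the following is a minimal sketch that computes both measures for page-pair rules over a handful of made-up sessions. It is a brute-force illustration only, not the thesis's clustering-based FP-Growth algorithm; the function name `mine_pair_rules`, the session data and the threshold values are all assumptions chosen for the example.

```python
from collections import Counter
from itertools import combinations


def mine_pair_rules(sessions, min_support, min_confidence):
    """Return (antecedent, consequent, support, confidence) for page pairs.

    Hypothetical helper: counts pairs directly instead of building an
    FP-tree, so it only illustrates how the two thresholds filter rules.
    """
    n = len(sessions)
    page_counts = Counter()
    pair_counts = Counter()
    for session in sessions:
        pages = sorted(set(session))          # ignore repeat visits in a session
        page_counts.update(pages)
        pair_counts.update(combinations(pages, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n                   # share of sessions containing both pages
        if support < min_support:
            continue
        for x, y in ((a, b), (b, a)):
            confidence = count / page_counts[x]   # P(y visited | x visited)
            if confidence >= min_confidence:
                rules.append((x, y, support, confidence))
    return sorted(rules)


# Made-up sessions: each list is the set of pages one visitor requested.
sessions = [
    ["/index", "/news", "/login"],
    ["/index", "/news"],
    ["/index", "/login"],
    ["/news", "/about"],
]
rules = mine_pair_rules(sessions, min_support=0.5, min_confidence=0.6)
for antecedent, consequent, support, confidence in rules:
    print(f"{antecedent} -> {consequent}  support={support:.2f}  confidence={confidence:.2f}")
```

Raising `min_support` discards page pairs that co-occur in few sessions, while raising `min_confidence` discards rules whose antecedent page is often visited without the consequent page.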
Keywords/Search Tags: HDFS, Web log mining, Clustering partitioning, FP-Growth algorithm