Design And Implementation Of A Web Log Analytics Platform Based On Big Data And Machine Learning

Posted on: 2021-01-18
Degree: Master
Type: Thesis
Country: China
Candidate: X Su
Full Text: PDF
GTID: 2518306308976599
Subject: Computer technology
Abstract/Summary:
With the rapid development of Web technology, the number of Internet users has grown rapidly. As these users consume Internet services, they generate a huge volume of web log data that holds considerable commercial and practical value. At the same time, users browsing the Web must actively search for the information they need, relying on their own experience, and often fail to find it even after tedious effort, as if lost in the mass of information. Big data technology and data mining address this problem to a large extent. Building on both, this thesis studies the following aspects:

Big data and distributed technology are studied, with an in-depth focus on the Hadoop/Spark big data platform. Google set the trend at the start of the big data era, and Hadoop, the open-source distributed platform modeled on Google's published MapReduce and file system designs, has grown into a complete and widely applied ecosystem whose most commonly used components are the MapReduce (MR) programming model and HDFS. Spark is a general-purpose parallel framework in the style of Hadoop MapReduce, open-sourced by the AMP Lab at the University of California, Berkeley. Spark retains the advantages of Hadoop MapReduce but, unlike MapReduce, can keep the intermediate output of a job in memory, so it does not need to read from or write to HDFS between steps. Spark is therefore better suited to iterative algorithms such as those used in data mining and machine learning.

A prediction model derived from Word2Vec, a deep-learning-based NLP technique, is studied. First, word2vec trains efficiently on vocabularies of millions of words and on datasets at the scale of hundreds of millions. Second, its training result, the word vector, measures the similarity between words well. In log mining, the similarity of logs can be explored through session sequences. For the method of generating session sequences and of selecting and training on the context within a sequence, this thesis studies the related word-frequency weighting algorithms and then improves word2vec, which provides a theoretical basis for similarity calculation and prediction for each log.

A log analysis platform based on Spark/HDFS is designed in detail. Building on the study of the relevant distributed big data platforms and algorithms, this thesis designs a log mining and analysis platform based on Spark/HDFS. The platform consists of three modules: log preprocessing, log storage, and log mining. The log preprocessing module is implemented on Spark. The log storage module uses HDFS in Hadoop. The log mining module is implemented with the improved Word2Vec algorithm. Because processing is distributed, the algorithm flow is designed to run on the distributed platform.

Finally, the function and performance of the Web log analysis platform are tested. Comparison with a stand-alone system and with other models shows that the system has clear advantages in processing large-scale web logs.
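The abstract does not spell out how session sequences are generated or how context is selected, so the following is only a minimal single-machine sketch of the general idea, not the thesis's improved algorithm. It assumes log records of the hypothetical form (user, timestamp in seconds, URL), a hypothetical 30-minute inactivity timeout to split sessions, and a plain skip-gram style context window; none of these details come from the source.

```python
from collections import defaultdict

# Assumed inactivity timeout for splitting sessions (not stated in the thesis).
SESSION_GAP_SECONDS = 30 * 60


def build_sessions(records, gap=SESSION_GAP_SECONDS):
    """Group (user, timestamp, url) records into per-user session sequences."""
    by_user = defaultdict(list)
    for user, ts, url in records:
        by_user[user].append((ts, url))

    sessions = []
    for user in sorted(by_user):
        events = sorted(by_user[user])  # order each user's requests by time
        current = [events[0][1]]
        for (prev_ts, _), (ts, url) in zip(events, events[1:]):
            if ts - prev_ts > gap:
                # A long pause closes the current session and starts a new one.
                sessions.append(current)
                current = []
            current.append(url)
        sessions.append(current)
    return sessions


def context_pairs(session, window=2):
    """Return (center, context) pairs within a window, as in skip-gram training."""
    pairs = []
    for i, center in enumerate(session):
        lo = max(0, i - window)
        hi = min(len(session), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, session[j]))
    return pairs


# Hypothetical sample records for illustration only.
records = [
    ("u1", 0, "/home"),
    ("u1", 60, "/search"),
    ("u1", 120, "/item/42"),
    ("u1", 4000, "/home"),
    ("u2", 10, "/home"),
    ("u2", 30, "/cart"),
]
sessions = build_sessions(records)
pairs = context_pairs(sessions[0], window=1)
```

In a setup like the one the abstract describes, each session sequence would play the role of a sentence and each URL the role of a word, so the sequences could be handed to a Word2Vec trainer; the word-frequency weighting the thesis adds is not reproduced here.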
Keywords/Search Tags:Hadoop/Spark, Web log, data mining, Word2Vec