Design And Implementation Of A Web Log Analytics Platform Based On Big Data And Machine Learning

Posted on:2021-01-18

Degree:Master

Type:Thesis

Country:China

Candidate:X Su

Full Text:PDF

GTID:2518306308976599

Subject:Computer technology

Abstract/Summary:


With the rapid development of Web technology, the number of Internet users is growing rapidly. As these users are served, a huge volume of web log data is generated, and this data holds considerable commercial value. At the same time, users browsing the Web have to search actively for the information they need, relying on their own experience, and often fail to find it even after tedious operations, as if lost in a sea of information. Big data technology and data mining address this problem to a large extent. Based on big data and data mining technology, this paper mainly studies the following aspects:

Big data and distributed technology are studied, with an in-depth focus on the Hadoop/Spark big data platform. Google's work set the trend in the big data era, and Hadoop, the open-source distributed platform that grew out of those ideas, has developed a complete ecosystem and is widely applied; the MapReduce (MR) programming model and HDFS are its most commonly used components. Spark is an open-source, general-purpose parallel framework similar to Hadoop MapReduce, developed by the AMP Lab at the University of California, Berkeley. Spark retains the advantages of Hadoop MapReduce, but unlike MapReduce it can keep the intermediate output of a job in memory, so it does not need to read from and write to HDFS between steps. Spark is therefore better suited to iterative algorithms such as those used in data mining and machine learning.

A prediction model derived from the neural-network-based NLP technique Word2Vec is studied. First, Word2Vec trains efficiently on vocabularies of millions of words and data sets of hundreds of millions of samples. Second, its training result, the word vector, measures the similarity between words well. In log mining, session sequences can be used to explore the similarity of logs. For the generation of the session sequences and for the selection and training of context within a sequence, this paper studies related word-frequency weighting algorithms and then improves Word2Vec, which provides a theoretical basis for similarity calculation and prediction for each log.

A log analysis platform based on Spark/HDFS is designed in detail. Building on the study of the relevant distributed big data platforms and algorithms, this paper designs a log mining and analysis platform based on Spark/HDFS. The platform consists of three modules: a log preprocessing module, a log storage module, and a log mining module. The log preprocessing module is implemented on the Spark platform. The log storage module is implemented using HDFS in Hadoop. The log mining module is implemented with the improved Word2Vec algorithm. Because processing is distributed, the algorithm flow is designed so that it can run on the distributed platform.

Finally, the function and performance of the Web log analysis platform are tested. Comparison with a stand-alone system and with other models shows that the system has clear advantages in processing large-scale web logs.
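
The abstract does not say how session sequences are generated or how context is selected within a sequence. The following is only a minimal sketch of one common approach, not the thesis's method: the record layout `(user_id, timestamp, url)`, the 30-minute inactivity timeout, the window size, and the names `LOGS`, `build_sessions`, and `context_pairs` are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical log records: (user_id, ISO timestamp, requested URL).
LOGS = [
    ("u1", "2021-01-18T10:00:00", "/home"),
    ("u1", "2021-01-18T10:05:00", "/search"),
    ("u1", "2021-01-18T11:30:00", "/home"),
    ("u2", "2021-01-18T10:01:00", "/home"),
    ("u2", "2021-01-18T10:02:00", "/cart"),
    ("u2", "2021-01-18T10:03:00", "/checkout"),
]


def build_sessions(records, timeout=timedelta(minutes=30)):
    """Group records per user; start a new session after `timeout` of inactivity."""
    by_user = defaultdict(list)
    for user, ts, url in records:
        by_user[user].append((datetime.fromisoformat(ts), url))

    sessions = []
    for user in sorted(by_user):
        current, last = [], None
        for ts, url in sorted(by_user[user]):
            # A gap longer than the (assumed) timeout closes the current session.
            if last is not None and ts - last > timeout:
                sessions.append(current)
                current = []
            current.append(url)
            last = ts
        if current:
            sessions.append(current)
    return sessions


def context_pairs(session, window=1):
    """Return (center, context) pairs within `window` positions of each item."""
    pairs = []
    for i, center in enumerate(session):
        lo, hi = max(0, i - window), min(len(session), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, session[j]))
    return pairs


SESSIONS = build_sessions(LOGS)
```

Each session then plays the role of a "sentence" and each URL the role of a "word", so the resulting pairs are the kind of input a skip-gram style Word2Vec model would train on. On a cluster, the same grouping would typically be expressed as a per-user aggregation in Spark rather than an in-memory dictionary.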

Keywords/Search Tags:

Hadoop/Spark, Web log, data mining, Word2Vec


Related items

1	Research And Application Of Data Mining Technology Based On Spark In ERP System
2	Research And Application On The Parallel Algorithm In Big Data Mining
3	Design And Implementation Of Weibo Data Mining System Based On Hadoop Platform
4	Research And Design Of Data Mining System For TCM Disease Based On Cloud Computing Environment
5	The Research And Implementation Of Bayesian Classification Algorithm In The Text Based On Spark Platform
6	Rock Image Clustering Analysis Algorithm Research Based On Spark
7	Agricultural Product Price Analysis And Forecast System Design Based On Hadoop+Spark Platform
8	The Design And Implementation Of Data Mining System On Yarn
9	Parallel Data Mining Algorithm Research In Cloud
10	Analysis And Research On Energy Consumption Of Public Buildings Based On Hadoop