Web Log Mining Technology Research Based On Hadoop/MongoDB

Posted on:2015-06-17

Degree:Master

Type:Thesis

Country:China

Candidate:F Xiao

Full Text:PDF

GTID:2428330488499627

Subject:Software engineering

Abstract/Summary:

The Internet is developing rapidly. For Internet service providers, log data is a treasure trove, because logs record performance data such as service response times. By mining this data, application providers can adjust their application architecture and content sensibly, offer more targeted business services and a better user experience, and so seize opportunities in the competitive Internet industry. Modern Web log data, however, is generally very large, and traditional stand-alone processing and analysis can no longer keep up with it. The emergence and rapid development of cloud computing in recent years point to a new way of mining massive Web log data. This thesis studies Web log mining technology based on Hadoop and MongoDB, a NoSQL database. Its main work and results are as follows:

First, the thesis surveys the development and research status of Web log mining and cloud computing to determine the research direction. It introduces the basic theory of Web log mining and analyzes where Web log files are stored, the types of logs, and common log preprocessing mechanisms.

Second, the thesis introduces the Hadoop platform architecture, describing its framework, the MapReduce programming model, and its performance advantages for large-scale data processing. It also studies MongoDB in detail, covering its basics, comparing the pros and cons of unstructured databases and traditional structured databases, and analyzing MongoDB's performance advantages for unstructured data.

Third, the thesis studies and analyzes the Apriori algorithm, one of the commonly used Web log mining algorithms. Apriori, proposed by Agrawal, is a basic algorithm that finds the frequent itemsets needed to generate Boolean association rules. The algorithm scans the entire database repeatedly, however, which leads to an I/O bottleneck. To solve this problem, we propose an improved algorithm, AprioriHM, which runs Apriori in a distributed fashion to reduce full scans of the database. As the transaction database is scanned, AprioriHM divides the overall itemsets into many pieces according to their irrelevance or weak correlation. AprioriHM lowers the difficulty of implementing a distributed mining algorithm by letting Hadoop manage the details of parallel processing, and it eases the I/O bottleneck by using MongoDB, an efficient distributed database.

Fourth, an experimental environment is built with Hadoop and MongoDB, and AprioriHM is compared experimentally with the original algorithm. The results show that AprioriHM compares favorably with the traditional Apriori algorithm and scales well.
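
The level-wise search of the classic Apriori algorithm, and the repeated full scans that cause its I/O cost, can be illustrated with a minimal Python sketch. This is not the thesis's AprioriHM and uses neither Hadoop nor MongoDB: it is the textbook algorithm, with support counting written as a map step per partition and a reduce step that sums the partial counts, to suggest how the counting could be distributed. The function names, the toy sessions, and the support threshold are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def map_count(partition, candidates):
    """Map step: count candidate itemsets within one partition of transactions."""
    counts = Counter()
    for transaction in partition:
        for candidate in candidates:
            if candidate <= transaction:  # candidate is a subset of the session
                counts[candidate] += 1
    return counts


def reduce_counts(partial_counts):
    """Reduce step: sum the per-partition counts into global support counts."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return total


def apriori(partitions, min_support):
    """Return {itemset: support count} for every frequent itemset."""
    items = {item for part in partitions for t in part for item in t}
    candidates = [frozenset([item]) for item in items]
    frequent = {}
    k = 1
    while candidates:
        # Every level rescans all partitions; this is the repeated full scan.
        totals = reduce_counts(map_count(p, candidates) for p in partitions)
        level = {c: n for c, n in totals.items() if n >= min_support}
        frequent.update(level)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-itemset candidates.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: keep a candidate only if all its (k-1)-subsets are frequent.
        candidates = [
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        ]
    return frequent


# Hypothetical user sessions (sets of visited pages), split into two partitions.
partitions = [
    [frozenset({"/home", "/cart", "/pay"}), frozenset({"/home", "/cart"})],
    [frozenset({"/home", "/news"}), frozenset({"/cart", "/pay"})],
]
result = apriori(partitions, min_support=2)
for itemset, support in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), support)
```

In this sketch each pass of the `while` loop reads every transaction again, so the number of full scans grows with the size of the largest frequent itemset. That cost is what the abstract says a distributed approach aims to reduce.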

Keywords/Search Tags:

Hadoop, MongoDB, Web log, Data mining, Apriori algorithm, Log mining

Related items

1	Research On A Parallel Data Mining Algorithm Apriori
2	The Improved Apriori Algorithm Based On Hadoop Calculation Model
3	Research And Improvement Of Apriori Algorithm Based On Hadoop
4	Research Of Parallelized Distributed Association Rules Mining Algorithm Based On Hadoop
5	Research And Application Of Improved Apriori Algorithm On Hadoop
6	Research On Parallel Data Mining Based On Hadoop
7	Research On Parallel Data Mining Algorithms Based On Hadoop
8	Research On Association Rules Mining Methods Of Mass Engineering Data Based On Hadoop
9	Research Of Mining Key Technology For EMU Fault Data
10	Research On Improvement Of Apriori Algorithm Based On Hadoop Platform