Data Mining And User Behavior Analysis Based On The Massive Query Log

Posted on:2014-02-15

Degree:Master

Type:Thesis

Country:China

Candidate:T T Zhou

Full Text:PDF

GTID:2248330398471579

Subject:Computer technology

Abstract/Summary:

With the development of the Internet and search engine technology, the amount of information on the web is increasing rapidly, and search engines have become the first choice of most users for accessing online information. The interaction between users and a search engine generates a large and still rapidly growing volume of query logs. Because these logs are rich in information about user behavior, companies study them to understand and attract users. The wide adoption of distributed technology has made storing and processing massive logs convenient. IT companies now place more emphasis on their query logs, hoping to analyze and mine the user behavior buried in the data in a timely and accurate way, so as to improve user satisfaction with search results and make the company more competitive.

This thesis takes the massive query log as its subject of study. The main contributions are as follows:

(1) It studies log preprocessing techniques, including data cleaning, user identification, session identification, path completion, transaction identification, and the related algorithms, and then combines these algorithms with distributed computing to implement Hadoop-based log preprocessing for data mining.

(2) It designs a user log mining system that takes into account the characteristics of massive logs and the problem that traditional methods are hardly applicable to user behavior analysis on search engines. To address this, the thesis proposes a method for mining massive logs based on the MapReduce framework. It builds a user behavior model from the query words, clicked URLs, and user IDs in the log, represents users by feature vectors, and provides a formula for computing user similarity. It also analyzes the feasibility of applying distributed computing techniques to the K-means algorithm and implements the procedure. The evaluation shows that the algorithm clusters users effectively and performs relatively well on massive data.

(3) It analyzes user behavior in terms of: the volume of the log, the number of users, and the relationship between them; the number, length, character composition, and frequent patterns of keywords; the number and depth of clicked URLs and the most common URLs; and the correlation between the URL rank returned by the search engine and the order of user clicks. This multi-perspective analysis characterizes user behavior and provides a reference for companies seeking to improve search results and user experience.
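
As an illustration of the K-means step described in contribution (2), the following is a minimal single-machine sketch of how one K-means iteration can be split into a map phase and a reduce phase. It is not the thesis's implementation: the function names, the use of squared Euclidean distance, and the toy vectors are all assumptions made for this example, and the thesis's own user-similarity formula is not reproduced here.

```python
from collections import defaultdict


def nearest(point, centroids):
    """Return the index of the centroid closest to `point`.

    Assumption: squared Euclidean distance stands in for the
    thesis's (unspecified) user-similarity formula.
    """
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def map_phase(points, centroids):
    """Map: emit (cluster index, (vector, 1)) for every user vector."""
    for point in points:
        yield nearest(point, centroids), (point, 1)


def reduce_phase(pairs, centroids):
    """Reduce: sum the vectors of each cluster and average them."""
    sums = {}
    counts = defaultdict(int)
    for idx, (vec, n) in pairs:
        if idx in sums:
            sums[idx] = [a + b for a, b in zip(sums[idx], vec)]
        else:
            sums[idx] = list(vec)
        counts[idx] += n
    # A cluster that received no points keeps its previous centroid.
    return [
        [s / counts[i] for s in sums[i]] if i in sums else list(centroids[i])
        for i in range(len(centroids))
    ]


def kmeans(points, centroids, max_iter=20, tol=1e-9):
    """Repeat map and reduce until the centroids stop moving."""
    for _ in range(max_iter):
        updated = reduce_phase(map_phase(points, centroids), centroids)
        shift = max(
            sum((a - b) ** 2 for a, b in zip(old, new))
            for old, new in zip(centroids, updated)
        )
        centroids = updated
        if shift <= tol:
            break
    return centroids


if __name__ == "__main__":
    # Hypothetical two-dimensional user feature vectors.
    users = [[0, 0], [0, 2], [10, 10], [10, 12]]
    print(kmeans(users, [[0, 0], [10, 10]]))
```

In an actual Hadoop job, the map phase would run in parallel over splits of the log and the reduce phase would receive the pairs grouped by cluster index; the sketch only mirrors that division of work in a single process.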

Keywords/Search Tags:

massive log, data mining, distributed, K-means, MapReduce

Related items

1	Research And Application Of Hadoop Distributed Clustering Mining Method Based On Virtual Machine
2	Researches On Large-Scale Mining Algorithms Of User Behaviors In Internet
3	Research Of Massive Data Processing In The Vessel Monitoring System
4	Research On K-Means Algorithm Based On MapReduce
5	The Design And Implementation Of A MapReduce-Based Distributed Programming Framework
6	Data Processing Of Complex Structured Data Based On MapReduce
7	The Research Of User Behavior Mining System And Implementation Based On The Massive Log
8	Design And Implementation Of Similarity Self-Connection Algorithm For Massive Data Sets Based On MapReduce
9	Research On Distributed Fast Clustering Algorithm Based On MapReduce
10	The Research Of Data Mining Based On Cloud Platform