Data Mining And User Behavior Analysis Based On The Massive Query Log

Posted on: 2014-02-15
Degree: Master
Type: Thesis
Country: China
Candidate: T T Zhou
Full Text: PDF
GTID: 2248330398471579
Subject: Computer technology
Abstract/Summary:
With the development of the Internet and search engine technology, the amount of information on the web is increasing rapidly, and search engines have become the first choice of most users for accessing online information. The interaction between users and a search engine generates a large volume of query logs, which continues to grow rapidly. Because these logs are rich in information about user behavior, companies study them to understand and attract users. The wide adoption of distributed technology has made storing and processing massive logs far more practical. IT companies now place more emphasis on their query logs, hoping to analyze and mine the user behavior buried in the data promptly and accurately, so as to improve user satisfaction with search results and make the company more competitive.

This thesis takes the massive query log as its subject of study. The main contributions are as follows:

(1) It studies log preprocessing techniques, including data cleaning, user identification, session identification, path completion, transaction identification, and the related algorithms, and then combines these algorithms with distributed computing to implement Hadoop-based log preprocessing for data mining.

(2) It designs a user log mining system that takes into account the characteristics of massive logs and the fact that traditional methods are hardly applicable to user behavior analysis on search engines. To address this, the thesis proposes a MapReduce-based method for mining massive logs, builds a user behavior model from the query words, clicked URLs, and user IDs in the log, represents users as feature vectors, and provides a formula for computing user similarity. It also analyzes the feasibility of applying distributed computing techniques to the K-means algorithm and implements the procedure. The evaluation shows that the algorithm clusters users effectively and performs relatively well on massive data.

(3) It analyzes user behavior in terms of: the volume of the log, the number of users, and the relationship between them; the number, length, character composition, and frequent patterns of keywords; the number and depth of clicked URLs and the most common URLs; and the correlation between the URL rank returned by the search engine and the order of user clicks. This multi-perspective analysis characterizes user behavior and provides a reference for companies seeking to improve search results and user experience.
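As a rough illustration of contribution (2), one K-means iteration can be phrased as a map step (assign each user to the most similar centroid) followed by a reduce step (average the vectors in each group). The abstract gives neither the feature construction nor the similarity formula, so this sketch makes its own assumptions: users are sparse dictionaries of weights keyed by query words and clicked URLs, and cosine similarity stands in for the thesis's formula. All function names and sample data are hypothetical, and the code runs in a single process rather than on Hadoop.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts.

    Assumption: a stand-in for the thesis's user-similarity formula.
    """
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def map_assign(user_id, vec, centroids):
    """Map step: emit (index of the most similar centroid, user vector)."""
    best = max(range(len(centroids)), key=lambda i: cosine(vec, centroids[i]))
    return best, vec


def reduce_mean(vectors):
    """Reduce step: average all vectors assigned to one centroid."""
    total = defaultdict(float)
    for vec in vectors:
        for k, w in vec.items():
            total[k] += w
    return {k: w / len(vectors) for k, w in total.items()}


def kmeans_iteration(users, centroids):
    """One K-means pass: map, group by key (the shuffle), then reduce."""
    groups = defaultdict(list)
    for user_id, vec in users.items():
        key, value = map_assign(user_id, vec, centroids)
        groups[key].append(value)
    # A centroid that attracted no users is kept unchanged.
    return [reduce_mean(groups[i]) if i in groups else centroids[i]
            for i in range(len(centroids))]


# Hypothetical sample data: three users and two initial centroids.
users = {
    "u1": {"q:hadoop": 1.0, "url:apache.org": 1.0},
    "u2": {"q:hadoop": 1.0},
    "u3": {"q:recipes": 1.0, "url:food.example": 1.0},
}
centroids = [{"q:hadoop": 1.0}, {"q:recipes": 1.0}]
new_centroids = kmeans_iteration(users, centroids)
```

In a real MapReduce job the grouping loop would be handled by the framework's shuffle phase, and the driver would repeat the iteration until the centroids stop changing.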
Keywords/Search Tags: massive log, data mining, distributed, K-means, MapReduce