Query Logs Analysis Based On Big Data Platform

Posted on: 2018-03-25
Degree: Master
Type: Thesis
Country: China
Candidate: S D Zhou
GTID: 2348330569986447
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of Internet technology, the volume of information online has grown explosively, and search engines have become the main tool people use to find useful information in it. The query log records users' search behavior and reflects their interests. Research on query logs not only helps users find information more easily and accurately, but also improves search quality. It also matters for applications such as precisely targeted search engine advertising. However, query log research faces great challenges because the logs are sparse, massive, non-standardized, and unique, and the constant emergence of new popular vocabulary makes these problems worse.

This thesis studies large-scale web query log classification on a big data platform, covering user cluster analysis and query intent classification. For user clustering, to ease the sparsity of the query log, the thesis extends each query with a URL-Key derived from the query string and the URLs the user clicked, which enriches the query's features to some extent. On this basis, Word2Vec is used to build a word vector model, and each user is represented as a vector. We then improve Ren's similar K-Means hierarchical clustering method (SKHC) and implement the improved method on the Spark platform for user cluster analysis. The experimental results show that the proposed clustering algorithm outperforms both SKHC and classic K-Means for user clustering based on query logs.

For query intent classification, the thesis uses a committee-based active learning algorithm to classify the intent of the queries in the log. The decision tree and naive Bayes algorithms from MLlib, Spark's machine learning library, are chosen to build the committee. After studying the query mechanism of committee-based active learning, we found that the normalized entropy query-by-bagging (nEQB) mechanism requires experts to label a large sample set, and that this set contains many similar samples, which leads to problems such as redundant labeling. We therefore use an improved K-Means clustering to optimize the query strategy and implement this strategy on the Spark platform. The experimental results show that the proposed active learning classifier performs similarly to active learning with the nEQB query mechanism, while reducing labeling labor costs and improving the efficiency of the algorithm.
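
The committee-based query step summarized above can be illustrated with a minimal sketch. It is not the thesis's implementation: it runs in plain Python rather than on Spark, the function names `normalized_vote_entropy` and `select_queries` are hypothetical, the intent labels are made up, and the cluster ids are assumed to come from a K-Means run done beforehand. The normalization used here (dividing the vote entropy by the log of the number of distinct labels the committee predicted) is one common reading of nEQB and may differ from the thesis's definition. The diversity rule (at most one queried sample per cluster) is a simplified stand-in for the thesis's cluster-optimized query strategy.

```python
import math
from collections import Counter


def normalized_vote_entropy(votes):
    """Disagreement of a committee on one sample, scaled to [0, 1].

    votes: list of labels, one per committee member.
    Assumption: normalize by log(number of distinct predicted labels).
    """
    counts = Counter(votes)
    if len(counts) < 2:
        # Full agreement: entropy is 0 and log(1) = 0, so define the score as 0.
        return 0.0
    total = len(votes)
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(len(counts))


def select_queries(committee_votes, cluster_ids, budget):
    """Pick up to `budget` sample indices to send to a human labeler.

    committee_votes: list of vote lists, one per unlabeled sample.
    cluster_ids: cluster id of each sample (assumed precomputed, e.g. by K-Means).
    Samples are ranked by disagreement; a cluster contributes at most one
    sample, so near-duplicate queries are not labeled repeatedly.
    """
    ranked = sorted(
        range(len(committee_votes)),
        key=lambda i: normalized_vote_entropy(committee_votes[i]),
        reverse=True,
    )
    chosen, seen_clusters = [], set()
    for i in ranked:
        if len(chosen) >= budget:
            break
        if cluster_ids[i] in seen_clusters:
            continue  # a similar sample is already queued for labeling
        chosen.append(i)
        seen_clusters.add(cluster_ids[i])
    return chosen


if __name__ == "__main__":
    # Hypothetical intent labels voted by a four-member committee.
    votes = [
        ["nav", "nav", "nav", "nav"],
        ["nav", "info", "nav", "info"],
        ["nav", "info", "info", "nav"],
        ["nav", "nav", "nav", "info"],
    ]
    clusters = [0, 1, 1, 2]
    print(select_queries(votes, clusters, budget=2))
```

In this toy input, samples 1 and 2 are equally uncertain but fall in the same cluster, so only one of them is selected and the remaining budget goes to a sample from a different cluster. This is the sense in which clustering can shrink the set that experts must label.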
Keywords/Search Tags: query log, Spark, active learning, user clustering, query classification