Query Logs Analysis Based On Big Data Platform

Posted on:2018-03-25

Degree:Master

Type:Thesis

Country:China

Candidate:S D Zhou

Full Text:PDF

GTID:2348330569986447

Subject:Computer Science and Technology

Abstract/Summary:

With the rapid development of Internet technology, the volume of information online has exploded, and search engines have become the main tool people use to find useful information in this mass of data. The query log records users' search behavior and reflects their interests. Research on query logs not only helps users obtain information more easily and accurately but also improves search quality; it also matters for applications such as the precise delivery of search engine advertising. However, such research faces great challenges because query logs are sparse, massive, non-standardized, and unique, and the emergence of popular vocabulary has aggravated the problem. This thesis studies large-scale web query log classification on a big data platform, covering user cluster analysis and query intention classification.

In the study of user clustering, to ease the sparsity of the query log, this thesis extends each query by adding a URL-Key, which is derived from the query string and the URLs the user clicked. To a certain extent, this enriches the features of the query. On this basis, Word2Vec is used to construct a word vector model, and each user is represented as a vector. We then improve Ren's similar K-Means hierarchical clustering method (SKHC) and implement the improved method for user cluster analysis on the Spark platform. The experimental results show that the proposed clustering algorithm outperforms both the SKHC algorithm and the classic K-Means algorithm in user cluster analysis based on query logs.

In the study of query intention classification, this thesis uses a committee-based active learning algorithm to classify the query intentions in the query log. The decision tree and naive Bayes algorithms from MLlib, the Spark machine learning library, are chosen to build the committee. After studying the query (sample selection) mechanism of committee-based active learning, we found that the normalized entropy query-by-bagging (nEQB) mechanism requires experts to label a large sample set, and that this set contains many similar samples, which easily leads to problems such as repeated labeling. Therefore, we use an improved K-Means clustering to optimize the query strategy and implement this strategy on the Spark platform. The experimental results show that the proposed active learning classification algorithm performs similarly to the active learning algorithm that adopts the nEQB query mechanism, while reducing labeling labor costs and improving the efficiency of the algorithm.
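
The selection step described above can be illustrated with a minimal sketch. It is not the thesis's implementation: it assumes the common formulation in which the entropy of the committee's votes on a sample is divided by the log of the number of distinct labels voted, and it assumes that similar samples are avoided by keeping only the most contested sample in each cluster. The function names, the label strings, and the precomputed `cluster_ids` are hypothetical; the thesis itself runs on Spark with an improved K-Means, which is not reproduced here.

```python
import math
from collections import Counter


def normalized_vote_entropy(votes):
    """Committee disagreement on one sample, scaled to [0, 1].

    Assumed formulation: entropy of the vote distribution divided by
    log(number of distinct labels the committee voted for).
    """
    counts = Counter(votes)
    if len(counts) < 2:
        return 0.0  # full agreement, nothing to ask the annotator
    total = len(votes)
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(len(counts))


def select_queries(committee_votes, cluster_ids):
    """Return, per cluster, the index of the most contested sample.

    Keeping one sample per cluster is an illustrative way to avoid
    sending many near-duplicate samples for labeling.
    """
    best = {}  # cluster id -> (score, sample index)
    for idx, (votes, cluster) in enumerate(zip(committee_votes, cluster_ids)):
        score = normalized_vote_entropy(votes)
        if score > 0 and (cluster not in best or score > best[cluster][0]):
            best[cluster] = (score, idx)
    return sorted(idx for _, idx in best.values())


# Hypothetical votes from a four-member committee on four queries.
committee_votes = [
    ["nav", "nav", "nav", "nav"],
    ["nav", "info", "nav", "info"],
    ["nav", "nav", "nav", "info"],
    ["info", "trans", "nav", "info"],
]
cluster_ids = [0, 0, 0, 1]  # assumed output of a prior clustering step
to_label = select_queries(committee_votes, cluster_ids)
```

With these inputs, only one of the three samples in cluster 0 is sent for labeling, which is the kind of saving in annotation effort the abstract reports.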

Keywords/Search Tags:

query log, Spark, active learning, user clustering, query classification

Related items

1	A Study On The Methods Of Chinese Product Query Classification Based On User Behavior And Semantic Expansion
2	Design And Implementation Of A User Behavior System For Query Logs Based On Spark
3	Research On Key Techniques And Applications In Text Classification
4	Research On Personalized Query Based On User Behavior
5	Technology Researches Of Query Refinement Based On User Intent
6	An Ad-hoc Query Engine Based On Spark SQL
7	Research On Distributed Query Processing And Optimization Of RDF Data
8	Query Expansion Based On User Annotating Information
9	Research And Implementation Of Spatial Text Data Query Processing Technology
10	Design And Realization Of Optimized Query Strategy About Multi-Tenant SaaS Based Application