
The Research Of Query Classification And Clustering Based On Word Embedding

Posted on: 2016-09-29
Degree: Master
Type: Thesis
Country: China
Candidate: H B Yang
Full Text: PDF
GTID: 2308330461974013
Subject: Computer application technology
Abstract/Summary:
With the rapid development of the Internet, the volume of digital information online is growing exponentially, and it is increasingly difficult to find specific information in this ocean of data. Web search engines, which are designed to retrieve information on the World Wide Web, are indispensable for helping users acquire new knowledge, news, and so on. Search behavior reflects users' interests and needs, both directly and indirectly, and the queries that users submit are the most important part of that behavior. Mining and analyzing queries is a fundamental technology for search engines and supports applications such as online advertising, search engine optimization, and personalization. Query topic classification and clustering are the most widely used query mining techniques.

Most queries are short and ambiguous, so few keyword features can be extracted per query. In addition, training a text classifier with a supervised machine learning algorithm requires a large training corpus, and constructing such a corpus manually is time-consuming.

In contrast to previous work, the query topic classification and clustering algorithms proposed here make four improvements:

1) Word embedding is first introduced into query feature extraction and representation. Because web queries are typically short and yield few features per query, word embedding is used to alleviate this feature sparseness problem, which significantly improves query topic classification and clustering performance. Moreover, only query search logs are required to train the word embeddings; no extra text corpus is needed.

2) A novel algorithm named CT-Word2Vec, based on the Word2Vec algorithm, is proposed. It incorporates information about web users' search behavior into the word embedding learning process. We show that CT-Word2Vec outperforms the Word2Vec and Bag-of-Words algorithms on query topic classification and clustering tasks.

3) Another algorithm, Topic-Word2Vec, is proposed to learn topic-oriented word embeddings by incorporating query topic information into the word embedding learning process. Experiments show that Topic-Word2Vec enhances the topic discrimination of the word embeddings and thus improves classification performance.

4) A new query annotation method based on CT-Word2Vec and clustering algorithms is proposed. The quality and quantity of training data directly affect classifier performance, but annotating millions of queries is time-consuming and expensive. Experiments show that our method outperforms existing methods on precision and recall.
Keywords/Search Tags:Query, Word Embedding, Word2Vec, Classification, Clustering, Topic
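
Point 1 of the abstract describes representing short queries with word embeddings to reduce feature sparseness. The following is a minimal sketch of one common way to do that, averaging the vectors of a query's words and comparing queries by cosine similarity. It is not the thesis's CT-Word2Vec or Topic-Word2Vec method: the vectors in `TOY_VECTORS` are hypothetical hand-picked values, and the helper names `embed_query` and `cosine` are assumptions made for illustration. In practice the vectors would be learned from query logs.

```python
import math

# Hypothetical 3-dimensional word vectors, chosen by hand for illustration.
# Real embeddings would be trained (e.g. with Word2Vec) on query search logs.
TOY_VECTORS = {
    "cheap":    [0.9, 0.1, 0.0],
    "flights":  [0.8, 0.2, 0.1],
    "airline":  [0.7, 0.3, 0.1],
    "tickets":  [0.8, 0.1, 0.2],
    "python":   [0.0, 0.9, 0.3],
    "tutorial": [0.1, 0.8, 0.4],
}


def embed_query(query, vectors):
    """Average the vectors of the query's known words.

    Returns None when no word of the query has a vector.
    """
    known = [vectors[w] for w in query.lower().split() if w in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


q_travel_1 = embed_query("cheap flights", TOY_VECTORS)
q_travel_2 = embed_query("airline tickets", TOY_VECTORS)
q_code = embed_query("python tutorial", TOY_VECTORS)

# The two travel queries share no words, yet their averaged vectors are close.
print(cosine(q_travel_1, q_travel_2) > cosine(q_travel_1, q_code))
```

With a bag-of-words representation, "cheap flights" and "airline tickets" have no terms in common and would look unrelated. With dense vectors they can still end up near each other, which is the property that a downstream classifier or clustering algorithm relies on.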