Research On Resource Quality Evaluation For Web Search

Posted on: 2018-11-09
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X Li
GTID: 1368330566488338
Subject: Computer Science and Technology
Abstract/Summary:
Search engines are one of the main entry points to information resources on the Web, and both their index size and their user volume have grown greatly in recent years. However, the quality of search resources varies widely. Previous work on quality evaluation for search engine resources mainly focuses on Web spam detection, but the quality issues of search engine resources are not limited to Web spam. First, the Web pages indexed by search engines contain a large amount of low-quality data, such as fraudulent telephone numbers, fake news, and promotional information. Presenting this low-quality information to search engine users greatly degrades their experience. Second, to better satisfy user needs, search engines combine vertical resources (such as CQA portals) with traditional Web search results and return integrated result lists to users. If these search resources are manipulated to promote products, the overall quality of search engines suffers. Third, because search engines rely on implicit feedback, people may use click spam to promote the rankings of their Web pages in the search result list, which can lead to unsatisfying result lists. Finally, search engines incorporate the results of multiple vertical search resources, such as CQA portals, encyclopedia sites, and crowdsourcing platforms. Even if all the search results are of high quality, it remains an open problem how to choose among these heterogeneous search sources to improve users' search outcomes and satisfaction for different search tasks. This dissertation focuses on quality evaluation for Web search scenarios. The main contributions are summarized as follows:

Fraudulent information identification in search results: There is a large amount of low-quality content in general search results on SERPs. Take fraudulent support telephone numbers as an example: when a user searches for the official support telephone number of a certain product, Web pages containing fraudulent support telephone numbers sometimes appear on the SERP. Because it is then difficult for search users to find the desired result, this damages the interests of both merchants and Web users. This dissertation proposes a fraudulent support telephone number detection algorithm based on co-occurrence information. We construct a co-occurrence graph from the co-occurrence relationships of the telephone numbers that appear on Web pages. We then diffuse the trust scores of seed official support numbers and the distrust scores of seed fraudulent numbers over this graph to detect additional fraudulent numbers.

Malicious promotion information identification in vertical search results: Various kinds of vertical search resources have been integrated into SERPs, which can also cause quality issues if they are manipulated by spammers. For example, some crowdsourcing systems such as "Sandaha" provide paid services to organize promotion campaigns on CQA portals. As a result, when CQA users post a question in the community, they may receive multiple promotional answers instead of the information they want. We analyze the promotion campaigns in CQA and find that promoters must rely on promotion channels (such as URLs, telephone numbers, and social media accounts) to connect with users, and that these channels are irreplaceable for promotion activities. Based on this finding, we propose a "channel-answer" bipartite graph propagation algorithm to detect promotional information in CQA.

Search engine click spam detection based on bipartite graph propagation: Because search engines adjust the rankings of search results according to the number of user clicks on them, people may use click spam to raise the rankings of their Web pages in the search result list. This distorts search result ranking and hurts user experience. Through a detailed analysis of click spam behavior, we find that the patterns of spammers' search sessions differ from those of normal users' search sessions. We take into account both search actions and the time intervals between actions when modeling user sessions. From the modeled user sessions, we extract frequent sequential patterns and construct a "pattern-session" bipartite graph. We then obtain seed spam sessions from single click spam records and diffuse their spam scores over the bipartite graph to detect more spam sessions.

Investigation of user search behavior when facing heterogeneous search services: Search result lists contain results from heterogeneous search resources. When facing different search tasks, how should users select among high-quality search resources to improve their search outcomes and satisfaction? To answer this research question, we design multiple search tasks based on users' practical information needs and recruit subjects with different knowledge backgrounds to complete them. We provide a heterogeneous search environment that includes a general search engine, a general CQA portal, and a specialized CQA portal. For each task, subjects perform searches and give an answer, which is evaluated by the expert who designed the task. After each task, we collect search satisfaction feedback from the subjects. Two major conclusions are drawn from the analysis of the experimental results. First, CQA portals play an important role in users' search outcomes when they perform complex tasks: the more frequently a searcher uses CQA portals to complete a task, the more likely he or she is to arrive at a correct answer. Second, users' search satisfaction is not equivalent to their search outcome: for many search tasks with which users report being satisfied, their answers are not correct.
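The co-occurrence propagation described in the first contribution can be illustrated with a minimal Python sketch. The abstract does not give the update rule, so everything below is an assumption made for illustration: the function names `build_cooccurrence_graph` and `propagate_scores`, the use of +1 and -1 as seed trust and distrust scores, the damped weighted-average update, and the made-up phone numbers. It is not the dissertation's actual algorithm.

```python
from collections import defaultdict
from itertools import combinations


def build_cooccurrence_graph(pages):
    """Weight each pair of numbers by how many pages list both of them."""
    graph = defaultdict(lambda: defaultdict(int))
    for numbers in pages:
        for a, b in combinations(sorted(set(numbers)), 2):
            graph[a][b] += 1
            graph[b][a] += 1
    return graph


def propagate_scores(graph, trusted, fraudulent, damping=0.85, iterations=20):
    """Diffuse +1 (trusted) and -1 (fraudulent) seed scores over the graph.

    Assumed update rule: each non-seed node takes the damped, edge-weighted
    average of its neighbours' scores; seed nodes stay clamped.
    """
    seeds = {n: 1.0 for n in trusted}
    seeds.update({n: -1.0 for n in fraudulent})
    nodes = set(graph) | set(seeds)
    scores = {n: seeds.get(n, 0.0) for n in nodes}
    for _ in range(iterations):
        updated = {}
        for node in nodes:
            if node in seeds:
                updated[node] = seeds[node]
                continue
            neighbours = graph.get(node, {})
            total = sum(neighbours.values())
            if total == 0:
                updated[node] = 0.0
                continue
            average = sum(w * scores[m] for m, w in neighbours.items()) / total
            updated[node] = damping * average
        scores = updated
    return scores


# Hypothetical data: each inner list holds the numbers found on one Web page.
pages = [
    ["400-100", "400-200"],
    ["400-200", "400-300"],
    ["139-999", "139-888"],
    ["139-888", "139-777"],
]
graph = build_cooccurrence_graph(pages)
scores = propagate_scores(graph, trusted={"400-100"}, fraudulent={"139-999"})
suspects = sorted(n for n, s in scores.items() if s < 0)
```

In this toy example, numbers that share pages with the trusted seed end up with positive scores, while numbers that share pages with the fraudulent seed end up negative and appear in `suspects`. The same clamped-seed diffusion idea could be adapted to the "channel-answer" and "pattern-session" bipartite graphs by letting the two node types alternate, though the abstract does not specify those details either.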
Keywords/Search Tags: Search Resource, Quality Evaluation, Spam Detection, Bipartite Graph Propagation, User Behavior Analysis