| Search engines have been among the most important tools for accessing Web resources for many years. Search effectiveness evaluation has attracted considerable attention from both industry and academia, since it is essential for understanding users’ experience and improving search quality. Search evaluation usually takes one of two forms. Offline methods evaluate system performance by calculating evaluation metrics based on relevance judgments. The advantage of offline evaluation is that the relevance judgments can be easily reused. However, the fixed user behavior assumptions behind offline metrics may fail for individual users. Online evaluation methods try to infer users’ preferences from implicit or explicit feedback. Online methods do not rely on specific user behavior assumptions, but their evaluation results can hardly be reused. The “benefit-cost” framework is widely adopted in various evaluation methods. In offline evaluation, most evaluation metrics take both benefit and cost into consideration: the benefit is usually estimated by document relevance, and the cost is encoded in decay functions. In online evaluation, a number of evaluation methods try to directly model users’ benefit and cost during search. Benefit and cost are also used to interpret users’ decisions when interacting with search engines. However, traditional benefit evaluation and cost estimation methods face a number of serious challenges: (1) Estimating benefit from document relevance ignores differences in cognitive ability between users, and it can hardly evaluate what users have learned during search. (2) Time is usually used as an indicator of users’ effort. However, we find that users’ perceived time does not necessarily equal the objective time measured by a clock. A number of factors, such as document relevance, duration length, and temporal relevance, may have an impact on perceived time. (3) Traditional evaluation metrics can hardly model the impact of mobile
touch interactions and heterogeneous results on users’ benefit and cost. The major concern behind these challenges is how to measure users’ benefit and cost while accounting for their cognitive factors, and how to improve search evaluation with benefit and cost. In this work, we tackle these problems from the following aspects. For benefit evaluation, traditional evaluation methods usually estimate users’ utility with document relevance. We propose a search success evaluation framework based on a machine translation model. In this framework, we formulate the search success evaluation problem as a machine translation evaluation problem: the ideal search outcome is treated as the reference, and the search outcome of an individual user as the translation. We can therefore adopt a machine translation evaluation metric to evaluate the utility users have derived from search. Experimental results show that the proposed evaluation method correlates well with assessors’ judgments. For cost estimation, we investigate the impact of document relevance, duration length, and temporal relevance on users’ time perception. We find that users tend to underestimate the time spent on relevant documents and in longer sessions. To the best of our knowledge, this is the first study to investigate the impact of different factors on time perception in the context of Web search. Experiments based on real users’ behavior show that evaluation metrics using perceived time correlate better with users’ satisfaction. For mobile evaluation, we investigate the impact of mobile interactions and result presentation on users’ benefit and cost. We propose a new evaluation metric, Height-Biased Gain (HBG), which is calculated by summing the product of gain distribution and discount factors, both modeled in terms of result height. To evaluate the effectiveness of the proposed metric, we compare the agreement of evaluation metrics with side-by-side user preferences on a test collection composed of four mobile search
engines. Experimental results show that HBG agrees with side-by-side user preferences better than all existing metrics. |
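
The abstract describes HBG only in words: a sum, over results, of a gain term times a discount term, with both terms depending on result height. The sketch below is one possible reading of that description, not the formulation used in this work. The `Result` class, the `viewport` parameter, the choice to scale gain by height, and the halving discount per viewport scrolled are all assumptions introduced here for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Result:
    relevance: float  # graded relevance, assumed to lie in [0, 1]
    height: float     # rendered height of the result, in pixels


def height_biased_gain_sketch(results: Sequence[Result],
                              viewport: float = 1000.0) -> float:
    """Sum of gain * discount over a ranked list (illustrative only).

    Assumption 1: a result's gain grows with its rendered height.
    Assumption 2: attention halves for every viewport of content
    the user must scroll past before reaching the result.
    """
    total = 0.0
    top = 0.0  # vertical offset of the current result's top edge
    for r in results:
        gain = r.relevance * (r.height / viewport)
        discount = 0.5 ** (top / viewport)
        total += gain * discount
        top += r.height
    return total


if __name__ == "__main__":
    page = [Result(0.0, 1000.0), Result(1.0, 1000.0)]
    print(height_biased_gain_sketch(page))
```

Under these assumptions, a relevant result that sits one full viewport down the page contributes half of what it would at the top, which is the kind of height-dependent discounting the description points to.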