Research on Search Engine Web Spam Detection

Posted on: 2012-12-20
Degree: Master
Type: Thesis
Country: China
Candidate: D M Zhu
Full Text: PDF
GTID: 2218330338963037
Subject: Computer software and theory
Abstract/Summary:
Search engine spam refers to any deceptive practice intended to raise a page's ranking in search engine results. In recent years, web spam techniques have become increasingly widespread and have seriously degraded the quality of search results. Identifying and preventing spam is regarded as one of the top challenges for web search engines, which makes efficient web spam detection algorithms a promising research area. The main work and contributions of this thesis are:

1. Web spam detection from the perspective of the website itself. Supervised machine-learning approaches to web spam detection require a large number of labeled web pages, which are expensive to obtain. To address this problem, a new semi-supervised learning algorithm named HFSSL (harmonic functions based semi-supervised learning) is proposed. It trains a semi-supervised learner on a weighted graph made up of labeled and unlabeled web pages, and thus makes full use of the information in the unlabeled pages. In the graph, web pages are connected to each other according to their similarity, in order to avoid the problem of imprecision in semi-supervised learning. Experiments show that the algorithm is effective in terms of precision, recall and F-measure.

2. Web spam detection from the perspective of search engine users. A search engine query log records the interaction between the search engine and its users. The clicked URLs and the order in which they are clicked reflect the users' preferences. The contribution of this thesis is a modified dynamic Bayesian model named M-DBM, which is used to model the click actions recorded in the log. M-DBM mines the causal relationships between the URLs in the result list returned by a search engine in order to estimate the relatedness of the query and each URL. In this way, M-DBM produces a ranking of web pages and lowers the position of web spam pages. Experiments show that the proposed M-DBM outperforms other existing click models.
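The graph-based idea behind the first contribution can be illustrated with the standard harmonic-function solution for semi-supervised learning, in which the scores of unlabeled nodes are obtained by solving a linear system on the graph Laplacian. The sketch below is not the thesis's HFSSL implementation: the function name, the toy similarity matrix `W`, and the label encoding (1 = spam, 0 = normal) are all assumptions made for illustration.

```python
import numpy as np

def harmonic_scores(W, labeled_idx, labels):
    """Return (unlabeled_idx, f_u) for a symmetric similarity matrix W.

    f_u solves L_uu f_u = W_ul f_l, where L = D - W is the graph Laplacian,
    so each unlabeled score equals the weighted average of its neighbours.
    """
    n = W.shape[0]
    labeled_idx = np.asarray(labeled_idx)
    unlabeled_idx = np.setdiff1d(np.arange(n), labeled_idx)
    laplacian = np.diag(W.sum(axis=1)) - W
    L_uu = laplacian[np.ix_(unlabeled_idx, unlabeled_idx)]
    W_ul = W[np.ix_(unlabeled_idx, labeled_idx)]
    f_l = np.asarray(labels, dtype=float)
    f_u = np.linalg.solve(L_uu, W_ul @ f_l)
    return unlabeled_idx, f_u

# Hypothetical toy graph: four pages in a chain 0 - 1 - 2 - 3 with unit similarity.
W = np.array([
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
])

# Page 0 is labeled spam (1), page 3 is labeled normal (0); pages 1 and 2 are unlabeled.
unlabeled_idx, f_u = harmonic_scores(W, labeled_idx=[0, 3], labels=[1, 0])

# Threshold the harmonic scores at 0.5 to obtain a spam / normal decision.
predicted_spam = {int(i): bool(s > 0.5) for i, s in zip(unlabeled_idx, f_u)}
```

In this toy chain, the page adjacent to the labeled spam page receives a score of 2/3 and the page adjacent to the labeled normal page receives 1/3, so only the former is flagged. How the thesis builds the similarity weights between real web pages is not stated in the abstract.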
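The abstract does not describe how M-DBM modifies the dynamic Bayesian click model, so the sketch below shows only the baseline idea it builds on: a simplified dynamic Bayesian network click model estimated by counting. The function name, the session format, and the handling of sessions without clicks (every result is treated as examined) are assumptions made for illustration, not details taken from the thesis.

```python
from collections import defaultdict

def simplified_dbn_relevance(sessions):
    """Estimate relevance = attractiveness * satisfaction per (query, url).

    Each session is (query, urls_in_rank_order, set_of_clicked_ranks).
    Assumption: the user examines results top-down and stops at the last
    click; a session with no clicks is treated as examining every result.
    """
    examined = defaultdict(int)
    clicked = defaultdict(int)
    satisfied = defaultdict(int)
    for query, urls, clicked_ranks in sessions:
        last = max(clicked_ranks) if clicked_ranks else len(urls) - 1
        for rank in range(last + 1):
            key = (query, urls[rank])
            examined[key] += 1
            if rank in clicked_ranks:
                clicked[key] += 1
                if rank == last:
                    # The last click is taken as the one that satisfied the user.
                    satisfied[key] += 1
    relevance = {}
    for key, n_examined in examined.items():
        attractiveness = clicked[key] / n_examined
        satisfaction = satisfied[key] / clicked[key] if clicked[key] else 0.0
        relevance[key] = attractiveness * satisfaction
    return relevance

# Hypothetical log: the top result attracts clicks but never ends a session.
sessions = [
    ("cheap flights", ["spam.example", "good.example"], {0, 1}),
    ("cheap flights", ["spam.example", "good.example"], {1}),
    ("cheap flights", ["spam.example", "good.example"], {0, 1}),
]
relevance = simplified_dbn_relevance(sessions)
reranked = sorted(["spam.example", "good.example"],
                  key=lambda u: relevance[("cheap flights", u)], reverse=True)
```

A page that is often clicked but rarely the last click gets a low satisfaction estimate, which pushes it down when results are re-ranked by the estimated relevance. This matches the abstract's description of lowering the position of spam pages only in spirit.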
Keywords/Search Tags: search engine, web spam, semi-supervised learning, harmonic functions, query log, click model, user behavior