| Search engines are the core of internet retrieval technology. With the rapid growth of the internet, general-purpose search engines can no longer meet users' needs, which has driven the rapid development of vertical search engines. A vertical search engine is more targeted, but it still has shortcomings: it searches only information on a specific topic, and it does not eliminate the influence of web spam. This paper studies a vertical search engine that can combat web spam. First, the impact of web spam on search results must be eliminated in order to improve the precision of the search engine. Cloaking techniques are now widely applied in all kinds of web pages, which greatly reduces search precision, and most web pages are HTML pages. We therefore study the various forms of page hiding, analyze the source code of HTML documents, and design an algorithm to detect web spam. Second, we study the PageRank algorithm, which computes the importance of a page and also serves as a criterion for judging the quality of a website. Because the computation of the PR value does not account for the influence of web spam on the final ranking, the resulting ranking is unfair. To rank fairly, this paper modifies PageRank by evaluating different kinds of web spam and assigning low values to spam pages and to the pages related to them. Finally, we build a vertical search engine that can combat web spam using Heritrix, Lucene, Nutch, and other open-source tools. The engine first acquires pages from the internet, then detects web spam, builds the index after eliminating part of the web spam, and applies the improved PageRank algorithm during ranking so that the results are fairer. An experiment with this search engine then checks the search results through comparison. |
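
The abstract does not give the exact form of the modified PageRank, so the following is only a minimal sketch of one way a spam penalty could enter the computation. It assumes each page already carries a spam score in [0, 1] from an earlier detection step. The function name `spam_aware_pagerank`, the `spam_score` mapping, and the rule of scaling transferred rank by `1 - spam_score` on both the sending and the receiving page are illustrative assumptions, not the paper's method.

```python
from typing import Dict, List


def spam_aware_pagerank(
    links: Dict[str, List[str]],
    spam_score: Dict[str, float],
    damping: float = 0.85,
    iterations: int = 50,
) -> Dict[str, float]:
    """Power-iteration PageRank with a hypothetical spam penalty.

    links maps each page to the pages it links to.
    spam_score maps a page to a value in [0, 1]; missing pages count as 0.
    """
    # Collect every page that appears as a source or as a target.
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    # Assumption: trust is simply the complement of the spam score.
    trust = {p: 1.0 - spam_score.get(p, 0.0) for p in pages}

    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for src in pages:
            targets = links.get(src, [])
            if not targets:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[src] / n
                for p in pages:
                    new_rank[p] += share
                continue
            # A spam page passes on less rank (penalizes pages related to it).
            share = damping * rank[src] * trust[src] / len(targets)
            for dst in targets:
                # A spam page also receives less rank.
                new_rank[dst] += share * trust[dst]
        # Renormalize so the ranks still sum to 1 after the penalty.
        total = sum(new_rank.values())
        rank = {p: v / total for p, v in new_rank.items()}
    return rank


if __name__ == "__main__":
    graph = {"a": ["b", "spam"], "b": ["a"], "spam": ["a"]}
    for page, value in spam_aware_pagerank(graph, {"spam": 0.9}).items():
        print(f"{page}: {value:.4f}")
```

In this toy graph, `b` and `spam` have identical link structure, so an unpenalized run ranks them equally, while a high spam score pushes `spam` below `b`. Renormalizing after each iteration is a design choice of this sketch that keeps the scores comparable across runs.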