Study Of The Web Search Engine-Related Technology

Posted on: 2012-02-06
Degree: Master
Type: Thesis
Country: China
Candidate: C Shi
GTID: 2218330368488570
Subject: Information Science
Abstract/Summary:
As computer technology develops, the volume of information resources continues to grow, and a variety of information retrieval systems have emerged to help users find the resources they need. A Web search engine (such as Google or Baidu) is a special kind of information retrieval system: it searches pages across the entire range of Web resources, which sets it apart from general information retrieval systems such as document retrieval systems. The number of Web resources on the Internet is very large and constantly changing. More importantly, a Web page is a semi-structured or unstructured document that often contains navigation, advertising, useless links, and content irrelevant to the subject of the page, so it is far more complex than an ordinary text document. General information retrieval systems are mostly designed around the vector space model and cannot adapt to these characteristics of Web resources, so a Web search engine works very differently from them. This thesis examines those differences in three areas: indexing, query expansion, and the ranking of Web pages.

The main contents of this thesis are as follows:

The thesis describes the index structure of a Web search engine. Web pages contain a large amount of irrelevant information, such as advertising and navigation, that reduces indexing efficiency. To address this, the thesis gives algorithms for Web page preprocessing and text extraction that remove duplicate pages, noise content, and noise links, which improves the efficiency of the search engine's index.
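The abstract does not give the preprocessing algorithm itself, so the following is only a minimal sketch of the general idea, under stated assumptions: noise is approximated by a fixed, hypothetical set of HTML tags (`NOISE_TAGS`), and duplicate pages are detected by hashing the normalized extracted text. The names `TextExtractor`, `extract_text`, `fingerprint`, and `deduplicate` are illustrative and do not come from the thesis.

```python
import hashlib
from html.parser import HTMLParser

# Assumption: these tags usually hold navigation, scripts, ads, or other
# boilerplate. A real system would need a more careful noise detector.
NOISE_TAGS = {"script", "style", "nav", "aside", "footer", "header"}


class TextExtractor(HTMLParser):
    """Collects visible text, skipping anything nested inside a noise tag."""

    def __init__(self):
        super().__init__()
        self.noise_depth = 0  # > 0 while we are inside a noise tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.noise_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self.noise_depth > 0:
            self.noise_depth -= 1

    def handle_data(self, data):
        if self.noise_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Return the page text with noise-tag content removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def fingerprint(text):
    """Hash the text after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def deduplicate(pages):
    """pages maps URL -> HTML. Keep the first page seen for each distinct text."""
    seen = set()
    unique = {}
    for url, html in pages.items():
        text = extract_text(html)
        key = fingerprint(text)
        if key not in seen:
            seen.add(key)
            unique[url] = text
    return unique
```

This sketch catches only pages whose extracted text is identical after normalization. Detecting near-duplicates would need a similarity measure over the text instead of an exact hash.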
The thesis implements a correlation search algorithm by combining user interests with server-side log mining. The traditional PageRank algorithm suffers from "topic drift", which brings in a lot of irrelevant noise, so the thesis proposes a PageRank algorithm based on the correlation between a page and the topic. The algorithm judges the correlation between a Web page and the query topic from the content of the page's hyperlinks and from user clicks, so that pages that correlate more strongly with the user's query and receive more user clicks obtain a higher PageRank value.

The thesis also presents an automatic summarization algorithm. It calculates a weight for each sentence and selects the sentences that best express the thematic content of the page, so that users can grasp the topic of a Web document intuitively and quickly.
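The abstract does not state the formula for the topic-correlated PageRank, so the sketch below is one possible reading rather than the thesis's method. It assumes each page has a positive weight that combines a relevance score with a click count (the hypothetical `page_weight` helper), and that a page passes its rank to its link targets in proportion to those weights instead of evenly. All names and the weighting formula are assumptions made for illustration.

```python
import math


def page_weight(relevance, clicks):
    """Hypothetical weight: relevance score boosted by a damped click count."""
    return relevance * (1.0 + math.log1p(clicks))


def weighted_pagerank(links, weight, damping=0.85, iterations=50):
    """
    links:  dict mapping each page to the list of pages it links to
            (every target must also be a key of `links`).
    weight: dict mapping each page to a positive weight.
    Returns a dict of rank values that sum to 1.
    """
    pages = list(links)
    n = len(pages)
    total_weight = sum(weight[p] for p in pages)
    rank = {p: 1.0 / n for p in pages}

    for _ in range(iterations):
        # Every page receives the uniform "random jump" share first.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for src in pages:
            targets = links[src]
            out_weight = sum(weight[t] for t in targets)
            if out_weight > 0:
                # Split the rank of src among its targets by target weight.
                for t in targets:
                    new_rank[t] += damping * rank[src] * weight[t] / out_weight
            else:
                # Dangling page: spread its rank over all pages by weight.
                for p in pages:
                    new_rank[p] += damping * rank[src] * weight[p] / total_weight
        rank = new_rank
    return rank
```

With equal weights this reduces to ordinary PageRank. With unequal weights, a page that is more relevant to the query and clicked more often receives a larger share of the rank flowing out of the pages that link to it.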
Keywords/Search Tags: web search engine, indexing, query expansion, ranking of web pages