Study Of The Web Search Engine-Related Technology

Posted on:2012-02-06

Degree:Master

Type:Thesis

Country:China

Candidate:C Shi

GTID:2218330368488570

Subject:Information Science

Abstract/Summary:

With the continuous development of computer technology, information resources continue to expand, and a variety of information retrieval systems have emerged to help users find the resources they need. A Web search engine (such as Google or Baidu) is a special kind of information retrieval system: it searches pages across the entire range of Web resources, which distinguishes it from general information retrieval systems (such as document retrieval systems). The number of Web resources on the Internet is very large and constantly changing. More importantly, a Web page is a semi-structured or unstructured document that often contains navigation, advertising, useless links and content irrelevant to the subject of the page, so it is far more complex than an ordinary text document. General information retrieval systems are mostly designed around the vector space model and cannot adapt to these characteristics of Web resources, so Web search engines work very differently from them. This article focuses on indexing, query expansion and the ranking of Web pages to elaborate the differences between them.

The main contents of this article are as follows:

This article describes the structure of the Web search engine index. Because Web pages contain a large amount of irrelevant information, such as advertising and navigation, that reduces the efficiency of indexing, it gives algorithms for Web page preprocessing and text extraction that remove duplicate pages, noise content and noise links, which improves the efficiency of the search engine's index.
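As a rough illustration of the preprocessing step described above, the following Python sketch strips noise regions from a page, extracts the remaining text, and drops pages whose extracted text duplicates one already seen. It is not the thesis's algorithm: the list of noise tags, the exact-match SHA-1 fingerprint, and the function names `extract_text`, `fingerprint` and `deduplicate` are all assumptions chosen for the example.

```python
import hashlib
from html.parser import HTMLParser

# Tags whose contents are treated as noise in this sketch (an assumption,
# not the rule set used in the thesis).
NOISE_TAGS = {"script", "style", "nav", "header", "footer", "aside"}


class TextExtractor(HTMLParser):
    """Collects visible text while skipping anything inside NOISE_TAGS."""

    def __init__(self):
        super().__init__()
        self.noise_depth = 0  # > 0 while inside at least one noise tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.noise_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self.noise_depth > 0:
            self.noise_depth -= 1

    def handle_data(self, data):
        if self.noise_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Return the page text with noise regions removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def fingerprint(text):
    """Hash the text after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def deduplicate(pages):
    """pages maps URL -> HTML; returns URL -> text for the first copy of each text."""
    seen = set()
    unique = {}
    for url, html in pages.items():
        text = extract_text(html)
        fp = fingerprint(text)
        if fp not in seen:
            seen.add(fp)
            unique[url] = text
    return unique
```

An exact hash only catches pages whose cleaned text is identical; a production system would more likely use a near-duplicate measure such as shingling, which this sketch leaves out for brevity.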
In this paper, correlation search is realized by combining user interests with log mining on the server side.

The traditional PageRank algorithm exhibits a "topic drift" phenomenon, which brings in a lot of noisy, irrelevant information, so this article proposes a PageRank algorithm based on the correlation of the page subject. The algorithm judges the correlation between a Web page and the query topic from the hyperlink content of the page and from user clicks, so that pages with a higher correlation with the user query and more user clicks receive a higher PageRank value.

This article also presents an automatic summarization algorithm, which calculates the weight of each sentence to select the sentences that best express the thematic content of the page, so that users can grasp the theme of a Web document intuitively and quickly.
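The idea of a PageRank that favors pages correlated with the query topic and pages with more user clicks can be sketched as follows. This is a minimal illustration under stated assumptions, not the formula from the thesis: the function name `weighted_pagerank`, the equal 0.5/0.5 mix of relevance and normalized clicks, and the damping factor of 0.85 are all hypothetical choices. Each page splits its rank among its out-links in proportion to the target's weight instead of evenly.

```python
def weighted_pagerank(links, relevance, clicks, damping=0.85, iterations=50):
    """Relevance- and click-weighted PageRank (illustrative sketch).

    links:     dict mapping page -> list of pages it links to
    relevance: dict mapping page -> topic relevance score in [0, 1]
    clicks:    dict mapping page -> observed user click count
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)
    if n == 0:
        return {}

    max_clicks = max(clicks.get(p, 0) for p in pages)

    def weight(page):
        # Assumed mix: half topic relevance, half normalized clicks.
        # The small constant keeps every weight strictly positive.
        click_part = clicks.get(page, 0) / max_clicks if max_clicks else 0.0
        return 0.5 * relevance.get(page, 0.0) + 0.5 * click_part + 1e-6

    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if not targets:
                # Dangling page: spread its rank uniformly over all pages.
                for q in pages:
                    new_rank[q] += damping * rank[p] / n
                continue
            total = sum(weight(q) for q in targets)
            for q in targets:
                # A more relevant, more clicked target gets a larger share.
                new_rank[q] += damping * rank[p] * weight(q) / total
        rank = new_rank
    return rank
```

With uniform weights this reduces to ordinary PageRank, so the weighting is the only place where topic correlation and click data enter; in practice the relevance scores would come from comparing anchor text with the query, and the click counts from server logs.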

Keywords/Search Tags:

web search engine, indexing, query expansion, ranking of web pages

Related items

1	Research On Ranking And Query Expansion Based On Polysemy
2	Research On The Scheduling Strategy Of Meta Search Engine And Results Ranking Algorithm
3	Query Expansion Research In Personalized Intelligent Search Engine
4	Design And Implementation Of Query Expansion Module In Search Engine
5	Improvements On An Algorithm For Ranking In Search Engine Based On Web Log Mining
6	Research On Key Techniques Of Vertical Search Engine Based On Lucene
7	Research And Implementation Of Meta Search Engine
8	Research On Intelligent Search Engine Based On Knowledge Database
9	Optimization Technology Study And Implementation Of Web Pages Ranking For Meta Search Engine
10	Research On Ranking Method Of Vertical Search Engine