
Research and Implementation of the Search Strategy for a Topic-Focused Search Engine Spider

Posted on: 2011-01-14    Degree: Master    Type: Thesis
Country: China    Candidate: L Xia    Full Text: PDF
GTID: 2178360305985242    Subject: Computer application technology
Abstract/Summary:
A topic spider is an automated program that gathers data from the web at the back end of a topic-focused search engine. The data queried at the front end is first fetched from the Internet by the spider and stored on local disks; the page content is then extracted and indexed. In this sense, a topic search engine is built on top of its topic spider.

This thesis proposes a topic-priority crawling algorithm that combines an anchor-text relevance algorithm with a topic-information inheritance and updating algorithm. The algorithm steers the spider's crawl, and the fetched data is stored in a PostgreSQL database cluster.

Exploiting the link structure of web pages, the topic-priority crawling algorithm predicts the relevance of unvisited pages by propagating topic information between pages. This mitigates the "tunneling" problem, where relevant pages are separated by chains of irrelevant ones, and reduces missed captures. A relevance value is first assigned from a link's anchor text: if the anchor text is relevant to the topic, the full relevance threshold is propagated directly; otherwise the inherited value is multiplied by an inheritance (decay) ratio before being passed on. During propagation, the relevance value is reset to its initial value whenever a relevant page is encountered. These relevance messages partition pages of different topics into separate "channels": pages related to the target topic cluster in the largest channels, the channels are connected to one another, and the spider crawls them in order of size.

Because the volume of web data to be crawled is far too large for a single machine to store, this thesis extends the storage capacity of the page repository and the link-address repository with a PostgreSQL database cluster, and deploys a PgBouncer connection pool on every database node to reduce connection setup time.
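The relevance-propagation rule described above can be sketched as follows. This is a minimal illustration, not the thesis's actual implementation: the function names, the threshold value, and the inheritance ratio are all assumptions chosen for the example.

```python
# Illustrative sketch of topic-priority score propagation.
# RELEVANCE_THRESHOLD and INHERITANCE_RATIO are assumed values,
# not figures taken from the thesis.

RELEVANCE_THRESHOLD = 0.5   # score propagated when the anchor text is on-topic
INHERITANCE_RATIO = 0.7     # decay ratio applied when the anchor text is off-topic


def anchor_relevance(anchor_text: str, topic_terms: set) -> float:
    """Fraction of topic terms that appear in the link's anchor text."""
    if not topic_terms:
        return 0.0
    words = set(anchor_text.lower().split())
    return len(words & topic_terms) / len(topic_terms)


def propagate_score(parent_score: float, anchor_text: str,
                    topic_terms: set) -> float:
    """Relevance score handed down to a child URL.

    If the anchor text looks relevant, pass the relevance threshold on
    directly (resetting any earlier decay); otherwise multiply the
    parent's score by the inheritance ratio, so the spider can still
    cross a few hops of off-topic pages ("tunnels") before giving up.
    """
    if anchor_relevance(anchor_text, topic_terms) > 0:
        return max(parent_score, RELEVANCE_THRESHOLD)
    return parent_score * INHERITANCE_RATIO
```

A frontier queue ordered by this score would then yield the topic-priority crawl order: children reached through relevant anchors keep a high score, while long chains of irrelevant links decay geometrically and drop below the crawl cutoff.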
A cache in front of the link-address repository further reduces the number of database connections, which shortens crawl time and raises the spider's throughput. Finally, the feasibility of a topic spider built on the topic-priority crawling technique is validated through experiments and data analysis.
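One plausible shape for such a link-address cache is an in-memory LRU set that answers "have we already seen this URL?" without a database round trip. The class below is a sketch under that assumption; the thesis does not specify the cache's data structure or capacity.

```python
# Illustrative in-memory LRU cache of crawled link addresses.
# The capacity and eviction policy are assumptions for this sketch.

from collections import OrderedDict


class SeenUrlCache:
    """Records URLs and answers membership queries without hitting the DB."""

    def __init__(self, capacity: int = 100_000):
        self._cache = OrderedDict()
        self._capacity = capacity

    def seen(self, url: str) -> bool:
        """Return True if the URL was already recorded; otherwise record it.

        A miss here would fall through to the link-address repository in
        the database; a hit saves that connection entirely.
        """
        if url in self._cache:
            self._cache.move_to_end(url)  # refresh recency on a hit
            return True
        self._cache[url] = True
        if len(self._cache) > self._capacity:
            self._cache.popitem(last=False)  # evict least-recently-used entry
        return False
```

Since most duplicate links on a site point at recently discovered pages, even a modest LRU cache absorbs the bulk of the lookups that would otherwise each cost a database connection.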
Keywords/Search Tags: Spider, Search Engine, PostgreSQL, Topic Relevance, Database Cluster