
Research and Implementation of the Search Strategy for a Topic-Focused Search Engine Spider

Posted on: 2011-01-14    Degree: Master    Type: Thesis
Country: China    Candidate: L Xia    Full Text: PDF
GTID: 2178360305985242    Subject: Computer application technology
Abstract/Summary:
A topic spider is an automated program that gathers data from the web at the back end of a topic-focused search engine. The data queried at the front end is first fetched from the Internet by the spider and stored on local disks; the page content is then extracted and indexed. In this sense, a topic search engine is built on top of its topic spider.

This thesis proposes a topic-priority crawling algorithm that combines an anchor-text relevance algorithm with a topic-information inheritance and updating algorithm. The algorithm steers the spider's crawl, and the fetched data is stored in a PostgreSQL database cluster.

Exploiting the link structure of web pages, the topic-priority crawling algorithm predicts the relevance of unvisited pages by propagating topic information between pages. This mitigates the "tunneling" problem, where relevant pages are separated by chains of irrelevant ones, and reduces missed captures. A relevance value is first assigned from a link's anchor text: if the anchor text is relevant to the topic, the full relevance threshold is propagated directly; otherwise the inherited value is multiplied by an inheritance (decay) ratio before being passed on. During propagation, the relevance value is reset to its initial value whenever a relevant page is encountered. These relevance messages partition pages of different topics into separate "channels": pages related to the target topic cluster in the largest channels, the channels are connected to one another, and the spider crawls them in order of size.

Because the volume of web data to be crawled is far too large for a single machine to store, this thesis extends the storage capacity of the page repository and the link-address repository with a PostgreSQL database cluster, and deploys a PgBouncer connection pool on every database node to reduce connection setup time.
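The relevance-propagation rule described above can be sketched as follows. This is a minimal illustration, not the thesis's actual implementation: the function names, the threshold value, and the inheritance ratio are all assumptions chosen for the example.

```python
# Illustrative sketch of topic-priority score propagation.
# RELEVANCE_THRESHOLD and INHERITANCE_RATIO are assumed values,
# not figures taken from the thesis.

RELEVANCE_THRESHOLD = 0.5   # score propagated when the anchor text is on-topic
INHERITANCE_RATIO = 0.7     # decay ratio applied when the anchor text is off-topic


def anchor_relevance(anchor_text: str, topic_terms: set) -> float:
    """Fraction of topic terms that appear in the link's anchor text."""
    if not topic_terms:
        return 0.0
    words = set(anchor_text.lower().split())
    return len(words & topic_terms) / len(topic_terms)


def propagate_score(parent_score: float, anchor_text: str,
                    topic_terms: set) -> float:
    """Relevance score handed down to a child URL.

    If the anchor text looks relevant, pass the relevance threshold on
    directly (resetting any earlier decay); otherwise multiply the
    parent's score by the inheritance ratio, so the spider can still
    cross a few hops of off-topic pages ("tunnels") before giving up.
    """
    if anchor_relevance(anchor_text, topic_terms) > 0:
        return max(parent_score, RELEVANCE_THRESHOLD)
    return parent_score * INHERITANCE_RATIO
```

A frontier queue ordered by this score would then yield the topic-priority crawl order: children reached through relevant anchors keep a high score, while long chains of irrelevant links decay geometrically and drop below the crawl cutoff.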
A cache in front of the link-address repository further reduces the number of database connections, which shortens crawl time and raises the spider's throughput. Finally, the feasibility of a topic spider built on the topic-priority crawling technique is validated through experiments and data analysis.
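One plausible shape for such a link-address cache is an in-memory LRU set that answers "have we already seen this URL?" without a database round trip. The class below is a sketch under that assumption; the thesis does not specify the cache's data structure or capacity.

```python
# Illustrative in-memory LRU cache of crawled link addresses.
# The capacity and eviction policy are assumptions for this sketch.

from collections import OrderedDict


class SeenUrlCache:
    """Records URLs and answers membership queries without hitting the DB."""

    def __init__(self, capacity: int = 100_000):
        self._cache = OrderedDict()
        self._capacity = capacity

    def seen(self, url: str) -> bool:
        """Return True if the URL was already recorded; otherwise record it.

        A miss here would fall through to the link-address repository in
        the database; a hit saves that connection entirely.
        """
        if url in self._cache:
            self._cache.move_to_end(url)  # refresh recency on a hit
            return True
        self._cache[url] = True
        if len(self._cache) > self._capacity:
            self._cache.popitem(last=False)  # evict least-recently-used entry
        return False
```

Since most duplicate links on a site point at recently discovered pages, even a modest LRU cache absorbs the bulk of the lookups that would otherwise each cost a database connection.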
Keywords/Search Tags: Spider, Search Engine, PostgreSQL, Topic Relevance, Database Cluster