Topic-Specific Crawling And Search Routing Research Based On Peer-to-Peer Network

Posted on: 2011-01-01  Degree: Master  Type: Thesis
Country: China  Candidate: J Liu  Full Text: PDF
GTID: 2178360305483082  Subject: Computer application technology
Abstract/Summary:
With the explosive growth of information on the Internet, search engines have become an essential means of accessing information effectively. To meet users' search requirements on specialized subjects and specific topics, topic-specific search engines are becoming increasingly popular in data mining and information retrieval. Based on a study of the working principles and key technologies of search engines, this thesis analyzes the feasibility and importance of implementing a focused crawler in a Peer-to-Peer network environment. It also designs a general framework for a Peer-to-Peer focused crawler, PPSpider, which has been implemented and used for experiments on the crawling algorithm and its optimization. The main contributions are as follows:
(1) Considering the difficulty of establishing a topic model, the thesis creates a term vocabulary for each theme according to a universal catalog classification, and then mines the local documents on each peer to build the peer's vector space model (VSM). The topic model is established by calculating the relevance between the peer's VSM and the VSM of each term vocabulary.
(2) Universal starting URLs tend to miss some edge links and hidden-web pages. The thesis therefore proposes choosing local seeds from each peer's logs, which record the peer's topic trend. With the local seeds included, the new starting URLs not only cover more popular pages but also reach web information on edge pages. In other words, they increase PPSpider's coverage.
(3) To address the low efficiency caused by peers frequently joining and leaving the distributed network, the thesis puts forward a mechanism named sortURL to optimize the crawling algorithm: the URLs in the to-be-crawled queue are ordered by the distance between each URL's hash value and the peer's hash value. The method reduces duplicated URLs and improves crawling efficiency.
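The ordering step in contribution (3) can be illustrated with a minimal Python sketch. The abstract only says that queued URLs are sorted by the distance between a URL's hash value and the peer's hash value. The hash function (SHA-1), the circular distance on a 160-bit ring, and the names `hash_id`, `ring_distance`, and `sort_urls` are assumptions made here for illustration, not details taken from the thesis.

```python
import hashlib

# Assumption: identifiers live on a 160-bit ring, as in many DHT designs.
RING = 1 << 160


def hash_id(text: str) -> int:
    """Map a URL or a peer name to an integer identifier (SHA-1, assumed)."""
    return int.from_bytes(hashlib.sha1(text.encode("utf-8")).digest(), "big")


def ring_distance(a: int, b: int) -> int:
    """Shortest distance between two identifiers on the ring (assumed metric)."""
    d = abs(a - b)
    return min(d, RING - d)


def sort_urls(queue: list[str], peer_name: str) -> list[str]:
    """Order the to-be-crawled queue so that URLs whose hash is closest
    to this peer's hash come first."""
    peer_hash = hash_id(peer_name)
    return sorted(queue, key=lambda url: ring_distance(hash_id(url), peer_hash))


if __name__ == "__main__":
    # Hypothetical queue and peer name, for demonstration only.
    queue = [
        "http://example.com/a",
        "http://example.com/b",
        "http://example.org/c",
    ]
    for url in sort_urls(queue, "peer-1"):
        print(url)
```

Under these assumptions, each peer works first on the URLs nearest to its own identifier. This is one plausible reading of how the ordering could lower the chance that several peers fetch the same URL.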
Because the sortURL mechanism affects the discovery speed of topic-relevant pages, the thesis also considers an improved proposal, sortURL-Depth. In conclusion, topic-specific search based on Peer-to-Peer networks has broad application prospects, and the focused crawler, as its core part, has long been a hot spot in this area. To address this, the thesis proposes an approach for setting up a topic model and designs and develops PPSpider to improve the crawling algorithm. The experiment that mines peers' logs to choose local seeds shows that the improved starting URLs discover target pages more quickly. The results of another experiment show that the sortURL mechanism reduces the duplication rate without changing the peer's throughput or overhead.
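The topic-model step described in contribution (1) compares a peer's vector space model with the VSM of each topic's term vocabulary. A minimal sketch of such a comparison follows. It assumes raw term-frequency vectors and cosine similarity as the relevance measure; the abstract does not state which weighting or similarity the thesis uses, and the names `term_vector`, `cosine`, and `pick_topic` are hypothetical.

```python
import math
from collections import Counter


def term_vector(text: str) -> Counter:
    """Build a raw term-frequency vector from whitespace-separated words."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term vectors (assumed measure)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def pick_topic(peer_docs: list[str], vocabularies: dict[str, list[str]]):
    """Return the best-matching topic and all scores for one peer."""
    peer_vec = Counter()
    for doc in peer_docs:
        peer_vec.update(term_vector(doc))
    scores = {
        topic: cosine(peer_vec, Counter(terms))
        for topic, terms in vocabularies.items()
    }
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    # Hypothetical vocabularies and local documents, for demonstration only.
    vocabularies = {
        "sports": ["football", "match", "team"],
        "finance": ["stock", "market", "bank"],
    }
    docs = ["the team won the football match", "stock prices"]
    topic, scores = pick_topic(docs, vocabularies)
    print(topic, scores)
```

In this sketch the peer is assigned the single highest-scoring topic. A real system might instead keep the full score vector or apply a threshold.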
Keywords/Search Tags: Topic-specific Search, Peer-to-Peer Network, Focused Crawler, Starting URLs, Topic Model