The Design And Implementation Of Vertical Search Engine Based On Duplicated Web Pages Elimination

Posted on: 2013-03-06
Degree: Master
Type: Thesis
Country: China
Candidate: L L Zhao
Full Text: PDF
GTID: 2248330371497272
Subject: Management Science and Engineering
Abstract/Summary:
With the rapid development of the Internet, the number of web pages has grown steadily in recent years. However, a comprehensive (general-purpose) search engine cannot retrieve every web page on the Internet because of limits on storage, computing resources, bandwidth, and so on. Research on vertical search engines began in order to meet users' needs in a specific field and to improve the relevance and accuracy of search results. Compared with a comprehensive search engine, however, a vertical search engine is by its nature prone to crawling duplicated or near-duplicated web pages. As a result, existing duplicated web page elimination strategies show some defects when applied to vertical search engines.

Centering on the design and implementation of a vertical search engine, the paper first briefly introduces the current state of research on vertical search engines. It then analyzes the topic crawler and full-text retrieval in vertical search engines, providing a theoretical basis for the system design and implementation in the last section.

On duplicated web page elimination in vertical search engines, the paper first briefly introduces the causes and types of duplicated web pages, the significance of removing them, and the process and common algorithms of duplicate elimination. It then points out the shortcomings of the existing elimination algorithms used in vertical search engines: all of them ignore the particularity of vertical search engines and do not use the vertical search engine's own characteristics to remove duplicated web pages.
Thus, this paper combines the content-based topic crawler algorithm with the content-based duplicated web page elimination algorithm and puts forward a duplicate elimination strategy suited to vertical search engines. The strategy can filter duplicated or near-duplicated web pages and lighten the search engine's burden of post-processing and index construction. Several groups of related experiments verify the superiority of the proposed strategy.

The design and implementation of the vertical search engine in the last section apply the theories above. A vertical search engine on the theme of Chinese herbal medicines is designed on top of Solr Server. During implementation, the paper proposes a practical method for acquiring seed URLs and a topic dictionary related to Chinese herbal medicines, and improves the search engine's topic crawler in Java, ensuring that the engine filters duplicated or near-duplicated web pages while crawling.
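
The abstract does not spell out the combined strategy, so the following is only a minimal sketch of the general idea: a crawler-side filter that first checks content-based topic relevance and then rejects near-duplicates by content similarity. Everything in it is an assumption made for illustration: the names `DedupTopicFilter`, `topic_score`, `shingles` and `jaccard`, the thresholds `min_topic` and `max_sim`, the use of word shingles with Jaccard similarity, and the ASCII word tokenizer (Chinese pages would need a word segmenter). It is written in Python for brevity, whereas the thesis implementation uses Java and Solr.

```python
import re


def tokenize(text):
    """Lowercase ASCII word tokenizer (an assumption for this sketch)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def topic_score(tokens, topic_terms):
    """Fraction of tokens that appear in the topic dictionary."""
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in topic_terms) / len(tokens)


def shingles(tokens, k=3):
    """Set of k-word shingles; short texts yield one shingle (or none)."""
    if len(tokens) < k:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


class DedupTopicFilter:
    """Accept a page only if it is on-topic and not a near-duplicate
    of a page accepted earlier (hypothetical helper, not the thesis code)."""

    def __init__(self, topic_terms, min_topic=0.1, max_sim=0.8):
        self.topic_terms = set(topic_terms)
        self.min_topic = min_topic   # minimum topic relevance to keep a page
        self.max_sim = max_sim       # similarity at or above this is a duplicate
        self.seen = []               # shingle sets of accepted pages

    def accept(self, text):
        tokens = tokenize(text)
        # Step 1: content-based topic check, as a topic crawler would do.
        if topic_score(tokens, self.topic_terms) < self.min_topic:
            return False
        # Step 2: content-based near-duplicate check against accepted pages.
        page = shingles(tokens)
        if any(jaccard(page, old) >= self.max_sim for old in self.seen):
            return False
        self.seen.append(page)
        return True


if __name__ == "__main__":
    f = DedupTopicFilter({"ginseng", "herbal", "root"})
    pages = [
        "Ginseng root is a herbal medicine used in traditional remedies for energy",
        "GINSENG root is a herbal medicine used in traditional remedies for energy.",
        "Stock markets fell sharply today",
    ]
    for p in pages:
        print(f.accept(p), "-", p)
```

Running both checks in a single pass reflects the abstract's point that filtering at crawl time reduces later post-processing and indexing work. The linear scan over `seen` is kept for clarity only; a real crawler would need an index or fingerprint scheme to compare against many pages.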
Keywords/Search Tags: Vertical search engine, Topic crawler, Full-text retrieval, Duplicated web pages elimination