The Design And Implementation Of Vertical Search Engine Based On Duplicated Web Pages Elimination

Posted on:2013-03-06

Degree:Master

Type:Thesis

Country:China

Candidate:L L Zhao

Full Text:PDF

GTID:2248330371497272

Subject:Management Science and Engineering

Abstract/Summary:

With the rapid development of the Internet, the number of web pages has grown constantly in recent years. However, a comprehensive search engine cannot retrieve all the web pages on the Internet because of limits on storage, computing resources, bandwidth and so on. Research on vertical search engines began in order to satisfy users' requirements in a specific field and to improve the relevance and accuracy of search. Compared with a comprehensive search engine, however, a vertical search engine can, because of its own particularity, easily crawl duplicated or near-duplicated web pages, and existing duplicated web pages elimination strategies have some defects when applied to vertical search engines.

Centering on the design and implementation of a vertical search engine, the paper first briefly introduces the current state of research on vertical search engines, and then analyzes the topic crawler and full-text retrieval in a vertical search engine to provide a theoretical basis for the design and implementation of the system in the last section.

On duplicated web pages elimination in vertical search engines, the paper first briefly introduces the causes and types of duplicated web pages, the significance of removing them, and the process and common algorithms of duplicated web pages elimination. It then points out the shortcomings of these existing algorithms when used in vertical search engines: all of them ignore the particularity of the vertical search engine and do not use its own characteristics to remove duplicated web pages. Thus, this paper combines the content-based topic crawler algorithm with the content-based duplicated web pages elimination algorithm and puts forward a duplicated web pages elimination strategy suited to vertical search engines, which can filter duplicated or near-duplicated web pages and lighten the search engine's burden of post-processing and index construction. The superiority of the proposed strategy is verified by several groups of related experiments.

The design and implementation of the vertical search engine in the last section apply the theories above. Based on Solr Server, a vertical search engine on the theme of Chinese herbal medicines is designed. During implementation, this paper proposes a practically feasible method for acquiring seed URLs and a topic dictionary related to Chinese herbal medicines, and improves the search engine's topic crawler using Java, so that the engine filters duplicated or near-duplicated web pages while crawling.
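
The abstract does not give the details of the combined strategy, so the following Java sketch only illustrates the general idea of checking topic relevance and near-duplication in one pass at crawl time. Everything in it is an assumption for illustration and not the thesis's algorithm: the class name `TopicDedupFilter`, the dictionary-hit ratio used as a topic score, word shingles compared by Jaccard similarity, the thresholds, and whitespace tokenization (Chinese text would need a word segmenter instead).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch: topic relevance check plus shingle-based near-duplicate check. */
final class TopicDedupFilter {
    private final Set<String> topicTerms;   // assumed topic dictionary (lowercase terms)
    private final double minTopicScore;     // assumed relevance threshold
    private final double maxSimilarity;     // assumed near-duplicate threshold
    private final int shingleSize;          // number of words per shingle
    private final List<Set<String>> accepted = new ArrayList<>();

    TopicDedupFilter(Set<String> topicTerms, double minTopicScore,
                     double maxSimilarity, int shingleSize) {
        this.topicTerms = topicTerms;
        this.minTopicScore = minTopicScore;
        this.maxSimilarity = maxSimilarity;
        this.shingleSize = shingleSize;
    }

    /** Whitespace tokenizer; an assumption that only suits space-delimited text. */
    static String[] tokenize(String text) {
        String trimmed = text.toLowerCase().trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    /** Fraction of tokens that appear in the topic dictionary. */
    double topicScore(String text) {
        String[] tokens = tokenize(text);
        if (tokens.length == 0) {
            return 0.0;
        }
        int hits = 0;
        for (String token : tokens) {
            if (topicTerms.contains(token)) {
                hits++;
            }
        }
        return (double) hits / tokens.length;
    }

    /** Set of all runs of k consecutive words in the text. */
    static Set<String> shingles(String text, int k) {
        String[] tokens = tokenize(text);
        Set<String> result = new HashSet<>();
        for (int i = 0; i + k <= tokens.length; i++) {
            result.add(String.join(" ", Arrays.copyOfRange(tokens, i, i + k)));
        }
        return result;
    }

    /** Jaccard similarity: size of intersection divided by size of union. */
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0;
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        int union = a.size() + b.size() - intersection.size();
        return (double) intersection.size() / union;
    }

    /** Returns true if the page is on topic and not a near-duplicate of an accepted page. */
    boolean accept(String text) {
        if (topicScore(text) < minTopicScore) {
            return false; // off topic: skip before any duplicate comparison
        }
        Set<String> candidate = shingles(text, shingleSize);
        for (Set<String> seen : accepted) {
            if (jaccard(candidate, seen) >= maxSimilarity) {
                return false; // near-duplicate of a page already kept
            }
        }
        accepted.add(candidate);
        return true;
    }
}
```

In this sketch a crawler would call `accept` on each fetched page's extracted text and index only the pages for which it returns true. The pairwise comparison is linear in the number of accepted pages, so a real crawler would more likely use fingerprints or hashing to keep the check cheap.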

Keywords/Search Tags:

Vertical search engine, Topic crawler, Full-text retrieval, Duplicated web pages elimination

Related items

1	Vertical Search Engine Based Public Opinion Alert And Analysis Platform
2	A Vertical Search Engine In The Field Of News
3	Research And Implementation Of Vertical Search Engine
4	Design And Implementation Of Vertical Search Engine In The Field Of Medical Device
5	Research And Realization On Focused Crawler Key Technologies Of Vertical Search Engine
6	Design And Implementation Of Vertical Search Engine System For Recruitment
7	Research Of Intranet Information Supervision System Based On Net Crawler And Full-text Search Engine
8	Research And Realization Of Financial Topic Vertical Search Engine
9	Design And Implementation Of Vertical Search Engine Based On Web Crawler
10	PageRank Algorithm Based On Chinese Research And Application Of Vertical Search Engine