
Research On Vertical Search Engine

Posted on: 2011-02-14
Degree: Master
Type: Thesis
Country: China
Candidate: P Wang
GTID: 2178360305983032
Subject: Computer application technology
Abstract/Summary:
As the number of web pages grows, general search engines return results with low accuracy and coverage when a user searches for specific information: the returned content is not detailed enough and contains too much noise. Maintaining a huge index library of web pages is especially difficult, and information collection and storage face severe challenges. Vertical search engines offset these shortcomings of general search engines; their greatest strengths are that they are precise, accurate, and deep.

The main work of this thesis is as follows:

1. A complete description of the vertical search engine, including the system architecture (a web spider, indexer, crawler, and user interface), the distribution features of topics, inverted index creation, and basic techniques such as Chinese word segmentation.
2. An analysis and study of web page crawling and parsing, topic determination, and the collection and purification of web pages, along with their operating principles. An algorithm for the elimination of duplicated pages is optimized and implemented.
3. Implementation of a small vertical search engine, built mainly on the Lucene development kit. The web spider parses various types of documents, including text, HTML, Word, PDF, and other formats, and extracts topic-related information from the parsed documents. The implemented modules include Chinese word segmentation, the indexer, and the searcher.
4. An improved algorithm for eliminating web pages with duplicated content is proposed and implemented. It improves on the traditional feature-word-based algorithm. Because pages are reproduced, the same content appears at different URLs, which produces a large amount of duplicate content. The improved algorithm uses a main code and a secondary code, so that the feature signature reflects the page content and is convenient to compute. The main code is a signature of the structural information of the page's text paragraphs, and the secondary code identifies the content of the page, so the algorithm uses both structure and content information to eliminate duplicate pages. The algorithm greatly improves the efficiency of eliminating duplicate pages.
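The main-code and secondary-code idea can be illustrated with a minimal Python sketch. The abstract does not specify how either code is computed, so everything here is an assumption made for illustration: the main code is taken to be the paragraph count plus coarse paragraph-length buckets, the secondary code is a hash of the most frequent words, and the names main_code, secondary_code, and deduplicate are hypothetical. Word splitting uses a simple regular expression; Chinese text would need a real word segmenter instead.

```python
import hashlib
import re
from collections import Counter


def split_paragraphs(text):
    """Split page text into non-empty paragraphs (blank-line separated)."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def main_code(paragraphs):
    """Assumed structure signature: paragraph count plus length buckets."""
    buckets = "".join(str(min(len(p) // 50, 9)) for p in paragraphs)
    return f"{len(paragraphs)}:{buckets}"


def secondary_code(paragraphs, top_k=5):
    """Assumed content signature: hash of the top_k most frequent words."""
    words = re.findall(r"\w+", " ".join(paragraphs).lower())
    counts = Counter(words)
    # Sort by descending count, then alphabetically, so ties are deterministic.
    top = sorted(counts, key=lambda w: (-counts[w], w))[:top_k]
    return hashlib.sha256(" ".join(top).encode("utf-8")).hexdigest()


def deduplicate(pages):
    """Return the URLs to keep, given a dict mapping URL to page text.

    Pages are first grouped by main code; secondary codes are compared
    only inside a group, so most pages never meet in a comparison.
    """
    seen = {}  # main code -> set of secondary codes already kept
    kept = []
    for url, text in pages.items():
        paragraphs = split_paragraphs(text)
        group = seen.setdefault(main_code(paragraphs), set())
        content = secondary_code(paragraphs)
        if content in group:
            continue  # same structure and same content: treat as duplicate
        group.add(content)
        kept.append(url)
    return kept


if __name__ == "__main__":
    article = ("Lucene builds an inverted index.\n\n"
               "The index maps terms to documents.")
    pages = {
        "http://a.example/1": article,
        "http://b.example/copy": article,  # reproduced at another URL
        "http://c.example/2": "A spider fetches pages.\n\nIt follows links.",
    }
    print(deduplicate(pages))
```

In this sketch the third page happens to share the main code of the first (two short paragraphs), but its secondary code differs, so it is kept. Only the reproduced copy is dropped.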
Keywords/Search Tags: Chinese word segmentation, Lucene, feature series, elimination of duplicated pages