Nutch-Based Topical Crawler and Its Implementation

Posted on: 2008-03-19  Degree: Master  Type: Thesis
Country: China  Candidate: X K Su  Full Text: PDF
GTID: 2208360212486536  Subject: Computer software and theory
Abstract/Summary:
A search engine is a system that collects and organizes Web information resources and makes them available for querying. In practice, it automatically gathers Web information, then classifies, indexes, and stores it in a database so that it can be returned to users in response to queries. The rapid growth of Web information poses an unprecedented challenge to general-purpose search engines, and more and more users want to find the information they need quickly and efficiently. A topic-based (topical) search engine aims to build a specialized repository of Web information on a specific subject: it focuses on retrieving pages relevant to that subject and uses a filtering mechanism to discard irrelevant pages. When query results are ranked, highly relevant pages should be given higher priority. Building on the open-source Nutch framework, this thesis implements a topical search engine whose goal is to collect pages relevant to a predetermined subject, rather than collecting and indexing every available page as a general search engine does. Such an engine avoids visiting irrelevant pages and thereby saves hardware and network resources. The basic idea of the Nutch-based topical search engine is to analyze each fetched page before indexing and judge it against a feature dictionary obtained during training. If the page belongs to the predefined topic, it is retained and subsequently indexed; if not, it is discarded to avoid wasting storage. The thesis focuses on how to design and implement a search engine for a specific topic, and it improves on page analysis and the topic-identification algorithm, among other aspects.
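The filtering step described in the abstract (judge each fetched page against a dictionary obtained from training, then keep or discard it before indexing) can be illustrated with a minimal sketch. This is not the thesis's implementation: the abstract does not state which distance measure or threshold is used, so cosine distance and the value 0.6 are assumptions, the regular-expression tokenizer is a stand-in for the Chinese lexical analysis step, and all function names (`tokenize`, `train_dictionary`, `cosine_distance`, `is_on_topic`) are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Stand-in for the Chinese lexical analysis step (assumption):
    # lowercases the text and extracts runs of ASCII letters.
    return re.findall(r"[a-z]+", text.lower())


def train_dictionary(training_texts, size=20):
    # Build a dictionary of the most frequent terms in the topic's
    # training texts, weighted by relative frequency.
    counts = Counter()
    for text in training_texts:
        counts.update(tokenize(text))
    top = counts.most_common(size)
    total = sum(count for _, count in top)
    return {term: count / total for term, count in top}


def cosine_distance(dictionary, page_text):
    # Distance between the page's term-count vector and the dictionary
    # vector: 0.0 means same direction, 1.0 means no shared terms.
    page = Counter(tokenize(page_text))
    norm_page = math.sqrt(sum(c * c for c in page.values()))
    norm_dict = math.sqrt(sum(w * w for w in dictionary.values()))
    if norm_page == 0 or norm_dict == 0:
        return 1.0  # an empty page or empty dictionary is treated as off-topic
    dot = sum(weight * page[term] for term, weight in dictionary.items())
    return 1.0 - dot / (norm_dict * norm_page)


def is_on_topic(dictionary, page_text, threshold=0.6):
    # Keep the page for indexing only if it is close enough to the topic.
    # The threshold is an assumed value, not one taken from the thesis.
    return cosine_distance(dictionary, page_text) <= threshold


if __name__ == "__main__":
    training = [
        "nutch crawler fetches web pages",
        "search engine index crawler",
        "crawler index pages search",
    ]
    topic = train_dictionary(training)
    for page in [
        "the crawler fetches pages and builds a search index",
        "recipe for chocolate cake with butter and sugar",
    ]:
        print(is_on_topic(topic, page), page)
```

In a crawler of this kind, a check like `is_on_topic` would run between fetching and indexing, so that off-topic pages never reach the index. Because the page vector includes every term on the page, a page dominated by unrelated vocabulary is pushed away from the topic even if it mentions a few dictionary terms.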
Keywords/Search Tags: Nutch-0.7.1, topical crawler, Chinese lexical analysis, training text, distance classification