Research And Implementation Of The Web Spider Of The Subject-Oriented Search Engine

Posted on: 2014-01-18
Degree: Master
Type: Thesis
Country: China
Candidate: H S Chu
Full Text: PDF
GTID: 2248330398471588
Subject: Computer technology
Abstract/Summary:
As an information retrieval technology, the search engine is widely used today, has a bright future, and has become a new point of economic growth. Search engines can be divided into two categories: traditional search engines and subject-oriented search engines. A traditional search engine, also called a universal search engine, works only by matching keywords. Universal search engines cannot meet users' steadily growing demands, so subject-oriented search engine technology has emerged and become one of the research focuses of the Internet industry.

Compared with the spider of a universal search engine, the web crawler of a subject-oriented search engine fetches pages that are more closely targeted at a specific category. Such a crawler usually works in one of two ways. The first, called a vertical search spider, gathers information broadly without restriction and then extracts the desired links from the results; these links point to the specific category and become the targets of the next crawl. The second uses links from the specific category as its first batch of links, called seeds, and fetches web pages in a certain order. Finally, the information crawled from the web pages is transformed into structured information and stored in a database.

This paper covers the following aspects:
1. This paper draws on a variety of relevant research from recent years, introduces the current state of search technology, and summarizes the mainstream techniques.
2. This paper designs a new vertical search spider based on the Bayes algorithm from a new software perspective and performs requirements analysis based on the features of the subject-oriented search engine. Building on the Heritrix framework, it also presents the detailed design. The new system features high scalability and loose coupling between modules. By configuring the crawling rules of the vertical search spider, users can have it crawl web pages as they wish and obtain structured information.
3. Using the naive Bayes classification algorithm, the paper builds a preliminary classification model, designs the classification module, and finally implements a Bayes-based text classifier.
4. This paper implements the subject-oriented web spider in code and, by crawling web pages and analyzing the search results, tests and verifies the search accuracy and practicality of the subject-oriented search engine.
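
The seed-based crawling approach described in the abstract can be illustrated with a minimal Java sketch. This is not the thesis's implementation and does not use the Heritrix API. The class name `SeedCrawlSketch`, the in-memory link map standing in for fetched pages, and the URL-substring relevance test are all assumptions made for illustration only.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;

/** Hypothetical sketch of seed-based focused crawling over an in-memory link graph. */
public class SeedCrawlSketch {

    /**
     * Visits pages breadth-first, starting from the seeds. A page is kept,
     * and its out-links queued, only if the page is judged on-topic.
     * The links map is a stand-in for fetching and parsing real pages.
     */
    public static List<String> crawl(List<String> seeds,
                                     Map<String, List<String>> links,
                                     Predicate<String> onTopic,
                                     int maxPages) {
        Deque<String> frontier = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        for (String seed : seeds) {
            if (seen.add(seed)) {
                frontier.add(seed); // enqueue each distinct seed once
            }
        }
        List<String> kept = new ArrayList<>();
        while (!frontier.isEmpty() && kept.size() < maxPages) {
            String url = frontier.poll();
            if (!onTopic.test(url)) {
                continue; // off-topic page: drop it and do not expand its links
            }
            kept.add(url);
            for (String next : links.getOrDefault(url, List.of())) {
                if (seen.add(next)) {
                    frontier.add(next); // first time this link is seen
                }
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Toy link graph; the URLs are invented for this example.
        Map<String, List<String>> links = Map.of(
            "news/sports", List.of("news/sports/football", "news/finance"),
            "news/sports/football", List.of("news/sports/football/cup"),
            "news/finance", List.of("news/finance/stocks"));
        List<String> pages = crawl(List.of("news/sports"), links,
                                   url -> url.contains("sports"), 10);
        System.out.println(pages);
    }
}
```

In a real crawler the relevance predicate would more plausibly be a text classifier applied to page content rather than a substring check on the URL; the predicate parameter is there so that such a classifier could be plugged in.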
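
As a rough illustration of the Bayes-based text classifier mentioned in aspect 3, the following Java sketch implements a multinomial naive Bayes classifier with add-one (Laplace) smoothing. The abstract does not specify the thesis's model details, so the smoothing choice, the whitespace tokenizer, the class name `NaiveBayesSketch`, and the training sentences are all assumptions.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical multinomial naive Bayes text classifier with add-one smoothing. */
public class NaiveBayesSketch {
    private final Map<String, Integer> docsPerLabel = new HashMap<>();
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    private final Map<String, Integer> wordsPerLabel = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    /** Simplistic tokenizer (an assumption): lower-case, split on whitespace. */
    private static String[] tokenize(String text) {
        return text.toLowerCase(Locale.ROOT).trim().split("\\s+");
    }

    /** Records one labeled training document. */
    public void train(String label, String text) {
        totalDocs++;
        docsPerLabel.merge(label, 1, Integer::sum);
        Map<String, Integer> counts =
            wordCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String word : tokenize(text)) {
            counts.merge(word, 1, Integer::sum);
            wordsPerLabel.merge(label, 1, Integer::sum);
            vocabulary.add(word);
        }
    }

    /** Returns the label with the highest log posterior, or null if untrained. */
    public String classify(String text) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : docsPerLabel.keySet()) {
            // log prior: fraction of training documents carrying this label
            double score = Math.log(docsPerLabel.get(label) / (double) totalDocs);
            Map<String, Integer> counts = wordCounts.get(label);
            double denom = wordsPerLabel.get(label) + vocabulary.size();
            for (String word : tokenize(text)) {
                // add-one smoothing keeps unseen words from zeroing the score
                score += Math.log((counts.getOrDefault(word, 0) + 1) / denom);
            }
            if (score > bestScore) {
                bestScore = score;
                best = label;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        NaiveBayesSketch nb = new NaiveBayesSketch();
        nb.train("sports", "football match goal team");
        nb.train("sports", "team wins the cup match");
        nb.train("finance", "stock market shares fall");
        nb.train("finance", "bank raises interest rate");
        System.out.println(nb.classify("football team wins"));
        System.out.println(nb.classify("stock market"));
    }
}
```

Working in log space avoids numeric underflow when many word probabilities are multiplied together, which matters for page-length texts.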
Keywords/Search Tags: Bayes algorithm, Web spider, Heritrix, Java, Search engine