Research and Implementation of a Subject-Oriented Crawling Search Strategy

Posted on:2013-01-15

Degree:Master

Type:Thesis

Country:China

Candidate:M X Wang

Full Text:PDF

GTID:2218330374961938

Subject:Computer software and theory

Abstract/Summary:

The rapid growth of the Web has produced an ever-increasing volume of information. This volume causes traditional general-purpose search engines to suffer from problems such as low coverage, high resource consumption, slow updating, and low user satisfaction. To overcome these deficiencies and meet the query needs of specific users in specific domains, vertical search engines (also called topic-specific search engines) were developed. They provide search services with finer and more precise classification and more comprehensive data, and they have become a new direction for search engines.

Focused crawling builds on traditional search engine technology, in which the crawler crawls entire websites. It applies machine learning techniques to control which pages the crawler fetches, so that the crawler downloads as many pages related to the designated topics as possible.

Current research on topic-focused crawling concentrates on two main problems. One is how to classify web page text automatically, that is, how to determine whether a page is related to the designated topic. The other is which crawling strategy downloads as many relevant pages as possible, avoids downloading pages unrelated to the topic, and improves the coverage of the designated topic.

In this thesis, we analyze the search strategies of web crawlers, which are the core technology of focused crawling, the distribution of topic-specific pages on the Web, and the algorithms for determining relevance. Based on this analysis, we propose a subject-oriented web crawler framework, whose main modules are described in detail. We then implement a subject-oriented web crawler based on Weblech. The idea behind this crawler is to train a classification algorithm on a corpus to obtain a Naive Bayesian classifier. The classifier analyzes each fetched page; pages that are relevant to the topic are stored for index building, and the rest are discarded. The implementation is simple, and it improves both the crawling speed and the recall rate.

This dissertation discusses the technologies involved in implementing a focused crawler based on Weblech and a Naive Bayes classifier. The main research work is summarized as follows:

(1) A new web search strategy is proposed. The main idea is to divide web pages into hub pages and content pages. When the crawler encounters a hub page, it simply ignores it, which makes fetching more efficient and addresses the tunnel phenomenon. As a result, the coverage and relevance of the crawler are improved.

(2) We study the principles of the naive Bayesian classification algorithm and the vector space model. We propose a new classification method that uses weighted features from the LDA topic model to improve the accuracy of web page classification and the efficiency of the naive Bayesian classifier.

(3) We propose a focused crawler architecture and describe its modules and their implementation. Based on this architecture, we implement a focused crawler and use it to validate the architecture. The results show that the architecture achieves good search performance.
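The classify-then-filter step described in the abstract can be illustrated with a minimal sketch. It assumes a plain multinomial Naive Bayes model with add-one smoothing over bag-of-words features. It is not the thesis implementation, which is built on Weblech, and it does not include the LDA-weighted features. The class `NaiveBayes`, the function `filter_pages`, the tiny training corpus, and the example URLs are all hypothetical names and data chosen for illustration.

```python
import math
import re
from collections import Counter

TOKEN = re.compile(r"[a-z]+")


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return TOKEN.findall(text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.doc_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        for text, label in zip(docs, labels):
            self.word_counts[label].update(tokenize(text))
        # Vocabulary is the union of all words seen during training.
        self.vocab = set().union(*self.word_counts.values())
        return self

    def log_score(self, text, label):
        """Log prior plus summed log likelihoods of the known words."""
        total_words = sum(self.word_counts[label].values())
        total_docs = sum(self.doc_counts.values())
        score = math.log(self.doc_counts[label] / total_docs)
        for word in tokenize(text):
            if word in self.vocab:  # words unseen in training are skipped
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
        return score

    def predict(self, text):
        return max(self.classes, key=lambda c: self.log_score(text, c))


def filter_pages(classifier, pages, topic_label="relevant"):
    """Keep pages classified as on-topic (for indexing); discard the rest."""
    return {url: text for url, text in pages.items()
            if classifier.predict(text) == topic_label}


# Hypothetical training corpus: three on-topic and three off-topic documents.
train_docs = [
    "web crawler downloads pages and follows links",
    "search engine index ranks crawled pages",
    "focused crawler classifies pages by topic",
    "recipe for chicken soup with noodles",
    "football match ended with a late goal",
    "cheap flights and hotel deals for summer",
]
train_labels = ["relevant"] * 3 + ["irrelevant"] * 3

clf = NaiveBayes().fit(train_docs, train_labels)

# Hypothetical fetched pages, keyed by URL.
fetched = {
    "http://example.com/a": "the crawler follows links and downloads pages for the index",
    "http://example.com/b": "chicken noodles recipe and soup",
}
kept = filter_pages(clf, fetched)
print(sorted(kept))
```

In a real crawler the training corpus would be far larger, and the kept pages would be passed to the indexing module instead of being returned in a dictionary.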

Keywords/Search Tags:

Topic crawler, Weblech, naive Bayesian classification, text classification, LDA topic model

Related items

1	Research On Domain-Specific Web Information Collection And Topic Detection And Its Application
2	The Research And Implement Of Topic-focused Web Crawler Based On SVM Classification Algorithm
3	Web News Gathering Based On Hierarchical Topic Model
4	Research On The Key Technology Of Focused Crawler
5	Research And Design Of Topic Crawler Through Tunnels Algorithm
6	Research And Analysis Of Micro-Blog’s False Topic Based On Bayesian Model
7	The Design And Research Of Topic Web Crawler In Vertical Search Engine
8	The Design And Implementation Of Topic Web Crawler About Mining Equipment
9	Application Research On Event Driven And Protocol Driven Of Given Field Oriented Of Topic Crawler
10	Design And Implementation Of Focused Crawler For Blogs