
Research and Implementation of a Subject-Oriented Crawling Search Strategy

Posted on: 2013-01-15
Degree: Master
Type: Thesis
Country: China
Candidate: M X Wang
Full Text: PDF
GTID: 2218330374961938
Subject: Computer software and theory
Abstract/Summary:
The rapid development of the Web keeps increasing the amount of information available on it. This effectively unbounded volume causes traditional general-purpose search engines to run into problems such as low coverage, high resource consumption, slow updating, and low user satisfaction. To overcome these deficiencies and meet the query needs of specific users in specific areas, the vertical search engine, or topic-specific search engine, was developed. It provides a search service with more detailed and more precise classification and more comprehensive data, and it has become a new direction for search engines.

Focused crawling builds on traditional search engine technology, in which the crawler crawls entire websites. It applies machine learning to control what the crawler fetches, so that the crawling program retrieves as many web pages related to the designated topics as possible.

Research on topic-focused crawling currently concentrates on two questions. The first is how to categorize web page text automatically, that is, how to determine whether a page relates to the designated topic. The second is which crawling strategy downloads as many relevant pages as possible, avoids downloading pages unrelated to the topic, and improves coverage of the designated topics.

In this thesis, we analyze the search strategies of web crawlers, which are the core technology of focused crawling, the distribution of topic-specific pages on the Web, and the algorithms for determining relevance. Based on this analysis, we propose a subject-oriented web crawler framework and describe its main modules in detail. We then implement a subject-oriented web crawler based on Weblech. The idea behind this crawler is to train a classification algorithm on a corpus to obtain a naive Bayesian classifier. The crawler analyzes each fetched page, stores it for indexing if it is relevant to the topic, and discards it otherwise. The implementation is simple, and it improves crawling speed and recall.

This dissertation discusses the technologies involved in implementing a focused crawler based on Weblech and a naive Bayes classifier. The main research contributions are summarized as follows:

(1) A new web search strategy is proposed. Its main idea is to divide web pages into hub pages and content pages. When the crawler encounters a hub page, it simply disregards it, which makes crawling more efficient and addresses the tunneling problem. As a result, the coverage and relevance of the crawler are improved.

(2) We study the principles of the naive Bayesian classification algorithm and the vector space model. We propose a new classification method that uses weighted features from an LDA topic model to improve web page classification accuracy and the efficiency of the naive Bayesian classifier.

(3) We propose a focused crawler architecture and introduce its modules and their implementation technology. Based on this architecture, we implement a focused crawler and use it to verify the architecture. The results show that the focused crawler architecture achieves good search performance.
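
The store-or-discard decision described above can be illustrated with a minimal sketch. It is not the thesis's Weblech-based implementation: the class NaiveBayesTopicFilter, the functions is_hub_page and handle_page, the link-density threshold, and the toy training sentences are all hypothetical names and values chosen for illustration. The abstract does not say how hub pages are detected, so the link-to-word ratio used here is only an assumed heuristic, and treating "disregard" as "skip without classifying" is one possible reading. The LDA-based feature weighting is not shown.

```python
import math
from collections import Counter


class NaiveBayesTopicFilter:
    """Two-class multinomial Naive Bayes with add-one (Laplace) smoothing.

    Illustrative only; assumes each class receives at least one
    training document before any page is scored.
    """

    LABELS = ("relevant", "irrelevant")

    def __init__(self):
        self.word_counts = {label: Counter() for label in self.LABELS}
        self.doc_counts = {label: 0 for label in self.LABELS}
        self.vocab = set()

    def train(self, text, label):
        # Whitespace tokenization keeps the sketch short; a real crawler
        # would strip markup and use a proper tokenizer.
        tokens = text.lower().split()
        self.word_counts[label].update(tokens)
        self.doc_counts[label] += 1
        self.vocab.update(tokens)

    def log_score(self, text, label):
        # log P(label) + sum over tokens of log P(token | label)
        total_docs = sum(self.doc_counts.values())
        score = math.log(self.doc_counts[label] / total_docs)
        total_words = sum(self.word_counts[label].values())
        denominator = total_words + len(self.vocab)
        for token in text.lower().split():
            count = self.word_counts[label][token]
            score += math.log((count + 1) / denominator)
        return score

    def is_relevant(self, text):
        return self.log_score(text, "relevant") > self.log_score(text, "irrelevant")


def is_hub_page(num_links, num_words, max_links_per_word=0.2):
    """Assumed heuristic: a page dominated by links counts as a hub page."""
    if num_words == 0:
        return num_links > 0
    return num_links / num_words > max_links_per_word


def handle_page(classifier, text, num_links):
    """Return 'skip' for hub pages, else 'index' or 'discard' by relevance."""
    if is_hub_page(num_links, len(text.split())):
        return "skip"
    return "index" if classifier.is_relevant(text) else "discard"


# Toy corpus (hypothetical); a real corpus would be far larger.
clf = NaiveBayesTopicFilter()
clf.train("web crawler search engine index", "relevant")
clf.train("focused crawler topic search", "relevant")
clf.train("football match score goal", "irrelevant")
clf.train("recipe cooking dinner sauce", "irrelevant")

decision_on_topic = handle_page(clf, "topic crawler search", num_links=0)
decision_off_topic = handle_page(clf, "football goal score", num_links=0)
decision_hub = handle_page(clf, "topic crawler search", num_links=50)
```

Scores are summed in log space to avoid underflow on long pages, and add-one smoothing keeps a word unseen in one class from forcing that class's probability to zero.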
Keywords/Search Tags: Topic crawler, Weblech, naive Bayesian classification, text classification, LDA topic model