Research And Application On The Key Technology Of Focused Crawler

Posted on: 2016-03-19
Degree: Master
Type: Thesis
Country: China
Candidate: Q Chen
Full Text: PDF
GTID: 2298330452965359
Subject: Control Science and Engineering
Abstract/Summary:
With the rapid development of the Internet, the number of web pages is growing at a tremendous speed, and people urgently need to find useful information accurately and quickly among massive web resources. It is therefore important for a search engine to collect topic-relevant web pages and return retrieval results to the user as quickly as possible. This thesis analyzes the need for research on focused crawlers and concentrates on two problems: topic relevance analysis of downloaded pages and topic-guided search strategy. On this basis, it designs a Focused Crawler System consisting of seven main modules: page downloading module, content extracting module, topic distinguishing module, link extracting module, link value predicting module, scheduling module, and topic page storing module. The concrete work is as follows:

(1) Proposes a page content extraction algorithm based on block density and punctuation. The body content of a page follows certain distribution rules. This thesis analyzes these rules in detail and proposes an efficient content extraction method that can be applied to almost all web pages.

(2) Based on the features of text networks, proposes a Chinese keyword extraction method using a semantically weighted network. With this method, a category keyword set is built from the training corpus. A Naive Bayes classifier based on the category keyword set is then designed to determine whether a page is relevant to the topic.

(3) Through analysis of link characteristics and web page distribution, proposes an improved link searching strategy based on link content evaluation.

(4) Based on the proposed methods, designs and implements a Focused Crawler System in Java. Several experiments run on this system verify the feasibility of the methods, and the results indicate that they work well.
Keywords/Search Tags: Focused Crawler, relativity distinguishing, searching strategy, content extraction, keyword extraction
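
Item (1) of the abstract names block density and punctuation as the two signals for locating the page body, but gives no formula. The Java sketch below is one possible reading, not the thesis's algorithm: it assumes the page has already been reduced to text lines with tags stripped, measures density as the average character count in a small window of neighboring lines, and keeps a line only if that density and its own punctuation count both pass a threshold. The class name `BlockDensityExtractor` and the constants `RADIUS`, `MIN_DENSITY`, and `MIN_PUNCT` are hypothetical values chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; names and thresholds are assumptions, not taken from the thesis. */
class BlockDensityExtractor {
    static final int RADIUS = 1;            // assumed: block = the line plus one neighbor on each side
    static final double MIN_DENSITY = 20.0; // assumed: minimum average characters per line in the block
    static final int MIN_PUNCT = 1;         // assumed: minimum punctuation marks in the line itself

    // Western marks followed by their full-width Chinese counterparts (written as Unicode escapes).
    static final String PUNCT = ",.;!?\uFF0C\u3002\uFF1B\uFF01\uFF1F";

    /** Counts sentence punctuation characters in a string. */
    static int punctuationCount(String s) {
        int n = 0;
        for (int i = 0; i < s.length(); i++) {
            if (PUNCT.indexOf(s.charAt(i)) >= 0) {
                n++;
            }
        }
        return n;
    }

    /** Average trimmed line length over the window of lines centered on index i. */
    static double blockDensity(List<String> lines, int i) {
        int from = Math.max(0, i - RADIUS);
        int to = Math.min(lines.size() - 1, i + RADIUS);
        int chars = 0;
        for (int k = from; k <= to; k++) {
            chars += lines.get(k).trim().length();
        }
        return (double) chars / (to - from + 1);
    }

    /** Keeps lines that sit in a dense block and contain sentence punctuation. */
    static List<String> extract(List<String> lines) {
        List<String> body = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            String text = lines.get(i).trim();
            if (blockDensity(lines, i) >= MIN_DENSITY && punctuationCount(text) >= MIN_PUNCT) {
                body.add(text);
            }
        }
        return body;
    }
}
```

Under these assumptions, short navigation entries such as "Home" or "Login" fail the punctuation test even when they sit next to the body, while full sentences in a run of long lines pass both tests. Real pages would need the thresholds tuned against sample data.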