
Research on a Parallel Block-Processing Focused Crawler Based on URL and Context

Posted on: 2013-02-02  Degree: Master  Type: Thesis
Country: China  Candidate: J Wang
GTID: 2248330395450199  Subject: Computer application technology
Abstract/Summary:
Combining ontology analysis, network topology analysis, and tunneling technology to improve existing algorithms, I propose a parallel block-processing focused crawler based on URL and context. The algorithm expands the topic keywords using HowNet and uses the expanded set as the input to the topic-relevance analysis. The relevance analysis divides each page into multiple data blocks, each containing one link, and analyzes the links in all blocks in parallel with respect to both network structure and text context. The network-topology score is derived from the similarity between the structure of the link and that of its parent link. The context score of a link is calculated from the frequency and position of topic keywords in the text surrounding the link. The total score combines the two, with an adjustable parameter a balancing the effect of each factor. A link is judged to be relevant to the topic only if its total score exceeds a given threshold. The algorithm also implements an improved tunnel-filtering technique to maintain the harvest rate: when the tunnel depth is greater than 4, links whose score falls below a given threshold are filtered out to avoid irrelevant pages. Experimental results show that the parallel block-processing focused crawler based on URL and context is flexible and accurate, and that it performs best when the maximum crawl depth is 4 to 6.
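The scoring and filtering steps described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the actual formulas, so everything here is an assumption made for illustration: the linear blend `a * topology + (1 - a) * context`, the URL-segment Jaccard similarity standing in for the network-topology score, the keyword-frequency ratio standing in for the context score (it ignores keyword position), the use of a single threshold for both relevance and tunnel filtering, and all names (`LinkBlock`, `url_similarity`, `context_score`, `total_score`, `should_follow`, `score_blocks`).

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass
class LinkBlock:
    """One page block holding a single link and its surrounding text (hypothetical)."""
    url: str
    context: str


def url_segments(url: str) -> set:
    """Host plus non-empty path segments of a URL."""
    parsed = urlparse(url)
    return {parsed.netloc, *[s for s in parsed.path.split("/") if s]}


def url_similarity(url: str, parent_url: str) -> float:
    """Placeholder topology score: Jaccard similarity of URL segments."""
    a, b = url_segments(url), url_segments(parent_url)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def context_score(context: str, keywords: set) -> float:
    """Placeholder context score: share of context words that are topic keywords."""
    words = context.lower().split()
    if not words:
        return 0.0
    return sum(1 for w in words if w in keywords) / len(words)


def total_score(topology: float, context: float, a: float) -> float:
    """Assumed linear blend; a weights topology, (1 - a) weights context."""
    return a * topology + (1.0 - a) * context


def should_follow(score: float, tunnel_depth: int, threshold: float,
                  max_tunnel_depth: int = 4) -> bool:
    """Follow relevant links always; follow low-scoring links only while
    the tunnel depth has not exceeded max_tunnel_depth."""
    if score > threshold:
        return True
    return tunnel_depth <= max_tunnel_depth


def score_blocks(blocks, parent_url: str, keywords: set, a: float = 0.5) -> dict:
    """Score every block of a page in parallel; returns {url: total score}."""
    def score(block: LinkBlock):
        topo = url_similarity(block.url, parent_url)
        ctx = context_score(block.context, keywords)
        return block.url, total_score(topo, ctx, a)

    with ThreadPoolExecutor() as pool:
        return dict(pool.map(score, blocks))
```

A thread pool is used only to show the per-block parallelism; the thesis may parallelize differently. Setting `a` closer to 1 makes the sketch favor structural similarity, while values closer to 0 favor the surrounding text.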
Keywords/Search Tags: Focused Crawler, Link Analysis, Ontology, Parallel Processing