The Design And Analysis Of Focused Crawler Based On Dynamic Conceptual Graph

Posted on: 2014-04-23   Degree: Master   Type: Thesis
Country: China   Candidate: D W Zhang
GTID: 2268330401483019   Subject: Computer software and theory
Abstract/Summary:
The advent of the information age has caused the amount of information on the internet to grow exponentially, so how to extract useful information from the internet efficiently has become an important topic in network information retrieval. Common search engines struggle to cope with the mass of data and the explosive growth of the internet. At the same time, internet users demand more comprehensive data and faster data updates. What they care about is not merely isolated keywords but a specific theme or domain. This is why the focused crawler has appeared. The focused crawler is an important part of a search engine. Its design goal is to crawl as many relevant pages as possible and exclude as many irrelevant pages as possible, so as to save network bandwidth and memory space and to improve the crawling efficiency and coverage rate of the crawler. Starting from the characteristics of the focused crawler, this paper studies it in detail. The work is as follows:

1. Based on two basic characteristics of web pages, this paper puts forward a new algorithm that separates web pages into blocks using only detected separator bars. Using the concept of relative layout, it solves the problem of how to position page blocks relative to one another when the heights of some of them are unknown. By limiting the total number of blocks, the string length of the current node, and its width and height, the algorithm determines whether a page block can be separated further. By using the uniformity characteristic of web pages before separating a page, the algorithm improves efficiency greatly. It separates a page directly by detecting separator bars, which avoids the many extractions from a single node that are needed when node character sequences are applied and greatly increases blocking efficiency. It is a top-down approach and very efficient.

2. This paper first puts forward three observations and then draws several conclusions from them: content pages are determined by combining SBPS with the uniformity characteristic of web pages; the concept of an algorithm server is proposed, based on the stability of web sites; and a focused crawler based on dynamic conceptual graphs, together with its framework, is proposed, based on the similarity of classification within a specific theme.

3. In the calculation of theme relevance, this paper combines multiple factors by weighted summation, introduces the concept of a layer to represent the distance between a keyword and the theme, divides keywords more finely when computing layer weights, and takes the forecast node into account when forecasting theme relevance.

4. It proposes a node structure for the conceptual graph, on the basis of which it derives a method for dynamically updating the conceptual graph. To ensure that the theme can be extended while avoiding theme drift, it puts forward the concept of exclusive words and provides two specific methods of theme extension, corresponding to two different ways of extending a theme.
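
The abstract does not give the relevance formula, so the following is only a minimal sketch of how a weighted sum with layer-dependent keyword weights could look. Everything in it is an assumption made for illustration: the names `THEME_LAYERS`, `PART_WEIGHTS`, `layer_weight`, `part_score` and `theme_relevance`, the 1/layer decay, and the choice of title, anchor text and body as the combined factors. It does not model the forecast node or the dynamic updating of the conceptual graph.

```python
import re
from collections import Counter

# Hypothetical theme description: keyword -> layer (1 = closest to the theme).
THEME_LAYERS = {"crawler": 1, "spider": 2, "indexing": 2, "bandwidth": 3}

# Hypothetical weights of the page parts combined by weighted summation.
PART_WEIGHTS = {"title": 0.5, "anchor": 0.3, "body": 0.2}


def layer_weight(layer):
    """Assumed decay: keywords on deeper layers contribute less."""
    return 1.0 / layer


def part_score(text):
    """Layer-weighted share of theme keywords among the words of one part."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return 0.0
    counts = Counter(words)
    hit = sum(layer_weight(layer) * counts[kw]
              for kw, layer in THEME_LAYERS.items())
    return min(1.0, hit / len(words))


def theme_relevance(parts):
    """Weighted sum of the per-part scores; missing parts score 0."""
    return sum(weight * part_score(parts.get(name, ""))
               for name, weight in PART_WEIGHTS.items())
```

A crawler built this way might call `theme_relevance({"title": ..., "anchor": ..., "body": ...})` on each fetched page and follow only the links whose score exceeds a chosen threshold; the threshold itself would have to be tuned per theme.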
Keywords/Search Tags: focused crawler, page segmentation, SBPS, conceptual graph