
Extracting Precise Link Context From Web Page

Posted on: 2005-08-28
Degree: Master
Type: Thesis
Country: China
Candidate: Q Y Xu
Full Text: PDF
GTID: 2168360125950335
Subject: Software and theory
Abstract/Summary:
It is reasonable to assume that the anchor text in an HTML file, together with its relevant context, contains concise but precise semantic clues about the content of the target page, and that these hints are usually sufficient for human readers to decide whether to follow the link. Not surprisingly, link contexts have been exploited extensively ever since the advent of the World Wide Web. For instance, Google uses anchor text to index URLs; in the CLEVER topic distillation system, hyperlinks are weighted by the relevance of their link context to a topic-specific query in order to mitigate the topic drift problem of the HITS algorithm; and some researchers have investigated the feasibility of using link context to supplement, or even replace, document content when categorizing web pages. In scenarios where access to the target pages is prohibitively expensive, link context is the natural source of evidence to extract and put to best use. Such is the case in focused crawling, or topical crawling, which aims at crawling topic-specific web pages and whose success depends on exploiting relevant information from visited pages to predict the relevance of unvisited pages as precisely as possible.

In spite of its apparent significance, the problem of extracting precise link context has not been fully explored, and many state-of-the-art extraction methods are based on simplistic heuristics and require ad hoc parameters. Anchor text seems reliably indicative of the target page's theme, but its characteristic terseness limits recall from an information retrieval perspective, and naive, exclusive reliance on it can even degrade retrieval performance, as several researchers have consistently confirmed. Besides anchor text, the neighboring text around the anchor and the preceding header text have been considered as candidates for link context. However, these unfiltered contexts may introduce considerable noise and, compared with anchor text, such lower-quality text usually dilutes the relevance of the link context further.

In this paper, a novel extraction approach based on a natural language processing technique, namely an English parser, is proposed. We argue that such an NLP tool can filter out irrelevant or noisy words while extracting relevant, high-quality context. Consequently, we achieve a fine-grained simulation of human readers' browsing behavior. A preliminary experiment demonstrated its superiority over other extraction methods.
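For illustration only, the baseline heuristics criticized above (anchor text alone, or anchor text plus a fixed window of neighboring words) might be sketched as follows in Python. This is not the parser-based method proposed in the thesis, which the abstract does not specify in detail. The names LinkContextParser and link_contexts and the window parameter are hypothetical; window is exactly the kind of ad hoc parameter the abstract refers to.

```python
from html.parser import HTMLParser


class LinkContextParser(HTMLParser):
    """Collect the page's text tokens and record where each anchor's text sits.

    Hypothetical helper for illustration; not the thesis's extraction method.
    """

    def __init__(self):
        super().__init__()
        self.tokens = []   # all text tokens, in document order
        self.links = []    # (href, start_index, end_index) for each anchor
        self._open = None  # (href, start_index) of the anchor currently open

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._open = (href, len(self.tokens))

    def handle_endtag(self, tag):
        if tag == "a" and self._open is not None:
            href, start = self._open
            self.links.append((href, start, len(self.tokens)))
            self._open = None

    def handle_data(self, data):
        # Whitespace tokenization; a real system would normalize further.
        self.tokens.extend(data.split())


def link_contexts(html, window=5):
    """Return anchor text and a fixed-size word window around each link.

    `window` is an ad hoc parameter (an assumption of this sketch): the number
    of neighboring words kept on each side of the anchor text.
    """
    parser = LinkContextParser()
    parser.feed(html)
    parser.close()
    results = []
    for href, start, end in parser.links:
        lo = max(0, start - window)
        results.append({
            "href": href,
            "anchor": " ".join(parser.tokens[start:end]),
            "context": " ".join(parser.tokens[lo:end + window]),
        })
    return results


page = ('<p>Read the survey on <a href="/crawl.html">focused crawling</a> '
        'for topic specific pages.</p>')
for link in link_contexts(page, window=2):
    print(link["href"], "|", link["anchor"], "|", link["context"])
```

The window is applied blindly, so every word inside it enters the context whether or not it relates to the link. That unfiltered noise is what the abstract argues a parser-based filter could reduce.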
Keywords/Search Tags: Extracting