Partition-Based Web Page Prediction

Posted on: 2006-07-22  Degree: Master  Type: Thesis
Country: China  Candidate: Y Wang  Full Text: PDF
GTID: 2168360155452976  Subject: Computer software and theory
Abstract/Summary:
Recently, topical crawling has become one of the hot research topics in the field of web information retrieval. Its ultimate goal is to collect web pages related to the topics users search for. Web page prediction is the key step of the whole collection process, and therefore the key to topical crawling. Generally speaking, users with knowledge of the relevant field are very good at predicting whether an upcoming result is something they are interested in. Getting a computer to simulate this human prediction behavior, however, is quite challenging. The first challenge is how to acquire field-related knowledge during the web page prediction process. The second is how to screen pages that are ostensibly irrelevant to the user's topic, the so-called "tunnel" pages. The third is how to make web page prediction improve its accuracy automatically through continuous learning. The last is how to make web page prediction adapt to the increasingly complex pages found on the Internet. Studying the traditional web page prediction methods, we find that what their algorithms have in common is that they take the whole page as the smallest processing unit. In other words, they use feature vectors to represent each kind of information about a candidate page, such as content information and link information. This kind of processing gave quite satisfactory results in experiments from the early days of the Internet, because most web pages at that time were static HTML pages; moreover, page structure was very simple and the content of a page involved only a single theme. That is to say, most web pages were static and single-theme.
For these pages, with appropriate feature selection, a single feature vector is entirely sufficient to represent all of the page's feature information. But with the rapid development of the Internet, the number of pages has grown exponentially and pages have become far more complex: they have shifted from a static, single-theme model to a dynamic, multi-theme model. It is now increasingly difficult to tell a web page from a web site. In such situations, it is almost impossible to represent all the feature information on a web page with a single feature vector. It becomes easy to overlook important details that are highly relevant to the user's topic, so that the relevant parts are crawled late or missed altogether. This problem will grow more serious as pages become more complex. Therefore, we propose a new web page prediction method with better adaptability. Building on traditional web page prediction, it integrates page partitioning with web page prediction. First, using appropriate page partitioning, it divides each candidate page into several maximal "pagelets" (self-contained regions of a page), each with only one explicit theme or function. Second, we use an "interest ratio" to measure the degree of relevance between a candidate page and the user's topic. We calculate the interest ratio between a pagelet and the user's topic from the relevant field knowledge accumulated during crawling, such as content information, address information, parent-link information and sibling-link information. Third, we use the probability model obtained from experiments to weight the interest ratio, giving a "weighted interest ratio". Lastly, we make the web page prediction according to the weighted interest ratio.
In this way, the traditional coarse-grained processing method, which takes the whole page information as the smallest processing...
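
The contrast between whole-page scoring and pagelet-level scoring described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation, and everything specific in it is an assumption made for illustration: the interest ratio is approximated by bag-of-words cosine similarity on content only (address and link information are ignored), the pagelets are supplied by hand rather than produced by a partitioning algorithm, the per-pagelet weights are hypothetical stand-ins for the experimentally derived probability model, and taking the maximum over pagelets is one possible aggregation rule among several.

```python
from collections import Counter
from math import sqrt


def interest_ratio(text: str, topic: str) -> float:
    """Cosine similarity between bag-of-words vectors (assumed stand-in
    for the thesis's interest ratio, which also uses link/address data)."""
    a, b = Counter(text.lower().split()), Counter(topic.lower().split())
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def weighted_interest_ratio(pagelets, topic, weights) -> float:
    """Score each (kind, text) pagelet, scale it by a hypothetical per-kind
    weight, and keep the best score (max aggregation is an assumption)."""
    scores = [weights.get(kind, 1.0) * interest_ratio(text, topic)
              for kind, text in pagelets]
    return max(scores, default=0.0)


# Hand-made example page: three pagelets, only one of them on topic.
topic = "topical crawler web page prediction"
pagelets = [
    ("nav", "home news sports weather login"),
    ("main", "stock market prices fell today as investors sold shares"),
    ("sidebar", "topical crawler web page prediction thesis"),
]
weights = {"nav": 0.2, "main": 1.0, "sidebar": 0.8}  # hypothetical values

# Whole-page scoring: one vector for the entire page dilutes the on-topic part.
whole = interest_ratio(" ".join(text for _, text in pagelets), topic)
# Pagelet-level scoring: the on-topic sidebar is scored on its own.
best = weighted_interest_ratio(pagelets, topic, weights)

print(f"whole-page score: {whole:.3f}")
print(f"best weighted pagelet score: {best:.3f}")
```

In this toy page the single on-topic pagelet scores higher on its own than the page does as a whole, which is the dilution effect the abstract attributes to treating the whole page as the smallest processing unit.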
Keywords/Search Tags: Partition-Based