
Focused Crawling Based On Page Segmentation Technology

Posted on: 2008-07-10
Degree: Master
Type: Thesis
Country: China
Candidate: C Q Zhang
Full Text: PDF
GTID: 2178360212496904
Subject: Computer software and theory
Abstract/Summary:
With the rapid development of Internet technology, network resources are growing at an incredible speed. For this reason, search engines have become familiar to every web user. Yet even this mature technology can no longer keep pace with the enormous volume of resources on today's Internet: even the best search engines crawl at most 10 percent of all web pages. Most current search engines target all information on the web and can be called general-purpose search engines. As the range of information grows, however, general-purpose engines can no longer satisfy users with specific queries, whose information needs are usually restricted to particular domains and oriented toward particular topics; for such queries the recall and precision of a general-purpose engine are too low. What is needed is a topic-oriented search engine with high classification precision and timely updates.

The focused crawler is the core component of such a special-purpose search engine: it crawls newly published pages and keeps the index database up to date. To some degree, a focused crawler imitates the behavior of a person surfing the web, using the information on each page to guide the collection of related pages. Its purpose is to find pages relevant to a predefined topic quickly and effectively. It does not collect and index every page it can reach, but analyzes only pages related to the topic, thereby saving hardware and network resources.

Today the Internet has become the largest source from which people obtain information. Many information retrieval systems treat a whole page as the smallest, indivisible unit, but a web page is usually not a single semantic unit. A page typically contains many kinds of content, such as navigation bars, copyright notices, advertisements, and the topical content itself; moreover, a page often covers several topics that may be unrelated to one another, for example a blog article that mixes a research introduction with notes about daily life.

The existence of multiple topics can give rise to the "tunnel" problem: between a page we have already fetched and a target page there may lie several pages judged irrelevant to the topic, yet these pages carry the links that connect the two. The situation resembles a car that must pass through a tunnel to reach its destination, the tunnel being the supposedly irrelevant pages. One cannot help asking: if these pages really are irrelevant, why do they contain links leading to our topic? In fact, these so-called irrelevant pages were simply classified incorrectly, and a relevant page was mistaken for an irrelevant one.

Although a page usually contains several topics, current focused crawlers all take the whole page as the processing unit, so they cannot reliably identify the content blocks that are related to the topic, which easily leads to topic drift. To address this, we apply page segmentation to focused crawling: when processing a page, we do not take the whole page as the unit, but rather pieces of content called blocks. Our main contributions are a new page segmentation method and a complete focused crawling system.
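To make the best-first idea concrete, here is a minimal sketch of a focused-crawler loop in Python. It is an illustration only, not the implementation described in this thesis: the fetch, extract_links, and relevance callables, the 0.5 relevance threshold, and the page budget are hypothetical stand-ins for the components the abstract discusses.

```python
import heapq

def focused_crawl(seed_urls, fetch, extract_links, relevance, max_pages=100):
    """Best-first focused crawl: fetch the most promising link first.

    fetch(url) -> page or None, extract_links(page) -> iterable of URLs,
    and relevance(page) -> float in [0, 1] are supplied by the caller;
    they stand in for the downloader, parser, and topic classifier.
    """
    # heapq is a min-heap, so priorities are stored negated.
    frontier = [(-1.0, url) for url in seed_urls]  # seeds get top priority
    heapq.heapify(frontier)
    seen = set(seed_urls)
    collected = []

    while frontier and len(collected) < max_pages:
        _, url = heapq.heappop(frontier)       # most relevant link so far
        page = fetch(url)
        if page is None:
            continue
        score = relevance(page)                # topic classifier's confidence
        if score > 0.5:                        # keep only on-topic pages
            collected.append((url, score))
        for link in extract_links(page):
            if link not in seen:
                seen.add(link)
                # A link inherits the score of the page it was found on;
                # the block-level refinement appears later in this abstract.
                heapq.heappush(frontier, (-score, link))
    return collected
```

The priority queue is the essential ingredient: links inherit the relevance of the page they were found on, so the crawler expands the most promising region of the web graph first instead of crawling breadth-first over everything.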
Our segmentation method exploits three kinds of information in a web page: visual information, tag information, and link information. The visual information includes background color, font size, font color, and so on; the tag information is an ordered collection of HTML tags used to segment the page recursively; and the link information makes use of the "pagelet" concept and anchor text. We propose a number of heuristic rules to control the accuracy and granularity of the blocks produced during segmentation.

The system first parses a web page into a DOM tree and extracts the visual and tag information attached to each node. It then applies our segmentation method to divide the page into blocks, labels each block, and uses heuristic rules to distinguish noise blocks from content blocks. A block-merging step then combines the content blocks, yielding blocks with the following properties: each covers a single topic, loses little content information, and contains no noise. Finally, a naive Bayes classifier identifies the blocks related to the predefined topic; only these relevant blocks are processed further and only their links extracted. The classifier also assigns an importance score to each block, and this score becomes the priority of the links found in the block. The crawl queue is ordered by this priority, so the crawler always fetches the page most relevant to the topic first.

By applying page segmentation to focused crawling, we obtain the blocks relevant to the topic, discard the noise blocks and irrelevant blocks, and analyze only the information within the relevant blocks, so the crawler fetches only relevant pages. The segmentation technology also improves the prediction of a page's relevance before it is fetched. Our experiments indicate that block-based focused crawling achieves a higher harvest ratio than page-based focused crawling; it handles multi-topic pages, effectively avoids the topic-drift phenomenon, and solves the tunnel problem to a certain extent.
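As an illustration of the block-level pipeline, the following sketch pairs a multinomial naive Bayes classifier over block text with priority assignment for the links each block contains. Everything here is an assumption for the example rather than the thesis code: blocks are represented as simple (text, links) pairs as if produced by the segmentation step, and the 0.5 threshold is arbitrary.

```python
import math
from collections import Counter

class NaiveBayesBlockClassifier:
    """Multinomial naive Bayes estimating P(on-topic | block words)."""

    def __init__(self):
        self.word_counts = {True: Counter(), False: Counter()}
        self.class_counts = Counter()

    def train(self, text, on_topic):
        self.class_counts[on_topic] += 1
        self.word_counts[on_topic].update(text.lower().split())

    def score(self, text):
        """Posterior probability that the block is on-topic.

        Assumes train() has seen at least one example of each class.
        """
        words = text.lower().split()
        vocab = set(self.word_counts[True]) | set(self.word_counts[False])
        total_docs = sum(self.class_counts.values())
        log_post = {}
        for label in (True, False):
            total_words = sum(self.word_counts[label].values())
            lp = math.log(self.class_counts[label] / total_docs)  # class prior
            for w in words:
                # Laplace smoothing keeps unseen words from zeroing the product.
                lp += math.log((self.word_counts[label][w] + 1)
                               / (total_words + len(vocab)))
            log_post[label] = lp
        m = max(log_post.values())           # stabilize before exponentiating
        odds = {k: math.exp(v - m) for k, v in log_post.items()}
        return odds[True] / (odds[True] + odds[False])

def prioritise_links(blocks, classifier, threshold=0.5):
    """Score each (text, links) block; links in on-topic blocks inherit the
    block's score as their crawl priority, links in noise blocks are dropped."""
    prioritised = []
    for text, links in blocks:
        s = classifier.score(text)
        if s >= threshold:
            prioritised.extend((link, s) for link in links)
    return sorted(prioritised, key=lambda pair: pair[1], reverse=True)
```

Taking the priority from the block rather than from the whole page is what lets a link inside one relevant block survive even when the rest of the page is off-topic, which is precisely how block-level crawling mitigates topic drift and the tunnel problem.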
Keywords/Search Tags: Segmentation