
The Research On Key Techniques For Page Segmentation Based Forum Crawler

Posted on: 2010-09-30
Degree: Master
Type: Thesis
Country: China
Candidate: D F Zhang
GTID: 2178360332957872
Subject: Computer Science and Technology
Abstract/Summary:
With the development of Web 2.0 technology, Web forums, a typical application of user-generated content, have become popular worldwide. Internet users have created countless topics and discussions in forum pages, so the data carried by forums amounts to a huge collection of human knowledge. As the amount of information in forums grows, collecting forum content accurately and efficiently faces enormous challenges, and an effective forum crawler system is urgently needed.

This thesis focuses on forum crawler technology. Through an in-depth study of the characteristics of Internet forums, the working principles of web crawlers, and related techniques, it introduces the idea of dividing web pages into blocks (page segmentation) into forum crawling, and examines crawling strategies and how to crawl pages across different forums and servers. The main results can be summarized as follows.

First, based on extensive investigation and in-depth analysis of Internet forums, the thesis summarizes the basic characteristics of forums. It then analyzes the problems that existing crawlers encounter when crawling forums and identifies their root causes. To solve these problems, and inspired by the crawling strategies of topic-focused crawlers, it introduces page segmentation into the forum crawler, together with several ways to optimize the crawler.

Second, the thesis studies several page segmentation algorithms and proposes a vertical segmentation algorithm based on web page structure (WPS-VSA).
Experimental results on several forums show that WPS-VSA generalizes well and segments forum pages with high precision.

Third, the thesis studies a number of web crawling strategies and proposes a general crawling algorithm for the majority of Internet forums: a forum crawling algorithm based on web page segmentation (WPS-FCA). The algorithm filters out links to invalid pages online and exploits the characteristics of forums to solve the page-flipping problem, which lays a good foundation for applications that mine forum content. Experimental results show that it saves the network bandwidth needed to download forum pages and the storage space needed to keep them, greatly improves the accuracy and coverage of the crawled forum pages, and makes the forum data easier for a variety of applications to use.

Finally, based on the theoretical results above, the thesis designs and implements a prototype forum crawler system.
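The abstract does not give the details of WPS-VSA or WPS-FCA, so the following is only a minimal sketch of the general idea it describes: split a page into blocks, discard blocks dominated by invalid links, and separate page-flipping links from content links. Every name and heuristic here (`BLOCK_TAGS`, `INVALID_HINTS`, `PAGE_PARAMS`, `plan_crawl`, the "half of the links" threshold) is a hypothetical assumption for illustration, not the thesis's actual method.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse, parse_qs

# All names and heuristics below are illustrative assumptions, not the thesis's rules.
BLOCK_TAGS = {"div", "table", "ul"}      # tags treated as block boundaries
INVALID_HINTS = ("login", "register", "reply", "profile", "search")
PAGE_PARAMS = ("page", "p", "start")     # query keys that suggest page flipping


class BlockLinkParser(HTMLParser):
    """Group links by their innermost enclosing block (assumes well-nested HTML)."""

    def __init__(self):
        super().__init__()
        self.stack = []    # names of the block elements currently open
        self.blocks = {}   # block name -> hrefs whose innermost block it is
        self.count = 0     # used to name blocks that have no id attribute

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in BLOCK_TAGS:
            self.count += 1
            self.stack.append(attrs.get("id") or f"block-{self.count}")
        elif tag == "a" and attrs.get("href"):
            name = self.stack[-1] if self.stack else "root"
            self.blocks.setdefault(name, []).append(attrs["href"])

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS and self.stack:
            self.stack.pop()


def segment_links(html):
    """Return a dict mapping each block name to the links it contains."""
    parser = BlockLinkParser()
    parser.feed(html)
    parser.close()
    return parser.blocks


def is_invalid(url):
    """Guess that a link leads to a page with no forum content (login, search, ...)."""
    path = urlparse(url).path.lower()
    return any(hint in path for hint in INVALID_HINTS)


def is_page_flip(url):
    """Guess that a link is a pagination link from a numeric page-like query key."""
    query = parse_qs(urlparse(url).query)
    return any(name in query and query[name][0].isdigit() for name in PAGE_PARAMS)


def plan_crawl(base_url, html):
    """Return (content_links, page_flip_links) as absolute URLs."""
    content, flips = [], []
    for hrefs in segment_links(html).values():
        urls = [urljoin(base_url, href) for href in hrefs]
        noisy = sum(is_invalid(url) for url in urls)
        if noisy * 2 >= len(urls):
            continue  # at least half the links look invalid: drop the whole block
        for url in urls:
            if is_invalid(url):
                continue
            (flips if is_page_flip(url) else content).append(url)
    return content, flips
```

Calling `plan_crawl(base_url, html)` on a downloaded forum page returns two lists: links worth fetching as content and links that move to the next page of a board or thread. Filtering at the block level, rather than link by link, is what lets a crawler skip navigation bars and user panels without downloading the pages behind them.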
Keywords/Search Tags: web crawler, forum, page segmentation, crawling strategy, document structure model