Link Extraction Algorithm for Forum Crawlers

Posted on: 2014-10-15
Degree: Master
Type: Thesis
Country: China
Candidate: R W Ma
Full Text: PDF
GTID: 2268330401973365
Subject: Computer system architecture
Abstract/Summary:
Forums are an irreplaceable platform on the Internet: people of all kinds organize into communities online and exchange information around the topics that interest them. Forums now produce vast amounts of data, and retrieving the information a user cares about quickly and efficiently is a major challenge for web crawler researchers. A mainstream generic crawler typically reaches a forum through a link on some arbitrary page and uses that page as the starting point for crawling the whole forum. From each crawled page it extracts URLs, assigns them weights with a link analysis algorithm, and compares the weights against a set threshold to decide whether each link is worth crawling. This kind of crawling is time-consuming and labor-intensive. Its final results and efficiency are acceptable, but improving crawling efficiency remains a constant goal for crawler designers and researchers. The purpose of this study is to simplify the link extraction process and to improve crawl recall by locating the forum homepage and using it as the starting point of the crawl. The main work of this paper is as follows:

First, a generic crawler usually starts crawling an entire forum site from an arbitrary page within it. This approach has many problems: the links on such a starting page are not comprehensive, so high coverage is difficult to achieve. Based on observations of forum sites, this paper notes the importance of starting the crawl from the entry page and proposes an algorithm for finding that starting point, which achieved very good results in the experiments.

Second, when crawling a massive number of forum pages, it is particularly important to reach the target pages as quickly and as completely as possible.
After observing forums and analyzing the page structure tree, this paper proposes the EIT path as a model of how forums are browsed. Forum page extraction is thereby transformed from the earlier depth-first traversal into type matching of the relevant ITF links, which greatly reduces the time and space complexity of the crawl.

Third, this paper proposes an ITF link-type extraction algorithm corresponding to the EIT path. The links initially extracted are further filtered for noise through type matching, which reduces the number of useless and duplicate links extracted while preserving the recall rate.

In summary, based on observation and analysis of the structure of forum websites, this study proposes a forum-oriented crawling strategy. Experiments show that the method is effective at improving crawling efficiency and reducing page noise.
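
The type-matching idea in the second and third points can be illustrated with a short sketch. This is not the thesis's algorithm, which the abstract does not specify. It assumes that ITF covers index, thread, and page-flipping links (a common reading in the forum-crawling literature; the abstract does not expand the acronym). The URL shapes, regular expressions, noise parameters, and function names are all hypothetical and would have to be configured or learned per forum.

```python
import re
from urllib.parse import parse_qsl, urlencode, urljoin, urlsplit, urlunsplit

# Assumed query parameters that do not change page content (hypothetical).
NOISE_PARAMS = {"sid", "sessionid", "highlight"}

# Hypothetical link-type patterns for one forum, checked in order.
ITF_PATTERNS = [
    ("flip", re.compile(r"/(?:forum|thread)\.php\?(?:.*&)?page=\d+(?:&|$)")),
    ("index", re.compile(r"/forum\.php\?fid=\d+$")),
    ("thread", re.compile(r"/thread\.php\?tid=\d+$")),
]


def normalize(base, href):
    """Resolve href against base; drop the fragment and noise parameters."""
    parts = urlsplit(urljoin(base, href))
    query = sorted((k, v) for k, v in parse_qsl(parts.query)
                   if k not in NOISE_PARAMS)
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(query), ""))


def classify(url):
    """Return the first matching link type, or None for noise links."""
    for link_type, pattern in ITF_PATTERNS:
        if pattern.search(url):
            return link_type
    return None


def extract_itf_links(base, hrefs):
    """Keep same-site links that match a known type, without duplicates."""
    host = urlsplit(base).netloc
    seen = set()
    result = []
    for href in hrefs:
        url = normalize(base, href)
        if urlsplit(url).netloc != host:
            continue  # off-site link: treated as noise
        link_type = classify(url)
        if link_type is None or url in seen:
            continue  # unknown type or already extracted
        seen.add(url)
        result.append((link_type, url))
    return result
```

Only links that match a known type are queued, so the crawler follows a bounded set of index, thread, and page-flipping URLs instead of traversing every link on a page. Normalizing before matching lets variants of the same URL, such as one carrying a session ID, collapse into a single entry.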
Keywords/Search Tags: Forum Crawler, Page denoising, Algorithm optimization, Forum