
The Design And Implementation Of Distributed Web Crawler System Based On Automatic Extraction Of Webpage Information

Posted on: 2022-02-01
Degree: Master
Type: Thesis
Country: China
Candidate: B D Yang
GTID: 2518306338967769
Subject: Electronics and Communications Engineering
Abstract/Summary:
With the explosive growth of data volume and the arrival of the big data era, the Internet generates vast and varied data every day. The information contained in this data has great research and commercial value. Scholars and companies want to obtain the latest valuable information from article-type web pages, such as news pages and policy and regulation pages. They need large amounts of up-to-date data in a consistent format, and they want to obtain it at low cost and with high efficiency. In practice, web pages differ in structure and contain a large amount of noise unrelated to the subject. How to obtain valuable structured information quickly and efficiently from the massive information on the Internet is therefore a problem worth studying. The topic of this thesis comes from an enterprise project. The thesis proposes an article-type web page information extraction algorithm based on visual blocks and node sequence annotation, and designs and implements a distributed web crawler system based on this algorithm. The main work is as follows:

(1) Existing web page information extraction algorithms have limited accuracy, miss information items, and cannot make full use of contextual information. To address these problems, this thesis proposes an article-type web page metadata extraction algorithm based on visual block consistency and sequence annotation. First, according to the visual features of article-type web pages, the thesis preprocesses each page into blocks, dividing the page's nodes into multiple consistent visual blocks. Second, it locates the main area of the page using statistical features and filters out much of the noise. Next, it selects text features, visual features and dictionary features as the feature set and uses a conditional random field model to annotate the node sequence, extracting information such as title, body text, author, source, release time, images and attachments. Finally, the thesis verifies the algorithm through experiments and comparison, and the results show that it achieves better extraction results.

(2) This thesis designs and implements a distributed crawler system. It analyzes the needs of the enterprise and designs the overall architecture of the system and each of its layers. The system is divided into a data acquisition layer, a data analysis layer, a data storage layer, a node access layer and a system management layer. To address the problems of existing distributed crawlers, the thesis uses the automatic web page information extraction algorithm instead of manually written parsing scripts, and proposes a decentralized (centreless) task scheduling strategy based on dynamic feedback, which improves the reliability and efficiency of the system. The thesis describes the design and implementation of the system module by module. Finally, it tests the performance and functionality of the system.
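The abstract does not say how the dynamic-feedback scheduling strategy is computed, so the following is only a minimal sketch of the general idea: with no central scheduler, each crawler node decides for itself how many tasks to pull from a shared queue, based on its own recent runtime statistics. The `NodeFeedback` fields, the `claim_size` function, the scoring formula and all default thresholds are assumptions made for illustration and are not taken from the thesis.

```python
from dataclasses import dataclass


@dataclass
class NodeFeedback:
    """Hypothetical runtime statistics a crawler node keeps about itself."""
    queue_length: int     # tasks already waiting on this node
    avg_latency_s: float  # recent mean fetch latency, in seconds
    success_rate: float   # fraction of recent fetches that succeeded (0..1)


def claim_size(fb: NodeFeedback,
               max_batch: int = 50,
               target_queue: int = 100,
               target_latency_s: float = 1.0) -> int:
    """Return how many tasks this node should pull from the shared queue.

    Assumed scoring rule (not from the thesis): the batch shrinks as the
    local queue fills up, as latency rises above the target, and as the
    success rate drops.
    """
    # Spare capacity: 1.0 when the local queue is empty, 0.0 when it is full.
    spare = max(0.0, 1.0 - fb.queue_length / target_queue)
    # Speed factor: 1.0 at or below the target latency, smaller when slower.
    speed = min(1.0, target_latency_s / max(fb.avg_latency_s, 1e-6))
    score = spare * speed * fb.success_rate
    return int(max_batch * score)


if __name__ == "__main__":
    idle = NodeFeedback(queue_length=0, avg_latency_s=0.5, success_rate=1.0)
    busy = NodeFeedback(queue_length=50, avg_latency_s=2.0, success_rate=1.0)
    print(claim_size(idle), claim_size(busy))
```

Because every node evaluates the same rule against its own statistics, a slow or overloaded node takes less work without any coordinator having to reassign tasks. A real system would also need a shared queue with atomic claims and a way to requeue tasks from failed nodes, which this sketch leaves out.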
Keywords/Search Tags:web page information extraction, automatic, distributed, crawler