
A High Concurrency News Collection System Based On Web Information Extraction

Posted on: 2022-11-29
Degree: Master
Type: Thesis
Country: China
Candidate: J K Han
Full Text: PDF
GTID: 2518306764992479
Subject: Journalism and Media
Abstract/Summary:
With the rapid development of the Internet, news media have used its fast transmission to disseminate news faster and with richer content, and traditional news media have transformed into new media. Given the sheer volume of news on the network, however, extracting text data from across the web quickly and accurately is crucial to later news text mining, news reporting, and public opinion guidance. In recent years, research on template-based web information extraction and on distributed crawler frameworks has made great breakthroughs, but many challenges remain. Web page structures change rapidly, so existing templates stop matching the pages they were built for, and because extraction templates differ from site to site, developing templates for each website takes a great deal of manual effort. In distributed crawler frameworks, the master node bears excessive load and is a single point of failure, which greatly affects both the page collection speed and the security of the whole framework. Against this background, this thesis designs and implements a high-concurrency real-time news collection system by improving both the template-based web information extraction method and the distributed crawler framework. The specific research contents are as follows:

(1) A template-generalization web page information extraction method is proposed. The method extracts web page information automatically by generating XPath expressions and building web page templates according to corresponding rules. Compared with existing web page information extraction techniques, it performs better overall in extraction time, accuracy, and generality.

(2) A high-concurrency distributed crawler framework is designed. The framework crawls with multiple machines and multiple threads. URL crawl machines replace the URL management function that the master node performs in traditional distributed crawler frameworks, crawl machines download the web page HTML files, and the downloaded files are parsed in a unified way through real-time data stream processing combined with the template-generalization extraction method. Experiments show that the framework addresses the low efficiency and security problems of traditional distributed crawler frameworks and improves web page crawling speed.

(3) A high-concurrency news collection system is designed and implemented. With the high-concurrency distributed crawler framework as its core, the system crawls web page files concurrently, extracts news text from them with the template-generalization extraction method, and collects statistics on page crawling and news text parsing so that news text extraction can be monitored in real time. System testing shows that all functions run normally and meet the requirement of real-time collection from news websites across the whole network.

Figure [37] Table [27] Reference [62]...
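
The abstract does not spell out the rules used to generate XPath expressions and templates, so the following is only a minimal sketch of the general idea, not the thesis's method. It assumes a sample page whose field values are already known, derives a positional path to each value, and reuses those paths as a template on another page with the same layout. The function names (`build_xpath`, `make_template`, `extract`) and the sample pages are hypothetical, and the pages are assumed to be well-formed XML so that Python's standard `xml.etree.ElementTree` can parse them; real-world HTML would need a tolerant HTML parser instead.

```python
import xml.etree.ElementTree as ET


def build_xpath(root, target):
    """Return an ElementTree-compatible path from root down to target.

    A positional index is added only when a node has same-tag siblings.
    """
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    steps = []
    node = target
    while node is not root:
        parent = parent_of[node]
        same_tag = [c for c in parent if c.tag == node.tag]
        if len(same_tag) == 1:
            steps.append(node.tag)
        else:
            steps.append(f"{node.tag}[{same_tag.index(node) + 1}]")
        node = parent
    return "./" + "/".join(reversed(steps)) if steps else "."


def make_template(sample_page, examples):
    """Map each field name to the path of the first element whose text
    equals the known example value on the sample page (assumed rule)."""
    root = ET.fromstring(sample_page)
    template = {}
    for field, text in examples.items():
        for element in root.iter():
            if (element.text or "").strip() == text:
                template[field] = build_xpath(root, element)
                break
    return template


def extract(page, template):
    """Apply a template to a page; fields that do not match yield None."""
    root = ET.fromstring(page)
    result = {}
    for field, path in template.items():
        element = root.find(path)
        result[field] = (
            "".join(element.itertext()).strip() if element is not None else None
        )
    return result


# Hypothetical pages that share one layout.
SAMPLE = """<html><body>
<div><h1>Sample headline</h1></div>
<div><p>Sample body text.</p></div>
</body></html>"""

OTHER = """<html><body>
<div><h1>Another headline</h1></div>
<div><p>Another body.</p></div>
</body></html>"""

template = make_template(
    SAMPLE, {"title": "Sample headline", "body": "Sample body text."}
)
extracted = extract(OTHER, template)
```

A purely positional template like this one breaks as soon as a page's structure shifts, which is the fragility the abstract attributes to existing template-based extraction.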
Keywords/Search Tags: Web crawler, News gathering, Information extraction, XPath, Big data, High concurrency framework