
Research and Design of Word Segmentation System for News Webpages Based on Crawler

Posted on: 2022-07-21  Degree: Master  Type: Thesis
Country: China  Candidate: Y H Qin  Full Text: PDF
GTID: 2518306485458644  Subject: Software engineering
Abstract/Summary:
In the context of big data, how to obtain large amounts of data and process them into the form needed for research has become a primary issue in the field. In news data research, such as the prediction of hot topics in online news and the monitoring of news public opinion, the acquisition of news data and word segmentation are both the foundation and the key. Collecting and storing target news data in a short time, and segmenting the text accurately, affect many aspects of subsequent work: for example, the accuracy of word segmentation results directly determines the accuracy of later data analysis.

Aiming at these two problems, target data acquisition and text segmentation, this thesis takes the CCTV news website as an example. Based on the characteristics of its webpage structure and data distribution, the Scrapy web crawler framework combined with a Bloom filter algorithm is used to solve the news data collection and storage problem. The Albert (A Lite BERT, a lightweight BERT) pre-trained language model, built on the bidirectional Transformer structure, is combined with a CRF (Conditional Random Field) to construct an Albert-CRF model for the text segmentation task. Through research, design, and experiments, a CCTV news webpage data collection module based on the Scrapy framework and a news webpage word segmentation system based on the Albert-CRF model are implemented, and it is verified that this approach provides effective and reliable data collection and text segmentation methods that meet the needs of the research.

The main work is divided into the following aspects: (1) Research and analyze the webpage characteristics and data distribution of the CCTV news website, combine the Scrapy framework with a Bloom filter deduplication algorithm to develop automatic data collection based on news categories, and realize
the data collection and storage of news webpages. (2) Following the "People's Daily" corpus format, construct a corpus of social news topics for model training and for the text segmentation tasks of the news webpage word segmentation system. (3) Given the maximum input-sequence length of the Albert pre-trained model, design a text preprocessing module that splits texts longer than 512 characters and applies unified, standardized processing to the text content. (4) Combining the Albert pre-trained language model with the tag-constraint ability of the CRF, propose and construct an Albert-CRF word segmentation model for news data. Under the same experimental environment, hyperparameter settings, and corpus, the Albert model and the proposed Albert-CRF model are both used in text segmentation experiments; the experiments show that attaching a CRF layer after the Albert model effectively constrains the output tag sequence and improves word segmentation performance.

On news webpage data collection, the thesis mainly introduces web crawler technology, the principles and workflow of the Scrapy framework, and the design of the news data collection module. On word segmentation, it briefly reviews the development of segmentation technology and then covers the pre-trained models used in this research: the structure and mechanism of the Transformer, the principles and structure of the BERT and Albert pre-trained models, the prediction algorithm of the conditional random field (CRF), and the related experiments.
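The Bloom filter deduplication mentioned in aspect (1) can be sketched as follows. This is a minimal illustration only, assuming salted MD5 hashing and an in-memory bit array; it is not the thesis's actual implementation:

```python
import hashlib

class BloomFilter:
    """A minimal Bloom filter for URL deduplication (illustrative sketch)."""

    def __init__(self, size=2 ** 20, num_hashes=5):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = bytearray(size // 8 + 1)

    def _positions(self, item):
        # Derive several bit positions from salted MD5 digests.
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{item}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # May report false positives, never false negatives.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

In a Scrapy project such a filter would typically back a custom dupe filter (e.g. a subclass of `scrapy.dupefilters.BaseDupeFilter` whose `request_seen` checks and adds the request URL), trading a small false-positive rate for far lower memory use than an exact seen-URL set.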
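The length-limited preprocessing of aspect (3) can be illustrated with a simple splitter. The 510-character budget (leaving room for the `[CLS]` and `[SEP]` tokens) and the sentence-boundary heuristic are assumptions for illustration, not the thesis's exact rules:

```python
import re

MAX_LEN = 510  # assumed budget: 512 minus [CLS] and [SEP]

def split_text(text, max_len=MAX_LEN):
    """Split text into chunks no longer than max_len, breaking at
    sentence-ending punctuation where possible."""
    sentences = re.split(r"(?<=[。！？!?])", text)
    chunks, current = [], ""
    for sent in sentences:
        if not sent:
            continue
        # A single over-long sentence is cut hard at max_len.
        while len(sent) > max_len:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sent[:max_len])
            sent = sent[max_len:]
        if len(current) + len(sent) > max_len:
            chunks.append(current)
            current = sent
        else:
            current += sent
    if current:
        chunks.append(current)
    return chunks
```

Because the split uses a lookbehind, no characters are dropped: concatenating the chunks reproduces the original text.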
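The tag-constraint ability of the CRF layer in aspect (4) amounts, at decoding time, to forbidding invalid transitions in the BMES tagging scheme commonly used for Chinese word segmentation. A minimal Viterbi decoder over assumed per-character emission scores (in practice produced by the Albert encoder; the scores and constraint table here are illustrative) can be sketched as:

```python
import math

TAGS = ["B", "M", "E", "S"]  # Begin, Middle, End, Single-character word

# Valid successors in the BMES scheme: e.g. B may only be followed by M or E,
# so the decoder can never emit an ill-formed sequence like B -> S.
ALLOWED = {
    "B": {"M", "E"},
    "M": {"M", "E"},
    "E": {"B", "S"},
    "S": {"B", "S"},
}

def viterbi(emissions):
    """Decode the highest-scoring valid BMES tag sequence.
    `emissions` is a list of dicts: per-character tag scores."""
    NEG_INF = -math.inf
    # Only B or S may start a word sequence.
    best = {t: ((emissions[0].get(t, NEG_INF) if t in ("B", "S") else NEG_INF), [t])
            for t in TAGS}
    for emit in emissions[1:]:
        new_best = {}
        for tag in TAGS:
            score, path = max(
                ((best[prev][0] + emit.get(tag, NEG_INF), best[prev][1] + [tag])
                 for prev in TAGS if tag in ALLOWED[prev]),
                key=lambda x: x[0],
            )
            new_best[tag] = (score, path)
        best = new_best
    # Only E or S may end a sequence.
    final = max((best[t] for t in ("E", "S")), key=lambda x: x[0])
    return final[1]
```

A trained CRF learns real-valued transition scores rather than hard allow/forbid sets, but the effect reported in the thesis, that the CRF layer constrains the output tag sequence, is exactly this kind of transition filtering.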
Keywords/Search Tags:News data, Chinese word segmentation, Scrapy, Albert, Transformer