
Design And Implementation Of News Website Crawler And Classification Retrieval Platform Based On Microservice

Posted on: 2021-02-18
Degree: Master
Type: Thesis
Country: China
Candidate: B K Chen
GTID: 2428330647956714
Subject: Computer technology
Abstract/Summary:
With the rise of artificial intelligence, deep learning is widely used in natural language processing, knowledge graphs, computer vision, and other fields. Applying deep learning usually requires large amounts of data to train the relevant models, so obtaining training data quickly and efficiently is the first problem to solve when implementing a deep learning algorithm. To address the problem of collecting text data for natural language processing, this thesis proposes a news website crawler system based on an automatic parsing algorithm. Combined with text classification and a news retrieval system, it greatly reduces the cost of data acquisition and can provide data services to other systems. The main work of this thesis is to extract the key information of web pages using web crawler technology based on automatic parsing, store the data after text classification, and finally provide a full-text retrieval data service. Following requirements analysis, the platform is divided into three subsystems: a web crawler system, a text classification system, and a news retrieval system. The work completed for each subsystem is as follows:
(1) A distributed crawler for news and blog websites is designed and implemented on Spring Cloud. It consists of four modules: scheduling, downloading, parsing, and saving. An automatic parsing algorithm based on text density extracts the title, publication time, content, and other information from different websites. Kafka serves as the messaging middleware between the crawler modules to improve the overall throughput of the system.
(2) A Chinese text classification algorithm is implemented using the Bidirectional Encoder Representations from Transformers (BERT) model as the word vector model and the Deep Pyramid Convolutional Neural Networks for Text Categorization (DPCNN) model as the text classification model.
(3) A search platform is implemented on Elasticsearch, a full-text search engine, with news stored in different indexes according to category.
Testing in an actual production environment shows that the system can greatly reduce the cost of obtaining text data and bring great convenience to practitioners in the field of natural language processing.
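The abstract names a text-density-based automatic parsing algorithm but does not give its details. As a rough illustration of the general idea only, the following Python sketch scores each block element by its non-link text length divided by the number of tags it contains, and returns the text of the highest-scoring block. The scoring formula, the set of block tags, the function name `extract_main_text`, and the sample page are all assumptions made for this example; they are not taken from the thesis.

```python
from html.parser import HTMLParser

# Assumption: these tags are treated as candidate content blocks.
BLOCK_TAGS = {"div", "article", "section", "td"}
# Text inside these tags is never counted as page content.
SKIP_TAGS = {"script", "style"}


class DensityParser(HTMLParser):
    """Collects text length, link-text length and tag count per block element."""

    def __init__(self):
        super().__init__()
        self.stack = []       # block elements that are currently open
        self.blocks = []      # block elements that have been closed
        self.skip_depth = 0   # > 0 while inside <script> or <style>
        self.link_depth = 0   # > 0 while inside <a>

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        if tag == "a":
            self.link_depth += 1
        # Every tag counts against all blocks that enclose it.
        for node in self.stack:
            node["tags"] += 1
        if tag in BLOCK_TAGS:
            self.stack.append({"text": [], "chars": 0, "link_chars": 0, "tags": 1})

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1
        if tag == "a" and self.link_depth:
            self.link_depth -= 1
        if tag in BLOCK_TAGS and self.stack:
            self.blocks.append(self.stack.pop())

    def handle_data(self, data):
        text = data.strip()
        if not text or self.skip_depth:
            return
        # Text counts toward every enclosing block.
        for node in self.stack:
            node["text"].append(text)
            node["chars"] += len(text)
            if self.link_depth:
                node["link_chars"] += len(text)


def density(node):
    """Hypothetical score: non-link characters per tag inside the block."""
    return (node["chars"] - node["link_chars"]) / node["tags"]


def extract_main_text(html):
    """Return the text of the densest block, or an empty string if none exists."""
    parser = DensityParser()
    parser.feed(html)
    if not parser.blocks:
        return ""
    best = max(parser.blocks, key=density)
    return " ".join(best["text"])


# Made-up sample page: a link-heavy navigation bar, an article body, a footer.
SAMPLE_HTML = """
<html><body>
<div id="nav"><a href="/">Home</a> <a href="/world">World</a> <a href="/tech">Tech</a></div>
<div id="main">
  <p>The city council approved the new transit plan on Monday after a long debate.</p>
  <p>Construction is expected to begin next spring and finish within three years.</p>
</div>
<div id="footer"><a href="/about">About</a> <a href="/contact">Contact</a></div>
</body></html>
"""

if __name__ == "__main__":
    print(extract_main_text(SAMPLE_HTML))
```

Navigation bars and footers consist mostly of link text spread over many tags, so they score close to zero, while the article body has long runs of plain text under few tags. A production parser would additionally need rules for the title and publication time, which this sketch does not attempt.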
Keywords/Search Tags: Microservice Architecture, Crawler, BERT, DPCNN, Text Classification