| In China, the tobacco industry is a controversial state-monopoly industry, and its activities can easily trigger heated public discussion. The industry has always attached great importance to monitoring online public opinion, but most of its internal monitoring is still performed manually. This paper designs and implements an online public opinion monitoring system for the tobacco industry. The system uses web crawling technology to collect and process massive data, achieves comprehensive crawling of tobacco-related web content, and provides users with visualized public opinion query, topic tracking, statistical analysis, and public opinion monitoring services. First, the paper surveys the state of research on online public opinion monitoring, web crawling, natural language processing, and machine learning. Second, it analyzes the collection targets and requirements of the system, and carries out the overall architecture design, database design, and subsystem design according to the requirements and business processes. The system is divided into three subsystems: public opinion collection, public opinion application, and system management. The implementation part describes how each subsystem is built. Data collection is based on the Python Scrapy crawler framework, with custom strategies to cope with websites' anti-crawling measures and Selenium to crawl dynamically rendered pages. Data are cleaned, extracted, and filtered using regular expressions and XPath selectors together with the Pandas and NumPy libraries. On top of Jieba Chinese word segmentation, a tobacco-related public opinion dictionary and a part-of-speech library are constructed, and custom rules are defined for extracting tobacco-related public opinion keywords. Word vectors are trained with Word2Vec, PCA is used for dimensionality reduction, and an SVM model is used for machine learning and sentiment-word-based analysis of text orientation. Wordcloud generates word clouds, Matplotlib draws charts, and a Web project provides the system's visualization. Finally, the paper tests and analyzes the system, covering the test content, test methods, and test results. The design and implementation of the system greatly improve the efficiency of monitoring tobacco-related public opinion and have practical significance for maintaining the social image of the tobacco industry. |