| Background: Since the outbreak of the novel coronavirus disease (COVID-19) in 2019, it has spread widely around the world and become a public health emergency of international concern. Accurate prediction of the outbreak has become an active research topic. Existing prediction methods have four main shortcomings. First, the data used are mainly the daily epidemic figures compiled by the Chinese Center for Disease Control and Prevention (China CDC) and released through the national infectious disease surveillance system, which are limited in volume and timeliness. Second, although machine learning models have begun to be applied to the prediction task, several models are available and it is unclear which is most suitable for COVID-19 prediction. Third, the choice of model parameters affects prediction accuracy, and the method for selecting them needs further optimization. Fourth, whether a proposed model can be applied in a real system still needs thorough study. Aims: Forecasting studies of COVID-19 provide a timely and accurate overview of how the epidemic is developing. This supports the formulation of national policies, helps scientists, doctors and other experts design appropriate control measures, and helps prevent panic from spreading among the general public due to lack of knowledge, thereby contributing to effective control of the disease and providing technical support for its eventual eradication. Forecasting studies are therefore of great significance for timely warning and effective control of the epidemic.
Methods: Considering that a large amount of data about COVID-19 accumulated on social media during the outbreak, this paper proposes a method for collecting data about the disease from China CDC and from social networks, and analyzes the characteristics of the two data sources and the correlation between them. To select the most suitable machine learning model for the prediction, Random Forest (RF), Back Propagation Neural Network (BP), Multivariable Linear Regression (MLR), K-Nearest Neighbors (KNN), Logistic Regression (LR), Light Gradient Boosting Machine (LightGBM) and other common machine learning models are compared. To improve prediction accuracy, Grid Search (GS), Random Search (RS) and Bayesian Search (BRS) are investigated as parameter tuning methods, and improved machine learning models are proposed on this basis. In addition, practical programming experiments are conducted to assess the usability of the improved model. Results: Using the proposed collection method, historical COVID-19 data sets from China CDC and from social networks (Baidu Index and 360 Index) were obtained for the period from 1 January 2020 to 14 December 2020. The numerical simulations show that social network data for the keywords "novel coronavirus", "dry cough", "fever", "novel pneumonia", "dyspnoea", "cough" and "covid-19" correlate strongly with the China CDC data. The comparison of six commonly used machine learning algorithms gives the ranking RF > LightGBM > MLR > LR > BP > KNN in terms of prediction accuracy. The comparison of the three tuning algorithms gives the ranking BOA-LightGBM > Random Search-LightGBM > Grid Search-LightGBM > LightGBM in terms of the effectiveness of LightGBM parameter selection, where BOA denotes the Bayesian approach. Based on this, both the commonly used and the improved machine learning models were implemented in Python using the Spyder integrated development environment.
Conclusions: Introducing social network data on the COVID-19 epidemic into the prediction is feasible and effective; the machine learning model combining RF with LightGBM can improve prediction accuracy; the proposed BOA-LightGBM model can be used as a new model for prediction; and the implementation on the machine learning platform shows that the proposed method can be applied in practice. The results of this paper can guide the formulation of COVID-19 prevention and control plans and provide important technical support for controlling and eliminating the spread of the epidemic. |
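The correlation analysis between keyword search indices and officially reported case counts can be illustrated with a small sketch. This is not the paper's code: the function names `pearson`, `lagged_correlations` and `best_lag` are hypothetical, the use of a time lag is an assumption added for illustration, and the numbers are made-up toy values rather than Baidu Index, 360 Index or China CDC data.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("sequences must have equal length of at least 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def lagged_correlations(index, cases, max_lag):
    """Correlate index[t] with cases[t + lag] for lag = 0..max_lag.

    A positive lag means the search index leads the reported cases.
    """
    result = {}
    for lag in range(max_lag + 1):
        x = index[: len(index) - lag]  # drop the last `lag` index values
        y = cases[lag:]                # drop the first `lag` case values
        result[lag] = pearson(x, y)
    return result


def best_lag(index, cases, max_lag):
    """Return (lag, correlation) for the lag with the highest correlation."""
    correlations = lagged_correlations(index, cases, max_lag)
    lag = max(correlations, key=correlations.get)
    return lag, correlations[lag]


if __name__ == "__main__":
    # Toy values only (assumption): daily search index for one keyword, and
    # case counts built to follow the index two days later.
    search_index = [10, 30, 80, 60, 40, 25, 15, 12, 10, 9]
    reported_cases = [5, 8, 20, 60, 160, 120, 80, 50, 30, 24]
    for lag, r in lagged_correlations(search_index, reported_cases, 3).items():
        print(f"lag={lag} days  r={r:.3f}")
    print("best lag:", best_lag(search_index, reported_cases, 3))
```

Under these assumptions, a keyword whose index correlates strongly with later case counts would be a candidate input feature for the prediction models; keywords with weak correlation at every lag could be dropped.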