The gold futures market is active, is one of the earliest futures markets, and plays an important role in society. Predicting the price of gold futures not only helps investors obtain decent returns but also helps hedgers develop sound strategies. At present, research on gold futures price forecasting falls into two categories. The first is based on market conditions, such as the supply and demand relationship of gold futures over a period of time, relevant policies formulated by the state, and investor sentiment in the market. However, this approach is relatively subjective and difficult to quantify. The second is based on quantitative financial market data, which are used to build a model that predicts the future price trend of gold futures. This approach is relatively objective, but it is difficult to comprehensively determine the relevant variables for predicting the price of gold futures, which results in low model accuracy, and it ignores important factors such as investor sentiment and policy that are difficult to quantify. This paper proposes a gold futures price prediction framework based on text data and multi-dimensional feature engineering, aiming to improve the accuracy of the gold futures price prediction model from the perspectives of variable selection and feature engineering. Based on the text data of news headlines, the Sentence-LDA topic model is used to extract five types of features that affect gold futures prices from a large number of news headlines: gold futures itself, metal futures, other industries, exchange rates, and indices. Then, according to these five types of features, 229 candidate variables that affect the gold futures price are determined. First, uninformative variables are removed, and features that are linearly or non-linearly related to gold futures prices are selected using cross-filtering. Then, for the remaining 49 features, such as silver futures open interest, the EUR/RMB exchange rate, and the US dollar index, three types of
features are constructed, namely the lag sequence containing lagged information, the difference sequence containing incremental information, and the derivative sequence containing trend information, so that information is extracted from multiple perspectives. Building on the classical tree-model embedded method, an improved embedded method, Ta-CFI, is proposed from the perspectives of the number of feature splits, information gain, and sample coverage. Finally, Ta-CFI is used to rank the features by importance, and 50 variables, such as the closing price of silver futures, the USD/RMB exchange rate, hot-rolled coils, and derivatives of gold futures, are determined as the final model features. In addition, a Bi-LSTM model is used for sentiment analysis to quantify factors such as investor sentiment and policy contained in news headlines. These sentiment features are combined with the 50 variables determined above, and the reconstructed data are used as the model input; model performance before and after adding the sentiment features is compared to verify their validity. The benchmark features for comparison are 11 widely used variables, such as the gold futures opening price, highest price, lowest price, trading volume, open interest, MACD, and DMI. Three different types of models were selected to predict gold futures prices: the machine learning model support vector regression (SVR), the ensemble model extreme gradient boosting (XGBoost), and the deep learning model fully connected neural network (MLP). All empirical data come from the WIND database, and the performance of the models under different feature sets is compared to verify the effectiveness of the method proposed in this paper. The results show that, on all three types of models, the models built from the variables determined by the proposed framework, which combines news headlines with multi-dimensional feature engineering, far outperform those built from the 11 benchmark features, and the addition of sentiment features helps to
improve the performance of the model. For XGBoost, the test-set MSE of the model built from the benchmark features is 0.0021, while that of the method proposed in this paper is 0.0007, one third of the benchmark value; adding the sentiment features further reduces the MSE to 0.00045. Therefore, the gold futures price prediction framework proposed in this paper improves the forecasting performance of the model, can effectively predict future gold futures prices, and provides a reference for investors, which gives it practical value.
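
The three constructed feature families (lag, difference, and derivative sequences) can be illustrated with a minimal sketch. This is not the paper's implementation: the function names and the sample prices are hypothetical, and the "derivative" is assumed here to be a least-squares slope over a trailing window, since the abstract does not define how it is computed.

```python
from typing import List, Optional, Sequence

Feature = List[Optional[float]]


def lag_sequence(series: Sequence[float], k: int = 1) -> Feature:
    """Lagged information: the value observed k steps earlier (None where unavailable)."""
    if k < 0:
        raise ValueError("k must be non-negative")
    return ([None] * k + list(series))[: len(series)]


def difference_sequence(series: Sequence[float], k: int = 1) -> Feature:
    """Incremental information: change relative to the value k steps earlier."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return [None if t < k else series[t] - series[t - k] for t in range(len(series))]


def trend_slope_sequence(series: Sequence[float], window: int = 3) -> Feature:
    """Trend information (assumed definition): least-squares slope over a trailing window."""
    if window < 2:
        raise ValueError("window must be at least 2")
    x_mean = (window - 1) / 2
    denom = sum((x - x_mean) ** 2 for x in range(window))
    out: Feature = []
    for t in range(len(series)):
        if t < window - 1:
            out.append(None)  # not enough history yet
            continue
        chunk = series[t - window + 1 : t + 1]
        y_mean = sum(chunk) / window
        num = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(chunk))
        out.append(num / denom)
    return out


# Made-up illustrative values, not data from the paper.
prices = [10.0, 12.0, 15.0, 14.0]
features = {
    "lag_1": lag_sequence(prices, 1),
    "diff_1": difference_sequence(prices, 1),
    "slope_3": trend_slope_sequence(prices, 3),
}
```

Each function returns a sequence aligned with the input, with `None` marking positions that lack enough history, so the three families can be placed side by side as model columns before importance ranking.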