Value-based methods and policy-based methods are the two main families of deep reinforcement learning applied to quantitative trading strategies. Deep Q-learning, the representative value-based method, earns good profits in monotonic market environments but loses heavily when trends change. Deep recurrent reinforcement learning, the representative policy-based method, performs much better in fluctuating markets; however, the need to discretize its outputs when making decisions and the lack of a value function to correct the direction of parameter updates during training limit the model capacity and hence reduce profits. To earn higher profits, we show how to apply deep actor-critic methods to quantitative trading strategies, focusing on increasing model capacity, improving adaptability to new trends, and accelerating convergence.

To achieve these goals, we propose a quantitative trading strategy model based on deep policy gradient methods, called deep actor critic trading (DACT). First, we propose DACT with state value (DACT-SV), which applies deep actor-critic methods to quantitative trading to improve model capacity and adaptability. Second, we propose DACT with Q value (DACT-QV), which replaces the state value network with a Q value network. To generalize better, we share the LSTM network that extracts the features of the financial environment; to further improve adaptability, we perform internal bagging on the Q network and the policy network; and to speed up convergence, we adopt a parallel exploration mechanism. Finally, we verify the effectiveness of DACT by comparing it with deep Q trading (DQT) and deep recurrent reinforcement trading (DRRT) on the stock index data SSE 50, CSI 300, and CSI 500.

The main innovations and contributions of our work are as follows:

1) Implementation and improvement of DACT-SV. We apply the deep actor-critic method to trading problems. For regularization, a shared LSTM network is adopted for feature extraction (an illustrative sketch of this architecture is given after the list). Experimental results show that the daily average profit of vanilla DACT-SV on CSI 300 from 2013 to 2018 is 1.61 points; the version using separate LSTMs earns 0.34 points more, and the version using a shared LSTM earns a further 0.33 points.

2) Implementation and improvement of DACT-QV. We replace the state value network in DACT-SV with a Q value network; parallel exploration is used to accelerate training, and voting-based bagging is used to improve adaptability. Experimental results show that the daily average profit of DACT-QV trained for 20 epochs per round is 2.14 points, which is comparable to DACT-SV trained for 100 epochs per round while taking only a quarter of the time.

3) Comparison with DQT and DRRT. Experimental results show that the daily average profit of DACT on CSI 300 from 2005 to 2018 is 2.67 points, 1.46 points more than DQT and 1.02 points more than DRRT; on SSE 50 from 2004 to 2018 it is 2.28 points, 1.17 points more than DQT and 0.56 points more than DRRT; and on CSI 500 from 2007 to 2018 it is 5.38 points, 3.5 points more than DQT and 1.6 points more than DRRT.
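The sketch below illustrates the kind of shared-LSTM actor-critic network described for DACT-SV: one LSTM extracts features from a window of market observations, and separate heads produce the trading action distribution (policy) and the state value (critic). It is a minimal assumption-laden sketch, not the paper's implementation; the layer sizes, the 30-day observation window, the five input indicators, and the three-way action space (long, neutral, short) are all illustrative choices of ours.

```python
import torch
import torch.nn as nn


class SharedLSTMActorCritic(nn.Module):
    """Illustrative DACT-SV-style network (not the authors' code).

    A shared LSTM extracts features from a window of market observations;
    a policy head outputs trading-action probabilities and a value head
    outputs the state value used to correct the policy update direction.
    All sizes and the action space are assumptions for demonstration.
    """

    def __init__(self, n_features: int, hidden_size: int = 64, n_actions: int = 3):
        super().__init__()
        # Shared feature extractor over the price/indicator time series.
        self.lstm = nn.LSTM(n_features, hidden_size, batch_first=True)
        # Actor head: probabilities over trading positions (e.g. long/neutral/short).
        self.policy_head = nn.Linear(hidden_size, n_actions)
        # Critic head: scalar state value.
        self.value_head = nn.Linear(hidden_size, 1)

    def forward(self, obs_window: torch.Tensor):
        # obs_window: (batch, window_length, n_features)
        features, _ = self.lstm(obs_window)
        last = features[:, -1, :]                  # features at the final time step
        action_probs = torch.softmax(self.policy_head(last), dim=-1)
        state_value = self.value_head(last).squeeze(-1)
        return action_probs, state_value


if __name__ == "__main__":
    net = SharedLSTMActorCritic(n_features=5)
    window = torch.randn(8, 30, 5)                 # 8 samples, 30-day window, 5 indicators
    probs, values = net(window)
    print(probs.shape, values.shape)               # torch.Size([8, 3]) torch.Size([8])
```

For DACT-QV, one would analogously replace the scalar value head with a head estimating a Q value per action, with the Q network and policy network still sharing the LSTM features; that variant is not shown here.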