| In recent years, programmed investment management using reinforcement learning has become a popular research direction in financial investment. In this paper, we combine semi-Markov process (SMP) theory with reinforcement learning to construct a programmed stock trading model that can guide investors' investment practice. First, we describe the theoretical basis of the model, including Q-learning for Markov decision processes (MDP), Q-learning for SMP, and related methods such as the K-means algorithm, which is used to construct discrete environment states. Compared with MDP-based Q-learning, SMP-based Q-learning considers not only the current state of the environment but also the residence time in that state. Second, the key components of the SMP-based Q-learning model are the environment state setting, the action setting, and the reward setting. Twenty new energy stocks are randomly selected for training and backtesting, and the results are compared with the MDP-based Q-learning model and a buy-and-hold model, using cumulative return, annualized return, and Sharpe ratio to measure performance. The empirical results show that, when K-means clusters the states into 3 classes, both Q-learning models achieve robust returns on individual stocks and are much more risk-resistant than the buy-and-hold strategy. The average cumulative return (116.54%), average annualized return (30.17%), and average Sharpe ratio (66.65%) of the SMP-based Q-learning model exceed those of the MDP-based Q-learning model (47.62%, 14.37%, and 42.25%, respectively), indicating that the SMP-based Q-learning model can achieve higher returns when taking the same risk. The same conclusion holds when K-means clusters the states into 6 and 9 classes; the average cumulative return, average annualized return, and average Sharpe ratio of the SMP-based Q-learning model are largest with 9 classes, where the model is most effective. Finally, to improve the cumulative return of the SMP-based Q-learning model, feature screening is performed on a total of 10 features covering financial, technical, and macroeconomic indicators, based on the principle of maximizing cumulative return and the ranking of each feature combination. The optimal feature combination consists of V (daily volume change rate) among trading indicators, PE (price-earnings ratio) among financial indicators, MACD (moving average convergence divergence) and ADO (accumulation/distribution oscillator) among technical indicators, and IBO007 (interest rate) among macroeconomic indicators. The empirical results show that the model improves in both returns and stability with this combination of five features. Therefore, the SMP-based Q-learning model constructed in this paper has reference value for institutional and individual investors developing programmed investment management strategies. |
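
To make the difference between the MDP-based and SMP-based variants concrete, the following is a minimal sketch, not the model used in this paper. It assumes that the discrete environment states are cluster labels (for example, from K-means), that the residence time is the number of consecutive steps spent in the same cluster, and that the update follows the standard semi-Markov form in which the bootstrap term is discounted by `gamma ** tau`. The names `smp_states`, `smdp_q_update`, `ACTIONS`, and `MAX_SOJOURN`, as well as the values of `GAMMA` and `ALPHA`, are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

GAMMA = 0.95                       # per-step discount factor (assumed value)
ALPHA = 0.1                        # learning rate (assumed value)
ACTIONS = ("buy", "hold", "sell")  # hypothetical action set
MAX_SOJOURN = 5                    # cap on tracked residence time (assumption)


def smp_states(labels, max_sojourn=MAX_SOJOURN):
    """Map a sequence of cluster labels to (label, residence_time) states.

    residence_time counts consecutive steps spent in the same cluster,
    capped at max_sojourn so the state space stays finite.
    """
    states = []
    prev, run = None, 0
    for label in labels:
        run = run + 1 if label == prev else 1
        states.append((label, min(run, max_sojourn)))
        prev = label
    return states


def smdp_q_update(Q, s, a, reward, s_next, tau, alpha=ALPHA, gamma=GAMMA):
    """Apply one tabular semi-Markov Q-learning update and return Q[(s, a)].

    tau is the number of time steps elapsed between the two decisions;
    the value of the next state is discounted by gamma ** tau.
    """
    best_next = max(Q[(s_next, b)] for b in ACTIONS)
    target = reward + (gamma ** tau) * best_next
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q[(s, a)]


# Usage: cluster labels for six trading days, then a single update.
states = smp_states([0, 0, 1, 1, 1, 0])
Q = defaultdict(float)  # unseen (state, action) pairs start at 0.0
smdp_q_update(Q, states[0], "buy", reward=1.0, s_next=states[1], tau=1)
```

With `tau = 1` at every step, the update reduces to ordinary Q-learning on the augmented state, so under these assumptions the only difference from the MDP-based variant is that the residence time is part of the state.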