In the optimal strategy learning based on the visual perception task,learning algorithms combine the visual perception task and the decision control task in a general form,and directly maps the input image to the control of the agent in an end-to-end way.The method has been widely used in many fields.However,optimal strategy learning methods usually focus on the design of expert samples and algorithms,ignoring the importance of feature extraction.Simple models makes the ability of feature extraction poor.In addition,the algorithms usually only pay attention to improve the performance of the strategy,while ignoring the potential factors that affect the decision-making of the strategy,making it difficult to explain.Finally,in the dataset of learning strategies,non-stationary distribution of samples often exists,which limit the learning strategies.This paper studies these problems and proposes a learning model based on LSTM-FCN(Long Short Term Memory-Fully Convolutional Network),which is dedicated to studying an end-to-end optimal policy learning method.Specific research contents are as follows:(1)Aiming at the problem of poor feature extraction ability,a DF-PLSTM-FCN(A Dual-Fusion-Based Parallel LSTM-FCN Model)network model is proposed,which uses dual feature fusion and decision fusion technology.The accuracy of image recognition is improved,and the generalization ability of the model is enhanced.(2)In order to discover the hidden factors in the learned strategy,the optimal strategy learning method LSTM-FCN-Policy(Long Short Term Memory-Fully Convolutional Network-Policy)based on the generation of adversarial imitation learning model is proposed.Not only can it simulate expert samples,but also use the DF-PLSTM-FCN network to learn the hidden factors in the strategy.(3)In order to alleviate the policy limitation problem caused by uneven data distribution in the process of learning strategy,the LSTM-FCN-P-Optimal(Long Short Term Memory-Fully Convolutional Network-Policy-Optimal)model is proposed.The model uses the optimized priority sequence experience playback mechanism to optimize the LSTMFCN-Policy model,which further improves the learning efficiency of the model. |