
Research On Off-policy Evaluation Based On Key Trajectory Mining

Posted on: 2022-01-14
Degree: Master
Type: Thesis
Country: China
Candidate: S R Wang
Full Text: PDF
GTID: 2518306563477664
Subject: Cyberspace security

Abstract/Summary:
In reinforcement learning (RL) applications, off-policy evaluation (OPE) is used to avoid unexpected risks before an RL policy is deployed, which makes it promising for fields such as robotics and autonomous driving. Without actually executing the target policy, OPE estimates its state values from previously collected trajectories. The learning goal of OPE is generally to minimize the mean squared error between the estimated state values and the values the target policy would actually achieve. With the development of AI and RL, OPE, as a key RL technology, has attracted considerable interest in top AI journals and conferences in recent years.

This thesis systematically reviews the main OPE methods of the last twenty years, including direct-model-based, importance-sampling-based, hybrid-model-based, and PU-learning-based methods. We present the related theory of OPE, describe in detail the differences in mechanism and model among these methods, and give a complete comparison of the methods and their corresponding models. Although OPE has made great progress, it still faces many challenges due to the difference between the behavior policy and the target policy, and the possible reward sparseness of the behavior policy in some emerging applications. This thesis carries out research on the following two aspects.

First, aiming at the problem that current off-policy evaluation algorithms can suffer from large variance on long trajectory data, this thesis proposes an off-policy evaluation algorithm for long trajectory data (LC-OPE). The algorithm identifies, from changes in the state distribution, the special state that minimizes the mean squared error, and splits each long trajectory at the position of this state into two sections, converting it into short trajectory data. This allows off-policy evaluation algorithms to be applied more effectively to long-trajectory environments. Compared with existing off-policy evaluation algorithms, the proposed algorithm reduces variance and weakens the constraints on the environment.

Second, we further present an off-policy evaluation algorithm based on key trajectory deviation constraints (KS-OPE), addressing the fact that existing methods treat all states uniformly during policy selection in the complex and variable application environments of reinforcement learning. The algorithm makes full use of the latent information in the trajectory data, uses the Apriori algorithm to mine key states, adds a deviation constraint on these key states, and thereby improves the ability of off-policy evaluation to identify better policies during policy selection.

We verify the proposed methods experimentally in three environments commonly used in this field, Model Fail, Model Win, and Grid World, and in two more complex environments, Flappy Bird and SpaceInvaders-v0. In the experiments we compare a variety of off-policy evaluation algorithms, and the results show that the methods proposed in this thesis evaluate policies more accurately. We then compare and analyze the applicability of the two algorithms: their performance is up to 99% and 87.9% better than that of the worst estimator, respectively. The results show that the off-policy evaluation algorithm for long trajectory data is better suited to long-trajectory environments, and that the off-policy evaluation algorithm based on key trajectory deviation constraints achieves our goal of better identifying the optimal policy.
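To make the estimation goal above concrete, the following is a minimal sketch of ordinary (trajectory-wise) importance-sampling OPE, the family of methods the thesis builds on. It is not the thesis's own implementation; the function names pi_target and pi_behavior and the tuple layout of the trajectories are illustrative assumptions. Trajectories collected under the behavior policy are reweighted by the cumulative ratio of target-to-behavior action probabilities, and the reweighted returns are averaged to estimate the target policy's value.

    import numpy as np

    def importance_sampling_ope(trajectories, pi_target, pi_behavior, gamma=0.99):
        """Ordinary importance-sampling estimate of the target policy's value.

        trajectories: list of trajectories, each a list of (state, action, reward).
        pi_target(a, s), pi_behavior(a, s): action probabilities under the
        target and behavior policies (assumed known).
        """
        estimates = []
        for traj in trajectories:
            weight, ret = 1.0, 0.0
            for t, (s, a, r) in enumerate(traj):
                # Cumulative importance ratio corrects the mismatch between the
                # behavior policy that generated the data and the target policy.
                weight *= pi_target(a, s) / pi_behavior(a, s)
                ret += (gamma ** t) * r
            estimates.append(weight * ret)
        # Unbiased but potentially high-variance: the weight is a product over
        # all time steps, so its variance grows with trajectory length.
        return float(np.mean(estimates))

Because the importance weight is a product over every time step, its variance grows rapidly with trajectory length, which is the issue that motivates splitting long trajectories at a suitable state as in LC-OPE.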
Keywords: artificial intelligence, reinforcement learning, off-policy evaluation, importance sampling, data mining