Policy Evaluation in Statistical Reinforcement Learning
While Reinforcement Learning (RL) has achieved phenomenal success in diverse fields in recent years, the statistical properties of the underlying algorithms are still not fully understood. One key aspect in this regard is policy evaluation, that is, estimating the value associated with the policy an RL agent follows. In this dissertation, we propose two statistically sound methods for policy evaluation and inference, and study their theoretical properties within the RL setting.
In the first work, we propose an online bootstrap method for statistical inference in policy evaluation. The bootstrap is a flexible and efficient approach for inference in online learning, but its efficacy in the RL setting has yet to be explored. Existing methods for online inference are restricted to settings involving independently sampled observations. In contrast, our method is shown to be distributionally consistent for statistical inference in policy evaluation under Markovian noise, which is a standard assumption in the RL setting. To demonstrate the effectiveness of our method in practical applications, we include several numerical simulations involving the temporal difference (TD) learning and Gradient TD (GTD) learning algorithms across a range of real RL environments.
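To make the idea of online bootstrap inference for TD learning concrete, the following is a minimal sketch and not the dissertation's algorithm. It assumes a generic multiplier bootstrap: each bootstrap copy runs the same tabular TD(0) update on the same Markovian trajectory, but with its step scaled by an independent random weight of mean 1 and variance 1. The function name `td_online_bootstrap`, the exponential weights, the step-size schedule, and the use of Polyak averaging are all illustrative assumptions.

```python
import numpy as np

def td_online_bootstrap(P, r, gamma=0.5, n_steps=10000, n_boot=50, seed=0):
    """Tabular TD(0) on a Markov reward process with multiplier-bootstrap copies.

    Hypothetical sketch: P is the transition matrix under the evaluated policy,
    r the per-state reward. Returns the averaged TD iterate and the averaged
    bootstrap iterates (one row per bootstrap copy).
    """
    rng = np.random.default_rng(seed)
    n_states = P.shape[0]
    theta = np.zeros(n_states)              # original TD iterate
    theta_b = np.zeros((n_boot, n_states))  # perturbed (bootstrap) iterates
    avg = np.zeros(n_states)                # running average of theta
    avg_b = np.zeros((n_boot, n_states))    # running averages of theta_b
    s = 0
    for t in range(1, n_steps + 1):
        # Markovian noise: the next state depends on the current state.
        s_next = rng.choice(n_states, p=P[s])
        alpha = 0.5 / t ** 0.6              # assumed polynomial step size

        # Standard TD(0) update on the observed transition.
        delta = r[s] + gamma * theta[s_next] - theta[s]
        theta[s] += alpha * delta

        # Bootstrap copies: same transition, randomly reweighted step.
        w = rng.exponential(1.0, size=n_boot)  # mean 1, variance 1 (assumed)
        delta_b = r[s] + gamma * theta_b[:, s_next] - theta_b[:, s]
        theta_b[:, s] += alpha * w * delta_b

        # Online (Polyak) averaging, so no trajectory needs to be stored.
        avg += (theta - avg) / t
        avg_b += (theta_b - avg_b) / t
        s = s_next
    return avg, avg_b

# Toy two-state chain, chosen only for illustration.
P = np.array([[0.7, 0.3], [0.4, 0.6]])
r = np.array([1.0, 0.0])
avg, avg_b = td_online_bootstrap(P, r, gamma=0.5, seed=1)

# Percentile-style 95% interval for each state's value from the bootstrap copies.
lower, upper = np.percentile(avg_b, [2.5, 97.5], axis=0)
print("estimate:", avg)
print("interval:", lower, upper)
```

Because every bootstrap copy is updated in the same pass as the original iterate, the interval is available at any time step without revisiting past data, which is the practical appeal of an online scheme.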
In the second work, we propose a tensor Markov Decision Process framework for modeling the evolution of a sequential decision-making process when the state-action features are tensors. Under this framework, we develop a low-rank tensor estimation method for off-policy evaluation in batch RL. The proposed estimator approximates the Q-function using a tensor parameter embedded with low-rank structure. To overcome the challenge of nonconvexity, we introduce an efficient block coordinate descent approach with closed-form solutions to the alternating updates. Under standard assumptions from the tensor and RL literature, we establish an upper bound on the statistical error which guarantees a sub-linear rate of computational error. We provide numerical simulations to demonstrate that our method significantly outperforms standard batch off-policy evaluation algorithms when the true parameter has a low-rank tensor structure.
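The alternating structure of block coordinate descent with closed-form updates can be illustrated on a simplified problem. The sketch below is an assumption-laden stand-in, not the proposed estimator: it fits a rank-1, order-3 tensor parameter in a plain least-squares regression rather than in the Bellman-based off-policy evaluation objective. The function name `rank1_tensor_regression`, the rank-1 restriction, and the simulated data are hypothetical. With two factors held fixed, the model is linear in the third, so each block update is an ordinary least-squares solve even though the joint problem is nonconvex.

```python
import numpy as np

def rank1_tensor_regression(X, y, n_iters=50, seed=0):
    """Fit y_n ~ <X_n, a o b o c> by block coordinate descent.

    Hypothetical sketch: X has shape (n, p, q, r); returns the fitted
    rank-1 tensor of shape (p, q, r).
    """
    rng = np.random.default_rng(seed)
    _, p, q, r = X.shape
    a, b, c = rng.normal(size=p), rng.normal(size=q), rng.normal(size=r)
    for _ in range(n_iters):
        # Fix (b, c): contract them out so the model is linear in a.
        Za = np.einsum('nijk,j,k->ni', X, b, c)
        a = np.linalg.lstsq(Za, y, rcond=None)[0]
        # Fix (a, c): closed-form least-squares update for b.
        Zb = np.einsum('nijk,i,k->nj', X, a, c)
        b = np.linalg.lstsq(Zb, y, rcond=None)[0]
        # Fix (a, b): closed-form least-squares update for c.
        Zc = np.einsum('nijk,i,j->nk', X, a, b)
        c = np.linalg.lstsq(Zc, y, rcond=None)[0]
    return np.einsum('i,j,k->ijk', a, b, c)

# Simulated data with a known rank-1 parameter (illustrative values only).
rng = np.random.default_rng(0)
a_true = np.array([1.0, -2.0, 0.5])
b_true = np.array([0.5, 1.0, -1.0, 2.0])
c_true = np.array([1.5, -0.5])
B_true = np.einsum('i,j,k->ijk', a_true, b_true, c_true)
X = rng.normal(size=(2000, 3, 4, 2))
y = np.einsum('nijk,ijk->n', X, B_true) + 0.1 * rng.normal(size=2000)

B_hat = rank1_tensor_regression(X, y)
print("relative error:", np.linalg.norm(B_hat - B_true) / np.linalg.norm(B_true))
```

The rank-1 fit here estimates 3 + 4 + 2 factor entries instead of all 24 tensor entries, which hints at why low-rank structure can help when the true parameter really has it.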
- Doctor of Philosophy
- West Lafayette