Please use this identifier to cite or link to this item:
http://arks.princeton.edu/ark:/99999/fk4k660p1h
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | Wang, Mengdi | |
dc.contributor.author | Duan, Yaqi | |
dc.contributor.other | Operations Research and Financial Engineering Department | |
dc.date.accessioned | 2022-06-15T15:17:35Z | - |
dc.date.available | 2022-06-15T15:17:35Z | - |
dc.date.created | 2022-01-01 | |
dc.date.issued | 2022 | |
dc.identifier.uri | http://arks.princeton.edu/ark:/99999/fk4k660p1h | - |
dc.description.abstract | Policy evaluation is a central problem in batch reinforcement learning. It refers to the assessment of a given decision policy using logged data. Practical methods for policy evaluation typically involve some form of function approximation to cope with the enormous scale of real-world systems and the length of planning horizons. This thesis is primarily concerned with the statistical analysis of policy evaluation and shows how function approximation improves the efficacy of these methods. In the first part of the thesis, we consider off-policy evaluation with linear function approximation. We show the equivalence of a regression-based fitted Q-iteration method, marginalized importance sampling methods, and a model-based method that estimates a conditional mean embedding of the transition operator. Moreover, our theory reveals that the hardness of off-policy evaluation is determined by the mismatch between the data and target distributions, which is reflected by a projected chi-square divergence in the error bounds. We prove that the estimators are minimax optimal in terms of sample size, planning horizon, and the mismatch term. In the second part of the thesis, we study on-policy evaluation with kernel methods. In particular, we are interested in a regularized form of the kernel least-squares temporal-difference (LSTD) estimate (a minimal illustrative sketch of the LSTD idea follows this metadata table). We use techniques from empirical process theory to derive a non-asymptotic upper bound on the error with explicit dependence on the eigenvalues of the associated kernel operator, as well as the instance-dependent variance of the Bellman residual error. In addition, we prove minimax lower bounds over sub-classes of MRPs (Markov reward processes), which show that our rate is optimal in terms of the sample size and the effective horizon. Our analysis sheds light on how to tune the complexity of the function class to strike a favorable balance between the curse of dimension and the curse of horizon in policy evaluation. | |
dc.format.mimetype | application/pdf | |
dc.language.iso | en | |
dc.publisher | Princeton, NJ : Princeton University | |
dc.relation.isformatof | The Mudd Manuscript Library retains one bound copy of each dissertation. Search for these copies in the library's main catalog: http://catalog.princeton.edu | |
dc.subject | distributional shift | |
dc.subject | kernel methods | |
dc.subject | linear function approximation | |
dc.subject | policy evaluation | |
dc.subject | reinforcement learning | |
dc.subject | temporal difference learning | |
dc.subject.classification | Operations research | |
dc.subject.classification | Statistics | |
dc.subject.classification | Computer science | |
dc.title | Policy Evaluation in Batch Reinforcement Learning | |
dc.type | Academic dissertations (Ph.D.) | |
pu.date.classyear | 2022 | |
pu.department | Operations Research and Financial Engineering | |
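For readers unfamiliar with temporal-difference estimation, the following is a minimal sketch of a regularized LSTD-style value estimate with generic linear features and ridge regularization. It illustrates the general technique only, not the specific estimators analyzed in the thesis; the function name `lstd_policy_evaluation`, the regularization parameter `reg`, and the toy two-state chain are assumptions made for this example.

```python
import numpy as np

def lstd_policy_evaluation(phi, phi_next, rewards, gamma=0.99, reg=1e-3):
    """Regularized LSTD estimate of a value function from logged transitions.

    phi      : (n, d) array of features of visited states s_i
    phi_next : (n, d) array of features of successor states s_i'
    rewards  : (n,) array of observed rewards r_i
    Returns a weight vector theta with V_hat(s) ~= phi(s) @ theta.
    """
    n, d = phi.shape
    # Empirical LSTD system A theta = b, with a ridge term for numerical stability.
    A = phi.T @ (phi - gamma * phi_next) / n + reg * np.eye(d)
    b = phi.T @ rewards / n
    return np.linalg.solve(A, b)

# Toy usage: a random two-state chain with one-hot features (hypothetical data).
rng = np.random.default_rng(0)
states = rng.integers(0, 2, size=500)
next_states = (states + rng.integers(0, 2, size=500)) % 2
rewards = (states == 1).astype(float)
phi = np.eye(2)[states]
phi_next = np.eye(2)[next_states]
theta = lstd_policy_evaluation(phi, phi_next, rewards, gamma=0.9)
print("estimated state values:", theta)
```

With one-hot features over a finite state space, this reduces to a regularized tabular TD fixed-point solve; richer feature maps or kernel evaluations would play the role of the linear and kernel function classes discussed in the abstract.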
Appears in Collections: | Operations Research and Financial Engineering |
Files in This Item:
File | Size | Format |
---|---|---|
Duan_princeton_0181D_14137.pdf | 2.92 MB | Adobe PDF |
Items in Dataspace are protected by copyright, with all rights reserved, unless otherwise indicated.