Collection of technical/paper notes related to reinforcement learning, with compact summaries and detailed mathematical derivations.



A collection of notes on important papers, with a focus on reinforcement learning. Each note starts with a compact summary of the paper's key idea: which problem the paper tries to solve and how it solves it. The second part gives the detailed mathematical derivations.

Structure of notes:

  • Compact summary of the key idea: which problem does the paper try to solve, and how does it solve it?
  • Mathematical derivations of useful tools presented in the paper.
  • File type:
    • Non-math (easy and quick): Markdown and figures
    • Math-based: PDF+LaTeX

Main focus:

Based on my own research interests, the notes focus mainly on:

  • Model-based RL
  • Optimization
  • Information theory

Table of contents


Policy Gradients

  • Silver et al., A2C
  • Mnih et al., IMPALA
  • Silver et al., DPG/DDPG
  • Kakade, Approximately Optimal Approximate Reinforcement Learning
  • Kakade, A Natural Policy Gradient
  • Schulman et al., TRPO/PPO
  • Ba et al., ACKTR
  • Schulman, GAE
  • Wang et al., ACER
  • Gu et al., Q-Prop
  • Gruslys et al., Reactor
  • Liu et al., Stein Variational Policy Gradient
  • Gu et al., Interpolated Policy Gradient: Merging On-Policy and Off-Policy Gradient Estimation for Deep Reinforcement Learning


  • Silver et al., DQN/Double DQN/Dueling DQN
  • Silver et al., Prioritized experience replay
  • Bellemare et al., A Distributional Perspective on Reinforcement Learning
  • Silver et al., Rainbow: Combining Improvements in DRL

Model-based RL & Planning

  • Doll et al., The ubiquity of model-based reinforcement learning
  • Tamar et al., Value Iteration Networks
  • Tamar et al., Learning Generalized Reactive Policies using Deep Neural Networks
  • Tamar et al., Learning Plannable Representations with Causal InfoGAN
  • Singh et al., Value Prediction Networks
  • Lin et al., Value Propagation Networks
  • Lee et al., Gated Path Planning Networks
  • Salakhutdinov et al., LSTM Iteration Networks: An Exploration of Differentiable Path Finding
  • Abbeel et al., Universal Planning Networks
  • Wierstra et al., Learning Dynamic State Abstractions for Model-Based Reinforcement Learning
  • Gal et al., Improving PILCO with Bayesian Neural Network Dynamics Models
  • Meger et al., Synthesizing Neural Network Controllers with Probabilistic Model-based Reinforcement Learning
  • Levine et al., Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models
  • Wierstra et al., Learning model-based planning from scratch
  • Gu et al., Continuous Deep Q-Learning with Model-based Acceleration
  • Lecun et al., Model-Based Planning in Discrete and Continuous Actions
  • Silver et al., The Predictron: End-To-End Learning and Planning
  • Weber et al., Imagination-Augmented Agents for Deep Reinforcement Learning
  • Li et al., Iterative Linear Quadratic Regulator Design for Nonlinear Biological Movement Systems
  • Chockalingam et al., Differentiable Neural Planners with Temporally Extended Actions
  • Mishra et al., Prediction and Control with Temporal Segment Models
  • Metz et al., Discrete Sequential Prediction of Continuous Actions for Deep RL
  • Moerland et al., Learning Multimodal Transition Dynamics for Model-Based Reinforcement Learning
  • Chiappa et al., Recurrent Environment Simulators
  • Vinyals et al., Metacontrol for adaptive imagination-based optimization
  • Gerstner et al., Efficient Model-based Deep Reinforcement Learning with Variational State Tabulation
  • Dinh et al., Learning Awareness Models
  • Abbeel et al., Model-ensemble Trust-Region Policy Optimization
  • Levine et al., Model-based Value Expansion for Efficient Model-Free Reinforcement Learning
  • Levine et al., Recall Traces: Backtracking Models for Efficient Reinforcement Learning
  • Levine et al., Neural Network Dynamics for Model-Based Deep Reinforcement Learning with Model-Free Fine-Tuning
  • Levine et al., Temporal Difference Models: Model-Free Deep RL for Model-Based Control
  • Gregor et al., Temporal Difference Variational Auto-Encoder
  • Abbeel et al., SOLAR: Deep Structured Latent Representations for Model-Based Reinforcement Learning
  • Schölkopf et al., Adaptive Skip Intervals: Temporal Abstraction for Recurrent Dynamical Models
  • Singh et al., Improving model-based RL with Adaptive Rollout using Uncertainty Estimation
  • Abbeel et al., Model-Based Reinforcement Learning via Meta-Policy Optimization

RL Theory

  • Osband et al., A Tutorial on Thompson Sampling (Journal version, 2018)
  • Osband et al., (More) efficient reinforcement learning via posterior sampling
  • Osband et al., Why is Posterior Sampling Better than Optimism for Reinforcement Learning?
  • Nachum et al., Bridging the Gap Between Value and Policy Based Reinforcement Learning
  • Bellemare et al., Increasing the Action Gap: New Operators for Reinforcement Learning
  • Tishby et al., A Unified Bellman Equation for Causal Information and Value in Markov Decision Processes
  • Dai et al., SBEED: Convergent Reinforcement Learning with Nonlinear Function Approximation
  • Meger et al., Addressing Function Approximation Error in Actor-Critic Methods
  • Schaul et al., Universal Value Function Approximators
  • Levine, Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review

Misc RL

  • Schmidhuber, PowerPlay: Training an increasingly general problem solver by continually searching for the simplest still unsolvable problem
  • Dopamine: A Research Framework for Deep Reinforcement Learning
  • Salimans et al., Evolution Strategies as a Scalable Alternative to Reinforcement Learning
  • Silver et al., Memory-based control with recurrent neural networks
  • Rusu et al., Policy Distillation
  • Schulman et al., Teacher-Student Curriculum Learning
  • Rezende et al., Interaction Networks for Learning about Objects, Relations and Physics
  • Silver et al., Learning Continuous Control Policies by Stochastic Value Gradients
  • Silver et al., Continuous control with deep reinforcement learning
  • Osband et al., Randomized Prior Functions for Deep Reinforcement Learning
  • Clopath et al., Continual Reinforcement Learning with Complex Synapses
  • Lin et al., Intrinsic Motivation and Automatic Curricula via Asymmetric Self-Play
  • Oudeyer et al., Intrinsically Motivated Goal Exploration Processes with Automatic Curriculum Learning
  • Oudeyer et al., Unsupervised Learning of Goal Spaces for Intrinsically Motivated Goal Exploration
  • Sutton et al., Between MDPs and Semi-MDPs: Learning, planning, and representing knowledge at multiple temporal scales
  • Silver et al., FeUdal Networks for Hierarchical Reinforcement Learning
  • Silver et al., Meta-Gradient Reinforcement Learning
  • Abbeel et al., Continuous Adaptation via Meta-Learning in Nonstationary and Competitive Environments
  • Abbeel et al., Learning to Adapt: Meta-Learning for Model-Based Control
  • Schulman et al., On First-Order Meta-Learning Algorithms
  • Schaul et al., Learning to learn by gradient descent by gradient descent
  • Abbeel et al., Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks
  • Botvinick et al., Learning to Reinforcement Learn
  • Osband et al., Deep Exploration via Bootstrapped DQN
  • Abbeel et al., VIME: Variational Information Maximizing Exploration
  • Ostrovski et al., Count-Based Exploration with Neural Density Models
  • Tang et al., #Exploration: A Study of Count-based Exploration for Deep Reinforcement Learning
  • Fortunato et al., Noisy Networks for Exploration
  • Plappert et al., Parameter Space Noise for Exploration
  • Bellemare et al., Unifying Count-Based Exploration and Intrinsic Motivation
  • Levine et al., EX2: Exploration with Exemplar Models for Deep Reinforcement Learning
  • Moerland et al., The Potential of the Return Distribution for Exploration in RL
  • Pineau et al., Randomized Value Functions via Multiplicative Normalizing Flows
  • Abbeel et al., Incentivizing Exploration In Reinforcement Learning With Deep Predictive Models
  • Riedmiller et al., Learning by Playing - Solving Sparse Reward Tasks from Scratch
  • Riedmiller et al., Maximum a Posteriori Policy Optimisation
  • Riedmiller et al., Graph networks as learnable physics engines for inference and control

Optimization & Variational Inference

  • Bottou, Stochastic Gradient Descent Tricks
  • Bottou et al., Optimization Methods for Large-Scale Machine Learning
  • Martens et al., Optimizing Neural Networks with Kronecker-factored Approximate Curvature
  • Barber et al., Variational Optimization
  • Grathwohl et al., Backpropagation through the Void: Optimizing control variates for black-box gradient estimation
  • Blei et al., Variational Inference: A Review for Statisticians
  • Grosse et al., Noisy Natural Gradient as Variational Inference
  • Whiteson et al., DiCE: The Infinitely Differentiable Monte-Carlo Estimator
  • Maddison et al., The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables
  • Gu et al., Categorical Reparameterization with Gumbel-Softmax
  • Barber et al., Stochastic Variational Optimization
  • Martens et al., New insights and perspectives on the natural gradient method

Misc ML

  • Normalization tricks
    • Batch norm, layer norm
  • Schon et al., Manipulating the Multivariate Gaussian Density
  • Dumoulin, A guide to convolution arithmetic for deep learning
  • Kingma et al., Auto-Encoding Variational Bayes
  • Rusu et al., Progressive Neural Networks
  • Kirkpatrick et al., Overcoming catastrophic forgetting in neural networks
  • Graves et al., Automated Curriculum Learning for Neural Networks
  • Blundell et al., Weight Uncertainty in Neural Networks
  • Blundell et al., Bayesian Recurrent Neural Networks
  • Gal et al., What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?
  • Hernandez-Lobato et al., Black-Box alpha-Divergence Minimization
  • Roeder et al., Sticking the Landing: Simple, Lower-Variance Gradient Estimators for Variational Inference
  • Lakshminarayanan et al., Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles
  • Vetrov et al., Structured Bayesian Pruning via Log-Normal Multiplicative Noise
  • Rezende et al., Neural Processes


  • Hassabis et al., Neuroscience-Inspired Artificial Intelligence
  • Tenenbaum et al., Building machines that learn and think like people
  • Doll et al., The ubiquity of model-based reinforcement learning
  • Moser et al., Place cells, grid cells, and the brain's spatial representation system
  • Niv et al., Reinforcement learning in the brain
