Are you interested in the emerging field of intelligent decision-making? Our graduate seminar covers Sutton and Barto’s classic “Reinforcement Learning: An Introduction” at a level accessible to physics students.

This seminar provides an in-depth overview of methods for solving sequential decision-making problems, from Markov Decision Processes to Deep Q-Networks. Participants will gain hands-on experience with the computational techniques and experiment with learning from trial-and-error interaction. We will connect theoretical ideas to real-world applications, demonstrating how reinforcement learning can benefit physics research and how it forms the basis for reasoning in large language models.


The first meeting will take place on Wednesday, April 9, 2025, at 15:15 in SR. 114 (ITP).

Talks

All talks take place in SR. 114 (Brüderstraße 16).


Wednesday 14.05.2025
15:15
Gemma Ghezzi
Multi-Armed Bandits: Foundations of Exploration–Exploitation
Slides
Wednesday 21.05.2025
15:15
Aleksandre Shukakidze
Foundations of MDPs and Dynamic Programming
Slides
Wednesday 28.05.2025
15:15
Jaekyu Yoon
Temporal-Difference Learning and Q-Learning
Slides
Wednesday 04.06.2025
15:15
Robin Bahl
Policy Gradient Methods and Actor-Critic Algorithms
Slides
Wednesday 11.06.2025
15:15
Luca Battiston
Deep Reinforcement Learning and DQN
Slides
Wednesday 18.06.2025
15:00
Cristian Duvan Sierra Diaz
Model-Based RL: Planning with Learned Models
Slides
Wednesday 18.06.2025
16:00
Zhen Liu
Learning Reward Functions from Human Preferences
Slides
Wednesday 02.07.2025
15:15
Oleksandr Kozak
RLHF: Fine-Tuning Language Models with Human Feedback
Slides
Wednesday 09.07.2025
15:15
Jiyu Huo
Hierarchical Reinforcement Learning and Temporal Abstraction
Slides

Course Information

Recommended Reading: Richard S. Sutton and Andrew G. Barto, "Reinforcement Learning: An Introduction", 2nd edition, The MIT Press, Cambridge, Massachusetts, 2018.

Topics

1. Multi-Armed Bandits: Foundations of Exploration–Exploitation

Description
Multi-armed bandit problems are a simplified setting of reinforcement learning that illustrate the crucial trade-off between exploration and exploitation. Instead of dealing with sequential states, an agent repeatedly chooses from a set of actions (the “arms”), aiming to maximize reward over time. This topic provides an accessible entry point for students to see how foundational ideas (like optimism in the face of uncertainty, or the upper confidence bound approach) are derived and analyzed. It also connects naturally to more complex RL settings that introduce state-based dynamics.
Key Sources
Sutton, R. & Barto, A. (2018). Reinforcement Learning: An Introduction, Chapter 2 (Multi-armed Bandits).
Auer, P., Cesa-Bianchi, N., & Fischer, P. (2002). “Finite-time Analysis of the Multiarmed Bandit Problem,” Machine Learning (introduces the UCB1 algorithm).
Suggested Programming Task
Implement and compare two or three classic bandit algorithms (e.g., ϵ-Greedy, UCB, Thompson Sampling) on a synthetic K-armed bandit problem. Show how each algorithm balances exploration vs. exploitation, and visualize cumulative reward over time to demonstrate how more sophisticated methods can converge faster.
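Code Sketch
A minimal sketch of this task, assuming only NumPy; the arm means, horizon, and hyperparameters (eps, c) are illustrative choices, and Thompson Sampling is omitted for brevity.

import numpy as np

rng = np.random.default_rng(0)
K, steps = 10, 2000
true_means = rng.normal(0, 1, K)            # hidden reward means of the arms

def run(policy, eps=0.1, c=2.0):
    Q = np.zeros(K)                          # value estimates
    N = np.zeros(K)                          # pull counts
    rewards = np.zeros(steps)
    for t in range(steps):
        if policy == "eps_greedy":
            a = rng.integers(K) if rng.random() < eps else int(np.argmax(Q))
        else:                                # UCB1: optimism in the face of uncertainty
            if np.any(N == 0):
                a = int(np.argmin(N))        # try each arm once first
            else:
                a = int(np.argmax(Q + c * np.sqrt(np.log(t + 1) / N)))
        r = rng.normal(true_means[a], 1.0)   # sample a reward from the chosen arm
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]            # incremental sample-average update
        rewards[t] = r
    return np.cumsum(rewards)                # cumulative reward for plotting

print("eps-greedy total reward:", run("eps_greedy")[-1])
print("UCB1 total reward:      ", run("ucb")[-1])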

2. Foundations of MDPs and Dynamic Programming

Description
Introduces Markov Decision Processes (MDPs) as the formal framework for RL, defined by a 5-tuple (S, A, P, R, γ) representing states, actions, transition dynamics, rewards, and discount. Covers how Bellman equations underpin optimal value functions and policies. Explains dynamic programming solutions (like value iteration and policy iteration) for solving MDPs when the model is known. This topic builds the basis for understanding how optimal decisions are defined in RL.
Key Sources
Sutton & Barto (2018), Reinforcement Learning: An Introduction, Ch. 3; David Silver’s RL Course – Lecture 2: MDPs
Suggested Programming Task
Implement value iteration or policy iteration on a simple gridworld MDP to compute an optimal policy and value function.
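Code Sketch
A minimal value-iteration sketch, assuming only NumPy; the 4x4 layout, step reward of -1, and discount of 0.9 are illustrative choices.

import numpy as np

n, gamma, theta = 4, 0.9, 1e-6
actions = [(-1, 0), (1, 0), (0, -1), (0, 1)]        # up, down, left, right
goal = (n - 1, n - 1)

def step(state, a):
    if state == goal:
        return state, 0.0                           # absorbing terminal state
    r = max(0, min(n - 1, state[0] + a[0]))
    c = max(0, min(n - 1, state[1] + a[1]))
    return (r, c), -1.0                             # -1 per step favours short paths

V = np.zeros((n, n))
while True:                                         # value-iteration sweeps until convergence
    delta = 0.0
    for i in range(n):
        for j in range(n):
            v_old = V[i, j]
            V[i, j] = max(r + gamma * V[s2] for s2, r in (step((i, j), a) for a in actions))
            delta = max(delta, abs(v_old - V[i, j]))
    if delta < theta:
        break

greedy = np.array([[int(np.argmax([step((i, j), a)[1] + gamma * V[step((i, j), a)[0]]
                                   for a in actions])) for j in range(n)] for i in range(n)])
print(V.round(2))
print(greedy)                                       # 0=up, 1=down, 2=left, 3=right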

3. Temporal-Difference Learning and Q-Learning

Description
Explores model-free reinforcement learning algorithms that learn value functions from experience. Focus on Q-Learning, a seminal off-policy TD control method that updates action-value estimates toward the Bellman optimality target. Introduces the Q-learning update rule and how it provably converges to the optimal Q-function under certain conditions (Watkins & Dayan, 1992). Also touches on concepts of exploration vs. exploitation and the ε-greedy strategy in the context of Q-learning. This topic provides a foundation for understanding value-based learning without a model of the environment.
Key Sources
Sutton & Barto (2018), Ch. 6 (TD Learning); Watkins & Dayan (1992) – Q-learning algorithm
Suggested Programming Task
Implement the Q-learning algorithm on a small episodic task (e.g. Cliff Walking or a gridworld) and demonstrate that the agent learns the optimal policy over the course of training.
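Code Sketch
A minimal tabular Q-learning sketch on a small gridworld, assuming only NumPy; grid size, rewards, and hyperparameters are illustrative, and Cliff Walking would only change the step function.

import numpy as np

rng = np.random.default_rng(1)
n = 4                                        # 4x4 grid, states 0..15, goal in the bottom-right
goal = n * n - 1
moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]
alpha, gamma, eps, episodes = 0.5, 0.95, 0.1, 500
Q = np.zeros((n * n, 4))

def step(s, a):
    r, c = divmod(s, n)
    r2 = max(0, min(n - 1, r + moves[a][0]))
    c2 = max(0, min(n - 1, c + moves[a][1]))
    s2 = r2 * n + c2
    return s2, (0.0 if s2 == goal else -1.0), s2 == goal

for _ in range(episodes):
    s, done = 0, False
    while not done:
        a = rng.integers(4) if rng.random() < eps else int(np.argmax(Q[s]))   # eps-greedy action
        s2, r, done = step(s, a)
        # Q-learning: update toward the Bellman optimality target
        Q[s, a] += alpha * (r + gamma * (0.0 if done else np.max(Q[s2])) - Q[s, a])
        s = s2

print(np.argmax(Q, axis=1).reshape(n, n))    # greedy action per cell (0=up, 1=down, 2=left, 3=right)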

4. Policy Gradient Methods and Actor-Critic Algorithms

Description
Covers policy-based reinforcement learning, where the policy is optimized directly via gradient ascent on expected return. Introduces the policy gradient theorem, which provides an efficient way to compute the policy’s gradient without modeling the state distribution. Discusses the REINFORCE algorithm (Monte Carlo policy gradient) and enhancements like baselines to reduce variance. The Actor-Critic approach is explained as combining a learned value function (critic) with a policy (actor) to stabilize and speed up learning. Modern policy gradient algorithms such as Proximal Policy Optimization (PPO) are mentioned as practical improvements that constrain policy updates for stability.
Key Sources
Sutton & Barto (2018), Ch. 13 (Policy Gradient); Schulman et al. (2017) – PPO algorithm
Suggested Programming Task
Implement a basic policy gradient method (e.g. REINFORCE) on a simple control problem like CartPole. Extend it to an actor-critic algorithm and compare learning speed or stability, possibly reproducing the improvement from adding a critic.
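Code Sketch
A minimal REINFORCE sketch for CartPole, assuming gymnasium and PyTorch are installed; the network size, learning rate, and return normalization (a crude baseline) are illustrative, and the actor-critic extension is left to the talk.

import gymnasium as gym
import torch
import torch.nn as nn

env = gym.make("CartPole-v1")
policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))    # logits over 2 actions
opt = torch.optim.Adam(policy.parameters(), lr=1e-2)
gamma = 0.99

for episode in range(300):
    obs, _ = env.reset()
    log_probs, rewards, done = [], [], False
    while not done:
        dist = torch.distributions.Categorical(logits=policy(torch.as_tensor(obs, dtype=torch.float32)))
        action = dist.sample()
        obs, r, terminated, truncated, _ = env.step(int(action))
        done = terminated or truncated
        log_probs.append(dist.log_prob(action))
        rewards.append(r)
    # compute discounted returns backwards through the episode
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    returns = torch.tensor(list(reversed(returns)))
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)         # variance reduction
    loss = -(torch.stack(log_probs) * returns).sum()                      # policy-gradient loss
    opt.zero_grad(); loss.backward(); opt.step()
    if episode % 50 == 0:
        print(episode, sum(rewards))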

5. Deep Reinforcement Learning and DQN

Description
Explores how neural networks are used as function approximators in RL. The seminal Deep Q-Network (DQN) algorithm is introduced, which was the first to achieve human-level performance on Atari 2600 games by combining Q-learning with deep convolutional neural networks. Key innovations of DQN are explained: experience replay (to break temporal correlations in training data) and a target network (to stabilize the moving Q-value target). This topic highlights how deep RL addresses large state spaces and discusses the impact of DQN (and its variants) on the field.
Key Sources
Mnih et al. (2015) – Human-level control through deep RL; IJCAI’19 article on DQN improvements
Suggested Programming Task
Implement a simplified DQN on a classic control task (e.g. CartPole or Lunar Lander). Demonstrate how experience replay and target networks improve stability, and possibly reproduce a learning curve comparable to the literature.
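Code Sketch
A simplified DQN sketch for CartPole, assuming gymnasium and PyTorch; the network, buffer size, constant epsilon, and target-update interval are illustrative rather than the settings of Mnih et al.

import random
from collections import deque
import numpy as np
import gymnasium as gym
import torch
import torch.nn as nn

env = gym.make("CartPole-v1")
def make_net():
    return nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
q_net, target_net = make_net(), make_net()
target_net.load_state_dict(q_net.state_dict())           # target network starts as a copy
opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)
buffer = deque(maxlen=10_000)                             # experience replay memory
gamma, eps, batch_size = 0.99, 0.1, 64

for episode in range(300):
    obs, _ = env.reset()
    done, total = False, 0.0
    while not done:
        if random.random() < eps:
            action = int(env.action_space.sample())
        else:
            with torch.no_grad():
                action = int(q_net(torch.as_tensor(obs, dtype=torch.float32)).argmax())
        next_obs, r, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        buffer.append((obs, action, r, next_obs, float(terminated)))
        obs, total = next_obs, total + r
        if len(buffer) >= batch_size:                     # learn from a random minibatch
            batch = random.sample(buffer, batch_size)
            s = torch.as_tensor(np.array([b[0] for b in batch]), dtype=torch.float32)
            a = torch.as_tensor([b[1] for b in batch])
            rew = torch.as_tensor([b[2] for b in batch], dtype=torch.float32)
            s2 = torch.as_tensor(np.array([b[3] for b in batch]), dtype=torch.float32)
            d = torch.as_tensor([b[4] for b in batch], dtype=torch.float32)
            q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
            with torch.no_grad():                         # bootstrap from the frozen target network
                target = rew + gamma * (1 - d) * target_net(s2).max(1).values
            loss = nn.functional.mse_loss(q, target)
            opt.zero_grad(); loss.backward(); opt.step()
    if episode % 10 == 0:
        target_net.load_state_dict(q_net.state_dict())    # periodic target update
        print(episode, total)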

6. Model-Based RL: Planning with Learned Models

Description
Model-based RL integrates planning (using a model of the environment) with learning from experience. Unlike model-free methods, which directly learn value functions or policies, model-based RL first learns or is given a transition/reward model of the environment, then uses that model to simulate future interactions and plan optimal actions. This topic focuses on the Dyna architecture (Sutton, 1991) as a foundational approach: the agent learns a world model from real experience and then improves its policy by “imagining” simulated experiences. This can greatly improve data efficiency, as the agent is not limited to learning only from real trajectories.
Key Sources
Sutton, R. & Barto, A. (2018). Reinforcement Learning: An Introduction, Chapter 8 (Planning and Learning with Tabular Methods).
Sutton, R. S. (1991). “Dyna, an Integrated Architecture for Learning, Planning, and Reacting,” SIGART Bulletin.
Suggested Programming Task
Implement a simple Dyna-Q setup on a small gridworld or maze environment. Let the agent learn the environment’s transition model from real experience, then run simulated trajectories (planning updates) to accelerate policy learning. Compare performance (number of steps to reach the goal) with a purely model-free Q-learning agent, demonstrating that planning often yields faster convergence.
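Code Sketch
A minimal Dyna-Q sketch, assuming only NumPy; the gridworld, the number of planning updates per real step, and the other hyperparameters are illustrative. Setting planning_steps = 0 recovers the purely model-free Q-learning baseline for comparison.

import numpy as np

rng = np.random.default_rng(2)
n = 4
goal = n * n - 1
moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]
alpha, gamma, eps, planning_steps = 0.5, 0.95, 0.1, 20
Q = np.zeros((n * n, 4))
model = {}                                    # (s, a) -> (r, s2), learned from real experience

def step(s, a):
    r, c = divmod(s, n)
    r2 = max(0, min(n - 1, r + moves[a][0]))
    c2 = max(0, min(n - 1, c + moves[a][1]))
    s2 = r2 * n + c2
    return (0.0 if s2 == goal else -1.0), s2

for episode in range(100):
    s = 0
    while s != goal:
        a = rng.integers(4) if rng.random() < eps else int(np.argmax(Q[s]))
        r, s2 = step(s, a)
        Q[s, a] += alpha * (r + gamma * np.max(Q[s2]) - Q[s, a])      # direct RL update
        model[(s, a)] = (r, s2)                                       # model learning
        for _ in range(planning_steps):                               # planning from simulated experience
            (ps, pa), (pr, ps2) = list(model.items())[rng.integers(len(model))]
            Q[ps, pa] += alpha * (pr + gamma * np.max(Q[ps2]) - Q[ps, pa])
        s = s2

print(np.argmax(Q, axis=1).reshape(n, n))     # greedy action per cell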

7. Self-Play and Strategy Games: AlphaGo to AlphaZero

Description
Studies how reinforcement learning, combined with search techniques, achieves superhuman performance in complex games. Focus on AlphaGo/AlphaZero, which learned to play Go (and chess/shogi) at championship level through self-play. AlphaGo used deep neural networks to evaluate game states and select moves, trained first on human expert data and then refined by reinforcement learning from self-play. The later AlphaGo Zero method improved by training tabula rasa (no human data), making the system its own teacher – iteratively improving via self-play and Monte Carlo Tree Search (MCTS) to plan moves. This topic explains the architecture of these systems (policy/value networks + MCTS) and how self-play reinforcement learning can outperform human strategic play.
Key Sources
Silver et al. (2017), “Mastering the game of Go without human knowledge” (Nature)
Suggested Programming Task
Develop a miniature self-play RL project for a simpler game (e.g. Tic-Tac-Toe or Connect-4). Implement a basic MCTS and a neural network that learns to evaluate positions, and have the agent play against itself to improve. Observe whether self-play yields an agent that cannot be easily beaten by a human in that simple domain.
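Code Sketch
A heavily reduced self-play sketch for Tic-Tac-Toe using only the standard library: tabular afterstate values are learned from self-play, in the spirit of the project but without the MCTS and policy/value network of AlphaZero; step size and game count are arbitrary choices.

import random

V = {}                                         # board (as a tuple) -> value estimate for player "X"
alpha, eps = 0.2, 0.1
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6), (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]

def winner(b):
    for i, j, k in LINES:
        if b[i] != " " and b[i] == b[j] == b[k]:
            return b[i]
    return "draw" if " " not in b else None

def value(b):
    w = winner(b)
    if w == "X": return 1.0
    if w == "O": return 0.0
    if w == "draw": return 0.5
    return V.setdefault(tuple(b), 0.5)

def choose(board, player):
    moves = [i for i, c in enumerate(board) if c == " "]
    if random.random() < eps:
        return random.choice(moves)
    # "X" maximizes the value of the resulting board, "O" minimizes it
    score = lambda m: value(board[:m] + [player] + board[m + 1:])
    return (max if player == "X" else min)(moves, key=score)

for _ in range(50_000):                        # self-play training games
    board, player, history = [" "] * 9, "X", []
    while winner(board) is None:
        m = choose(board, player)
        board[m] = player
        history.append(tuple(board))
        player = "O" if player == "X" else "X"
    # back up the final outcome through the visited afterstates (TD-style)
    target = value(history[-1])
    for state in reversed(history[:-1]):
        V[state] = V.get(state, 0.5) + alpha * (target - V.get(state, 0.5))
        target = V[state]

print("learned states:", len(V))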

8. Learning Reward Functions from Human Preferences

Description
Examines techniques for deriving a reward signal from human feedback instead of an explicit reward function. Introduces the concept of reward modeling or inverse reinforcement learning, where human preferences between different behavior trajectories are used to learn a reward function that the agent can then optimize. A key work is by Christiano et al. (2017), who showed that an agent can learn complex tasks (Atari games, simulated robotics) with high success by optimizing for a reward model trained on pairwise preferences from non-expert humans. This approach reduces the need for manual reward design and allows communicating complex goals through comparisons. The topic provides a foundation for understanding how human feedback can be integrated into the RL loop, which is crucial for aligning AI behavior with human intentions.
Key Sources
Christiano et al. (2017), “Deep RL from Human Preferences”
Ng & Russell (2000) – Inverse RL formulation.
Suggested Programming Task
Reproduce a simple preference-based learning setup – for example, train a reward model for an agent by asking a user to choose the better of two short trajectories on a simple task (like cart-pole balancing or maze navigation). Use the learned reward model to train the agent’s policy and compare its performance to one trained with a hand-crafted reward.
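Code Sketch
A minimal reward-modeling sketch, assuming PyTorch; a hand-coded hidden preference stands in for the human comparing two trajectories, and all dimensions, trajectory lengths, and network sizes are illustrative.

import torch
import torch.nn as nn

torch.manual_seed(0)
obs_dim, traj_len = 4, 20
true_w = torch.randn(obs_dim)                        # hidden "human" preference direction
reward_model = nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh(), nn.Linear(32, 1))
opt = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

def human_prefers_first(t1, t2):                     # stand-in for asking a real user
    return ((t1 @ true_w).sum() > (t2 @ true_w).sum()).item()

for step in range(2000):
    t1, t2 = torch.randn(traj_len, obs_dim), torch.randn(traj_len, obs_dim)
    pref = 1.0 if human_prefers_first(t1, t2) else 0.0
    r1, r2 = reward_model(t1).sum(), reward_model(t2).sum()
    # Bradley-Terry style preference loss, as in Christiano et al. (2017)
    p_first = torch.sigmoid(r1 - r2)
    loss = -(pref * torch.log(p_first + 1e-8) + (1 - pref) * torch.log(1 - p_first + 1e-8))
    opt.zero_grad(); loss.backward(); opt.step()

with torch.no_grad():                                # does the learned reward agree with the hidden one?
    hits = 0
    for _ in range(500):
        t1, t2 = torch.randn(traj_len, obs_dim), torch.randn(traj_len, obs_dim)
        hits += int((reward_model(t1).sum() > reward_model(t2).sum()).item() == human_prefers_first(t1, t2))
    print("agreement with hidden preference:", hits / 500)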

9. RLHF: Fine-Tuning Language Models with Human Feedback

Description
Focuses on Reinforcement Learning from Human Feedback (RLHF), a technique to align large language models (LLMs) with human preferences. The process involves three steps: (1) gather human demonstrations or rankings of model outputs, (2) train a reward model to predict human preference, and (3) fine-tune the LLM using an RL algorithm (often PPO) to maximize this learned reward. A key example is OpenAI’s InstructGPT, where a 1.3B-parameter GPT-3 model was fine-tuned with human feedback to follow instructions better. Remarkably, the resulting model produced outputs preferred by users over a 175B-parameter original GPT-3, with gains in truthfulness and reduction of toxic content. This topic delves into how RLHF addresses the alignment problem, discussing the challenges of stability (e.g. preventing reward hacking) and the improvements in model behavior from this fine-tuning approach.
Key Sources
Ouyang et al. (2022), “Training language models to follow instructions with human feedback” (InstructGPT)
Suggested Programming Task
As a mini demonstration of RLHF, take a pre-trained language model (e.g. GPT-2 small) and a proxy reward function (such as a sentiment classifier for output positivity). Implement a PPO fine-tuning loop to train the model to produce outputs that maximize the reward (for example, make the language model output more positive responses). Evaluate how the model’s outputs change in alignment with the given reward signal.
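Code Sketch
A deliberately tiny stand-in for the RLHF loop, assuming PyTorch: a categorical "language model" over a five-word vocabulary is optimized against a proxy sentiment reward using a REINFORCE-style update plus a KL penalty toward the frozen reference policy. In the actual task, PPO and a real pretrained model (e.g. GPT-2 via a library such as trl) would replace these pieces; everything below is illustrative.

import torch
import torch.nn as nn

torch.manual_seed(0)
vocab = ["great", "good", "okay", "bad", "awful"]
proxy_reward = torch.tensor([1.0, 0.5, 0.0, -0.5, -1.0])     # stand-in for a sentiment reward model

logits = nn.Parameter(torch.zeros(len(vocab)))                # the "policy" being fine-tuned
ref_logits = torch.zeros(len(vocab))                          # frozen reference policy
opt = torch.optim.Adam([logits], lr=0.05)
beta = 0.1                                                    # KL-penalty weight

for step in range(500):
    dist = torch.distributions.Categorical(logits=logits)
    ref_dist = torch.distributions.Categorical(logits=ref_logits)
    tokens = dist.sample((64,))                               # a batch of one-token "outputs"
    # reward = proxy reward minus a penalty for drifting from the reference model
    kl_term = (dist.log_prob(tokens) - ref_dist.log_prob(tokens)).detach()
    reward = proxy_reward[tokens] - beta * kl_term
    loss = -(dist.log_prob(tokens) * (reward - reward.mean())).mean()
    opt.zero_grad(); loss.backward(); opt.step()

with torch.no_grad():
    probs = torch.softmax(logits, dim=0)
    print({w: round(float(p), 3) for w, p in zip(vocab, probs)})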

10. Hierarchical Reinforcement Learning and Temporal Abstraction

Description
Hierarchical RL addresses the challenge of long-horizon tasks by introducing temporal abstractions—sub-policies (often called options or skills) that operate over extended time scales. The Options Framework (Sutton, Precup & Singh, 1999) formalizes how these sub-policies can be discovered or designed and how an agent can plan or learn at multiple levels of abstraction. This drastically reduces the effective depth of decision-making, as high-level decisions can invoke entire sequences of low-level actions. This topic delves into how hierarchical structures can speed up learning, improve transfer across tasks, and reflect intuitive structures akin to human planning.
Key Sources
Sutton, R., Precup, D., & Singh, S. (1999). “Between MDPs and Semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning,” Artificial Intelligence.
Barto, A. & Mahadevan, S. (2003). “Recent Advances in Hierarchical Reinforcement Learning,” Discrete Event Dynamic Systems.
Suggested Programming Task
Implement a basic hierarchical RL approach in a multi-room gridworld (each “room” requires local navigation). Define sub-policies (options) for room-level navigation, then a high-level policy that selects which sub-policy to invoke. Compare learning speed with a flat RL agent. If time permits, attempt an option-discovery heuristic that identifies subgoals (e.g., doorways between rooms) to highlight the benefit of temporal abstraction.
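Code Sketch
A minimal temporal-abstraction sketch, assuming only NumPy, on a 1D corridor instead of the multi-room gridworld: SMDP Q-learning chooses between two hand-coded options ("walk five steps left" and "walk five steps right"), and the k-step option return is discounted by gamma**k. Corridor length, option duration, and hyperparameters are arbitrary choices.

import numpy as np

rng = np.random.default_rng(3)
N, goal = 30, 29                          # corridor states 0..29, start at state 0
alpha, gamma, eps = 0.3, 0.95, 0.1
options = [(-1, 5), (+1, 5)]              # (direction, duration) of each hand-coded option
Q = np.zeros((N, len(options)))

def run_option(s, o):
    """Execute an option; return next state, accumulated discounted reward, and duration."""
    direction, duration = options[o]
    R, k = 0.0, 0
    while k < duration and s != goal:     # options terminate early at the goal
        s = max(0, min(N - 1, s + direction))
        R += (gamma ** k) * (0.0 if s == goal else -1.0)
        k += 1
    return s, R, k

for episode in range(300):
    s = 0
    while s != goal:
        o = rng.integers(len(options)) if rng.random() < eps else int(np.argmax(Q[s]))
        s2, R, k = run_option(s, o)
        # SMDP Q-learning: bootstrap discounted by gamma**k for a k-step option
        target = R + (gamma ** k) * (0.0 if s2 == goal else np.max(Q[s2]))
        Q[s, o] += alpha * (target - Q[s, o])
        s = s2

print(np.argmax(Q, axis=1))               # the "walk right" option (index 1) should dominate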

11. Reinforcement Learning for Reasoning in LLMs

Description
Explores cutting-edge research on using RL to improve the reasoning and problem-solving capabilities of large language models. This topic is exemplified by recent work like “Teaching Large Language Models to Reason with Reinforcement Learning” (Havrilla et al., 2024), which investigates applying various RL algorithms to enhance an LLM on complex reasoning tasks. Different reward schemes (sparse rewards for correct answers, dense rewards for intermediate steps, etc.) and algorithms (such as Proximal Policy Optimization versus an Expert-Iteration approach) are compared for their efficacy in improving model reasoning. The findings shed light on how far RL can push reasoning skills beyond what supervised training provides – for instance, one result noted that an expert-iteration method slightly outperformed PPO in sample efficiency for certain reasoning benchmarks. This talk covers the challenges of defining the reasoning task as an MDP (with the LLM’s generated tokens or thoughts as actions) and highlights the potential and limitations of RL in this domain.
Key Sources
Havrilla et al. (2024), “Teaching Large Language Models to Reason with RL”
Suggested Programming Task
If resources permit, set up a small-scale experiment of RL for reasoning – for example, train a smaller language model or symbolic agent to solve arithmetic or logic puzzles via trial-and-error, using correctness as a reward signal. Monitor whether an RL approach (perhaps with a form of search or lookahead) can improve the solution rate compared to supervised learning on the same puzzles. (This task is challenging and optional, intended to illustrate the research idea rather than fully reproduce it.)
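Code Sketch
A deliberately tiny stand-in for "correctness as the only reward signal", assuming PyTorch: a tabular policy learns single-digit addition purely from 0/1 rewards via REINFORCE, with no supervised answer labels. This only illustrates the idea; the LLM setting, search/lookahead, and expert iteration are beyond a sketch.

import torch
import torch.nn as nn

torch.manual_seed(0)
# one row of logits per question (a, b); the "actions" are candidate answers 0..18
logits = nn.Parameter(torch.zeros(10, 10, 19))
opt = torch.optim.Adam([logits], lr=0.1)

for step in range(20_000):
    a, b = torch.randint(10, (1,)).item(), torch.randint(10, (1,)).item()
    dist = torch.distributions.Categorical(logits=logits[a, b])
    answer = dist.sample()
    reward = 1.0 if answer.item() == a + b else 0.0       # correctness is the only feedback
    loss = -dist.log_prob(answer) * (reward - 0.5)        # REINFORCE with a constant baseline
    opt.zero_grad(); loss.backward(); opt.step()

with torch.no_grad():
    greedy = logits.argmax(dim=-1)
    accuracy = sum(int(greedy[a, b] == a + b) for a in range(10) for b in range(10)) / 100
    print("greedy accuracy:", accuracy)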
