Reinforcement Learning Fundamentals

Jun 21, 2026

Reinforcement Learning (RL) is one of the three paradigms of machine learning, alongside supervised and unsupervised learning, and arguably the one that most closely resembles how humans and animals actually learn. Rather than being told the correct answer, an RL agent discovers which actions yield the greatest reward by exploring its environment and learning from the consequences of its actions.

At its core, RL formalizes a simple idea: rewarding desired behaviors and punishing undesired ones causes an agent to shift its behavior toward the rewarded actions over time. This feedback loop between an agent and its environment creates a powerful learning framework that has produced breakthrough results in game playing (AlphaGo), robotics, autonomous vehicles, and resource management.

The fundamental challenge in RL is the exploration-exploitation tradeoff: the agent must exploit what it already knows to obtain reward, but it must also explore unknown actions to discover potentially better strategies in the future.

The diagram above illustrates the core RL loop: at each timestep, the agent observes a state, selects an action, and receives a reward and a new state from the environment.
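The exploration-exploitation tradeoff described above can be made concrete with a multi-armed bandit. The following is a minimal sketch, not a reference implementation: the helper names `epsilon_greedy` and `run_bandit`, the Gaussian reward model, and the arm means are all assumptions chosen for illustration.

```python
import random

def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon explore (random action); otherwise exploit."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    best = max(q_values)
    # Break ties randomly so no action is favoured by its index.
    return rng.choice([a for a, q in enumerate(q_values) if q == best])

def run_bandit(true_means, epsilon, steps, seed=0):
    """Play a Gaussian bandit (assumed reward model) and return estimates and pull counts."""
    rng = random.Random(seed)
    q = [0.0] * len(true_means)       # estimated value of each arm
    counts = [0] * len(true_means)    # how often each arm was pulled
    for _ in range(steps):
        arm = epsilon_greedy(q, epsilon, rng)
        reward = rng.gauss(true_means[arm], 1.0)
        counts[arm] += 1
        q[arm] += (reward - q[arm]) / counts[arm]   # incremental sample mean
    return q, counts

estimates, pulls = run_bandit([0.1, 0.5, 0.9], epsilon=0.1, steps=5000)
print("estimates:", [round(v, 2) for v in estimates])
print("pulls per arm:", pulls)
```

With `epsilon=0` the agent never explores and can lock onto a poor arm; with `epsilon=1` it never uses what it has learned. Intermediate values trade a small amount of reward for better estimates.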

Footnotes

  1. Reinforcement Learning Basics - SmythOS - Core RL concepts including agent-environment interaction and algorithm categories.

  2. An Introduction to Reinforcement Learning: Fundamental Concepts and Practical Applications - arXiv - Comprehensive survey of RL fundamentals, MDPs, policy iteration, and value iteration.

  3. On-Policy vs Off-Policy Reinforcement Learning - CORE Robotics Lab, Georgia Tech - Detailed comparison of on-policy and off-policy methods with SARSA vs Q-learning analysis.

Reinforcement Learning: Essential Concepts

The Agent–Environment Interface

The RL framework is built on the interaction between two entities:

  • Agent: The learner or decision-maker. It observes the environment's state and selects actions according to a Policy.
  • Environment: Everything external to the agent. It receives actions, transitions to new states, and emits reward signals.

This interface is modeled as a finite Markov Decision Process (MDP), which provides the formal foundation for nearly all RL problems.

Key Components of an MDP

An MDP is defined by the tuple $\langle \mathcal{S}, \mathcal{A}, P, R, \gamma \rangle$:

| Component | Symbol | Description |
| --- | --- | --- |
| State space | $\mathcal{S}$ | Set of all possible states |
| Action space | $\mathcal{A}$ | Set of all possible actions |
| Transition probability | $P(s' \mid s, a)$ | Probability of transitioning to $s'$ given $s$ and $a$ |
| Reward function | $R(s, a, s')$ | Immediate reward received after the transition |
| Discount factor | $\gamma \in [0, 1)$ | Determines the present value of future rewards |

The Markov Property is the critical assumption: the future is independent of the past given the present. Formally:

$$P(S_{t+1} \mid S_t, A_t) = P(S_{t+1} \mid S_0, A_0, \ldots, S_t, A_t)$$

This "memorylessness" is what makes RL problems tractable: the agent need not remember the entire history of interactions.
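To make the tuple concrete, here is one way a small MDP might be written down in code. The two-state "cool/hot" problem, its probabilities, and its rewards are hypothetical values invented for illustration; only the structure mirrors the definition above. Note that `step` looks only at the current state and action, which is exactly the Markov property.

```python
import random

# Hypothetical toy MDP. P[s][a] is a list of (probability, next_state, reward).
STATES = ["cool", "hot"]
ACTIONS = ["slow", "fast"]
GAMMA = 0.9
P = {
    "cool": {
        "slow": [(1.0, "cool", 1.0)],
        "fast": [(0.5, "cool", 2.0), (0.5, "hot", 2.0)],
    },
    "hot": {
        "slow": [(0.5, "cool", 1.0), (0.5, "hot", 1.0)],
        "fast": [(1.0, "hot", -10.0)],
    },
}

def is_valid_mdp(P):
    """Check that the outcome probabilities of every (state, action) pair sum to 1."""
    return all(
        abs(sum(prob for prob, _, _ in outcomes) - 1.0) < 1e-9
        for by_action in P.values()
        for outcomes in by_action.values()
    )

def step(state, action, rng):
    """Sample (next_state, reward) from P(. | state, action); no history is needed."""
    outcomes = P[state][action]
    u, cumulative = rng.random(), 0.0
    for prob, next_state, reward in outcomes:
        cumulative += prob
        if u < cumulative:
            return next_state, reward
    return outcomes[-1][1], outcomes[-1][2]   # guard against rounding error

rng = random.Random(0)
print(is_valid_mdp(P), step("cool", "fast", rng))
```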

Footnotes

  1. Reinforcement Learning Basics - SmythOS - Core RL concepts including agent-environment interaction and algorithm categories.

  2. Part 1: Key Concepts in RL - OpenAI Spinning Up - OpenAI's authoritative introduction to RL terminology, notation, and core mathematical framework.

Rewards, Returns, and Value Functions

The Reward Signal defines the goal of the RL problem. However, the agent's true objective is not to maximize immediate reward, but to maximize the expected cumulative return over time.

The return $G_t$ is defined as the discounted sum of future rewards:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$

The Discount Factor $\gamma$ controls the agent's horizon: values near 0 make the agent myopic (short-sighted), while values near 1 make the agent far-sighted.
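A short sketch of how the return can be computed for a finite reward sequence, using the recursion $G_t = R_{t+1} + \gamma G_{t+1}$ that follows directly from the sum above. The reward sequence is made up for illustration; it puts a large reward at the end so the effect of $\gamma$ is visible.

```python
def discounted_return(rewards, gamma):
    """Compute G_t for rewards [R_{t+1}, R_{t+2}, ...] by working backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [1.0, 1.0, 1.0, 10.0]              # illustrative values only
myopic = discounted_return(rewards, 0.1)     # 1 + 0.1 + 0.01 + 0.001 * 10 = 1.12
far_sighted = discounted_return(rewards, 0.9)  # 1 + 0.9 + 0.81 + 0.729 * 10 = 10.0
print(round(myopic, 4), round(far_sighted, 4))
```

The myopic agent barely registers the delayed reward of 10, while the far-sighted agent values it at 7.29 of its total 10.0.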

Two types of Value Function are central to RL:

  • State-value function $V_\pi(s)$: Expected return starting from state $s$ and following policy $\pi$:

$$V_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$$

  • Action-value function $Q_\pi(s, a)$: Expected return starting from state $s$, taking action $a$, and then following policy $\pi$:

$$Q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$$

The relationship between them is:

$$V_\pi(s) = \sum_{a} \pi(a \mid s) \, Q_\pi(s, a)$$
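As a quick numeric check of this relationship, the sketch below averages action values under a stochastic policy for a single state. The action names and numbers are hypothetical.

```python
def state_value(policy_probs, action_values):
    """V_pi(s) = sum over a of pi(a|s) * Q_pi(s, a), for one fixed state s."""
    return sum(policy_probs[a] * action_values[a] for a in policy_probs)

pi_s = {"left": 0.25, "right": 0.75}   # pi(a | s), assumed for illustration
q_s = {"left": 2.0, "right": 6.0}      # Q_pi(s, a), assumed for illustration
v_s = state_value(pi_s, q_s)           # 0.25 * 2.0 + 0.75 * 6.0 = 5.0
print(v_s)
```

Because $V_\pi(s)$ is a weighted average of the action values, it can never exceed the largest $Q_\pi(s, a)$.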

The Bellman Equations

The Bellman Equation is arguably the most fundamental equation in RL. It decomposes the value function into two parts: the immediate reward and the discounted value of the successor state.

Bellman Expectation Equation (for $V_\pi$):

$$V_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma V_\pi(s') \right]$$

Bellman Expectation Equation (for $Q_\pi$):

$$Q_\pi(s, a) = \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma \sum_{a'} \pi(a' \mid s') Q_\pi(s', a') \right]$$
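The Bellman expectation equation can be turned into an algorithm by applying its right-hand side repeatedly until the values stop changing (iterative policy evaluation). The sketch below does this for a hypothetical two-state MDP under a uniform random policy; the MDP, the policy, and the tolerance are assumptions for illustration.

```python
GAMMA = 0.9
# Hypothetical toy MDP: P[s][a] = list of (probability, next_state, reward).
P = {
    "cool": {"slow": [(1.0, "cool", 1.0)],
             "fast": [(0.5, "cool", 2.0), (0.5, "hot", 2.0)]},
    "hot":  {"slow": [(0.5, "cool", 1.0), (0.5, "hot", 1.0)],
             "fast": [(1.0, "hot", -10.0)]},
}
policy = {s: {a: 0.5 for a in P[s]} for s in P}   # uniform random policy

def q_value(V, s, a):
    """Sum over s' of P(s'|s,a) * [R(s,a,s') + gamma * V(s')]."""
    return sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a])

def evaluate_policy(policy, tol=1e-10):
    """Sweep the Bellman expectation backup until the largest change is below tol."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            new_v = sum(policy[s][a] * q_value(V, s, a) for a in P[s])
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:
            return V

V = evaluate_policy(policy)
print({s: round(v, 3) for s, v in V.items()})
```

Both values come out negative here because the random policy picks the heavily penalized "fast" action in the "hot" state half of the time.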

The Bellman Optimality Equations express the value under the optimal policy $\pi^*$:

$$V_*(s) = \max_{a} \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma V_*(s') \right]$$

$$Q_*(s, a) = \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma \max_{a'} Q_*(s', a') \right]$$

The key insight: under the optimal policy, the value of a state equals the expected return from taking the best action, so no averaging over a policy is needed.

This backup diagram represents the Bellman equation: the value at state $s$ is computed by looking ahead to all possible next states, weighted by their transition probabilities.
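Replacing the policy average with a maximum over actions gives value iteration, which applies the Bellman optimality backup until convergence. This is a minimal sketch on a hypothetical two-state MDP (the same toy numbers one might use for policy evaluation); the function names and tolerance are assumptions.

```python
GAMMA = 0.9
# Hypothetical toy MDP: P[s][a] = list of (probability, next_state, reward).
P = {
    "cool": {"slow": [(1.0, "cool", 1.0)],
             "fast": [(0.5, "cool", 2.0), (0.5, "hot", 2.0)]},
    "hot":  {"slow": [(0.5, "cool", 1.0), (0.5, "hot", 1.0)],
             "fast": [(1.0, "hot", -10.0)]},
}

def backup(V, s, a):
    """One-step lookahead: expected reward plus discounted value of the successor."""
    return sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a])

def value_iteration(tol=1e-10):
    """Return (V_star, greedy_policy) for the MDP stored in P."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            best = max(backup(V, s, a) for a in P[s])   # max over actions, no policy average
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    greedy = {s: max(P[s], key=lambda a: backup(V, s, a)) for s in P}
    return V, greedy

V_star, pi_star = value_iteration()
print({s: round(v, 3) for s, v in V_star.items()}, pi_star)
```

For these numbers the values can be verified by hand: with "fast" in "cool" and "slow" in "hot", solving the two linear equations gives $V_*(\text{cool}) = 15.5$ and $V_*(\text{hot}) = 14.5$, and neither alternative action improves on them.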

Footnotes

  1. Part 1: Key Concepts in RL - OpenAI Spinning Up - OpenAI's authoritative introduction to RL terminology, notation, and core mathematical framework.

  2. Markov Decision Processes and Bellman Equations - Towards Data Science - Thorough derivation and explanation of the Bellman equations from first principles.

The Reinforcement Learning Solution Process

  1. Define the state space $\mathcal{S}$, action space $\mathcal{A}$, transition probabilities $P$, reward function $R$, and discount factor $\gamma$. This determines whether the problem has discrete or continuous states/actions, which heavily influences algorithm choice.

  2. Decide whether to learn value functions (tabular arrays, neural networks) or directly parameterize the policy. Value-based methods estimate $Q(s, a)$, while policy-based methods directly learn $\pi_\theta(a \mid s)$.

  3. Based on the problem structure, choose between:

    • Dynamic Programming (if the model is known)
    • Monte Carlo methods (if full episodes are available)
    • Temporal Difference learning (bootstrapping from incomplete episodes)
    • Policy Gradient methods (for continuous action spaces)

  4. The agent selects actions (using its current policy, often with exploration via $\varepsilon$-greedy or action noise), observes next states and rewards, and collects experience tuples $(s_t, a_t, r_{t+1}, s_{t+1})$.

  5. Apply the learning update rule specific to the chosen algorithm (e.g., Q-learning update, TD error, or policy gradient ascent). This step uses collected experience to improve the agent's estimates.

  6. Evaluate the learned policy's performance, adjust hyperparameters (learning rate, exploration rate), and repeat the interaction-update loop until convergence or satisfactory performance is achieved.
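The six steps above can be sketched end to end on a tiny problem. Everything specific here is an assumption made for illustration: the five-cell corridor environment, the choice of tabular Q-learning as the algorithm, and the hyperparameter values. Comments mark where each step appears.

```python
import random

# Step 1: define the MDP (hypothetical 5-cell corridor; reaching cell 4 ends the episode).
N_STATES, GOAL = 5, 4
ACTIONS = [0, 1]          # 0 = left, 1 = right
GAMMA = 0.9

def env_step(state, action):
    """Deterministic transition: returns (next_state, reward, done)."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == GOAL
    return next_state, (1.0 if done else 0.0), done

def greedy_action(Q, state, rng):
    """Greedy action with random tie-breaking."""
    best = max(Q[state])
    return rng.choice([a for a in ACTIONS if Q[state][a] == best])

def train(episodes=500, alpha=0.5, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    # Step 2: represent knowledge as a tabular action-value function Q[s][a].
    Q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _episode in range(episodes):
        state = 0
        for _t in range(100):                      # cap on episode length
            # Step 4: act epsilon-greedily and observe (s, a, r, s').
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                action = greedy_action(Q, state, rng)
            next_state, reward, done = env_step(state, action)
            # Steps 3 and 5: the algorithm chosen here is Q-learning (a TD method).
            target = reward if done else reward + GAMMA * max(Q[next_state])
            Q[state][action] += alpha * (target - Q[state][action])
            state = next_state
            if done:
                break
    return Q

def evaluate(Q, max_steps=100):
    """Step 6: follow the greedy policy from cell 0 and count steps to the goal."""
    state, steps = 0, 0
    while state != GOAL and steps < max_steps:
        action = max(ACTIONS, key=lambda a: Q[state][a])
        state, _, _ = env_step(state, action)
        steps += 1
    return steps

Q = train()
print("steps to goal under greedy policy:", evaluate(Q))
```

In a real project, step 6 would feed back into the hyperparameters (`alpha`, `epsilon`, `episodes`) rather than run once.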

Temporal Difference Learning

Temporal Difference Learning is the central idea behind the most widely used RL algorithms. TD learning combines the sampling of Monte Carlo methods with the bootstrapping of dynamic programming: it updates estimates using other learned estimates without waiting for the final outcome.

The simplest TD update rule (TD(0)) for state values is:

$$V(S_t) \leftarrow V(S_t) + \alpha \left[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right]$$

The quantity $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$ is called the TD Error, and it drives all TD-based learning.
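A minimal sketch of the TD(0) update in code, applied to a five-state random walk (a standard textbook-style prediction task whose true values are $1/6, 2/6, \ldots, 5/6$). The function names, step size, and episode count are assumptions for illustration.

```python
import random

def td0_update(V, state, reward, next_state, alpha, gamma):
    """Apply one TD(0) update in place and return the TD error delta_t."""
    delta = reward + gamma * V[next_state] - V[state]
    V[state] += alpha * delta
    return delta

def random_walk_td0(episodes=2000, alpha=0.05, gamma=1.0, seed=0):
    """Estimate values of states 0..4 under a uniform random left/right policy.

    Exiting to the right of state 4 gives reward 1; exiting left of state 0 gives 0.
    """
    rng = random.Random(seed)
    terminal = 5
    V = [0.5] * 5 + [0.0]            # index 5 is the terminal state, fixed at value 0
    for _ in range(episodes):
        state = 2                    # every episode starts in the middle
        while state != terminal:
            nxt = state + rng.choice([-1, 1])
            reward = 0.0
            if nxt < 0:
                nxt = terminal
            elif nxt > 4:
                nxt, reward = terminal, 1.0
            td0_update(V, state, reward, nxt, alpha, gamma)   # update after every step
            state = nxt
    return V[:5]

print([round(v, 2) for v in random_walk_td0()])
```

Each update happens immediately after a single transition, which is the "without waiting for the final outcome" property described above.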

Footnotes

  1. Part 1: Key Concepts in RL - OpenAI Spinning Up - OpenAI's authoritative introduction to RL terminology, notation, and core mathematical framework.

Q-Learning is a model-free, off-policy algorithm that directly approximates the optimal Q-function regardless of the policy being followed.

Update rule: $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a') - Q(S_t, A_t) \right]$

Key characteristics:

  • Uses the maximum Q-value of the next state for updates (off-policy)
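  • Converges to the optimal Q-function in the tabular case under any exploration strategy, provided every state-action pair keeps being visited and the step size decays appropriately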
  • Can learn from old experience (experience replay)
  • Tends to be more optimistic — assumes optimal future behavior
  • Works best in discrete action spaces
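The off-policy character of Q-learning shows up in the update target. The sketch below contrasts it with the on-policy SARSA target on a single transition; the Q-table contents and state names are hypothetical.

```python
def q_learning_target(Q, reward, next_state, gamma):
    """Off-policy target: bootstrap from the best action in the next state."""
    return reward + gamma * max(Q[next_state].values())

def sarsa_target(Q, reward, next_state, next_action, gamma):
    """On-policy target: bootstrap from the action actually taken next."""
    return reward + gamma * Q[next_state][next_action]

def td_update(Q, state, action, target, alpha):
    """Move Q(state, action) a fraction alpha of the way toward the target."""
    Q[state][action] += alpha * (target - Q[state][action])

# Illustrative table: in state "s2" the action "right" looks much better than "left".
Q = {"s1": {"left": 0.0, "right": 0.0},
     "s2": {"left": 1.0, "right": 5.0}}

# Suppose the agent moved s1 -> s2 with reward 0, then explored by choosing "left".
ql = q_learning_target(Q, 0.0, "s2", 0.9)        # 0.9 * 5.0 = 4.5
sa = sarsa_target(Q, 0.0, "s2", "left", 0.9)     # 0.9 * 1.0 = 0.9
td_update(Q, "s1", "right", ql, alpha=0.5)       # Q["s1"]["right"] becomes 2.25
print(ql, sa, Q["s1"]["right"])
```

Q-learning ignores the exploratory "left" move and assumes optimal behavior from `s2` onward, which is the optimism noted in the list above; SARSA's target reflects what the behavior policy actually did.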

Footnotes

  1. Reinforcement Learning part 2: SARSA vs Q-learning - studywolf - Intuitive and practical comparison of SARSA and Q-learning with the cliff-walking example.

[Figure: Q-Learning vs SARSA: Key Characteristics. Comparison across several dimensions.]

Policy Gradient Methods

While value-based methods like Q-learning learn an action-value function and derive a policy from it, Policy Gradient methods directly parameterize and optimize the policy $\pi_\theta(a \mid s)$.

The objective is to maximize the expected cumulative reward:

$$J(\theta) = \mathbb{E}_{\pi_\theta} \left[ \sum_{t=0}^{T} \gamma^t R_{t+1} \right]$$

Using the Policy Gradient Theorem (Sutton et al., 1999), the gradient can be expressed as:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta} \left[ \nabla_\theta \log \pi_\theta(a \mid s) \, Q^{\pi_\theta}(s, a) \right]$$

This means we can update the policy parameters by increasing the probability of actions that led to high returns:

$$\theta_{t+1} = \theta_t + \alpha \, \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, G_t$$

The REINFORCE algorithm (Williams, 1992) is the simplest policy gradient method: it uses Monte Carlo returns $G_t$ after each complete episode. While conceptually elegant, it suffers from high variance. Actor-Critic methods address this by using a learned value function (the critic) to reduce variance in the gradient estimate.
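The REINFORCE update can be sketched on the simplest possible case: a stateless two-armed bandit with a softmax policy, where every episode lasts one step and so $G_t$ is just the immediate reward. For a softmax policy, $\partial \log \pi_\theta(a) / \partial \theta_k = \mathbb{1}[k = a] - \pi_\theta(k)$. The reward values, learning rate, and function names below are assumptions for illustration.

```python
import math
import random

def softmax(theta):
    """Turn preferences theta into action probabilities."""
    m = max(theta)                                  # subtract max for numerical stability
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

def reinforce_bandit(rewards, episodes=2000, alpha=0.1, seed=0):
    """Run REINFORCE on a bandit with one fixed reward per arm; return final probabilities."""
    rng = random.Random(seed)
    theta = [0.0] * len(rewards)
    for _ in range(episodes):
        probs = softmax(theta)
        action = rng.choices(range(len(theta)), weights=probs)[0]
        G = rewards[action]                         # one-step episode: return = reward
        for k in range(len(theta)):
            grad_log_pi = (1.0 if k == action else 0.0) - probs[k]
            theta[k] += alpha * grad_log_pi * G     # theta <- theta + alpha * grad * G
    return softmax(theta)

final_probs = reinforce_bandit([0.0, 1.0])
print([round(p, 3) for p in final_probs])
```

The policy shifts its probability mass toward the rewarded arm. With noisy rewards or longer episodes the same update becomes much noisier, which is the variance problem that a critic is meant to reduce.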

Footnotes

  1. Policy Gradient Algorithms - Lil'Log (Lilian Weng) - Extensive catalog of policy gradient methods including REINFORCE, Actor-Critic, DDPG, PPO, and SAC.

  2. What is Deep Reinforcement Learning? - IBM - Overview of deep RL algorithms, actor-critic architecture, and the exploration-exploitation balance.

Evolution of Reinforcement Learning

Dynamic Programming

1957

Richard Bellman introduces the Bellman equation and dynamic programming, establishing the mathematical foundations for sequential decision-making under uncertainty.

Q-Learning

1989

Watkins introduces Q-learning, the first provably convergent off-policy temporal difference control algorithm and a landmark in model-free RL.

REINFORCE

1992

Ronald Williams publishes the REINFORCE algorithm, establishing the foundation for policy gradient methods in reinforcement learning.

Sutton & Barto Textbook

1998

The seminal textbook 'Reinforcement Learning: An Introduction' is published, unifying the field and becoming the standard reference for decades.

Deep Q-Network (DQN)

2013

Mnih et al. combine deep neural networks with Q-learning to learn Atari games directly from raw pixels, launching the deep RL era. The 2015 Nature follow-up reports human-level performance across 49 Atari games.

AlphaGo

2016

DeepMind's AlphaGo defeats world champion Lee Sedol at Go, combining deep RL with Monte Carlo tree search in a historic AI milestone.

PPO & SAC

2017–2018

Proximal Policy Optimization (PPO, 2017) and Soft Actor-Critic (SAC, 2018) are introduced and become two of the most widely used modern RL algorithms, PPO largely for its stability and simplicity and SAC for its sample efficiency.

Deep Reinforcement Learning

Deep Reinforcement Learning emerged when researchers realized that tabular methods cannot scale to problems with large or continuous state spaces (e.g., raw pixel inputs, robotics).

The Deep Q-Network (DQN) introduced several key innovations to stabilize training:

  1. Experience Replay: Store transitions in a replay buffer and sample random mini-batches to break temporal correlations
  2. Target Network: Use a separate, slowly-updated network for computing target Q-values to reduce moving-target instability
  3. Error Clipping: Clip the TD error term to the range $[-1, 1]$, which bounds the size of each gradient step and prevents exploding updates

The DQN update becomes:

$$\mathcal{L}(\theta) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}} \left[ \left( r + \gamma \max_{a'} Q_{\theta^-}(s', a') - Q_\theta(s, a) \right)^2 \right]$$

where $\theta^-$ are the frozen target network parameters and $\mathcal{D}$ is the replay buffer.
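The loss above can be sketched without any deep learning library by treating the two networks as plain functions. In this sketch `q_online` and `q_target` are hypothetical callables standing in for $Q_\theta$ and $Q_{\theta^-}$, each mapping a state to a list of action values. The `done` flag is an addition not shown in the formula: it is a common convention that drops the bootstrap term at episode ends.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of (s, a, r, s', done) transitions; oldest entries are dropped."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def add(self, s, a, r, s2, done):
        self.buffer.append((s, a, r, s2, done))

    def sample(self, batch_size, rng):
        # Uniform random sampling breaks the temporal correlation between transitions.
        return rng.sample(list(self.buffer), batch_size)

def dqn_loss(batch, q_online, q_target, gamma):
    """Mean squared TD error with targets computed from the frozen target function."""
    total = 0.0
    for s, a, r, s2, done in batch:
        bootstrap = 0.0 if done else gamma * max(q_target(s2))
        td_error = r + bootstrap - q_online(s)[a]
        total += td_error ** 2
    return total / len(batch)

# Stand-in "networks" that return constant action values (illustration only).
q_online = lambda s: [1.0, 0.0]
q_target = lambda s: [2.0, 3.0]

buffer = ReplayBuffer(capacity=100)
buffer.add(0, 0, 1.0, 1, False)
batch = buffer.sample(1, random.Random(0))
print(dqn_loss(batch, q_online, q_target, gamma=0.9))   # (1 + 0.9 * 3 - 1)^2
```

In a full implementation the loss would be minimized by gradient descent on $\theta$ only, with $\theta^-$ copied from $\theta$ at intervals.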

Footnotes

  1. Sutton & Barto Summary: Finite MDPs - Summary of Chapter 3 from the definitive RL textbook covering policies, value functions, and the Bellman equations.

Common Questions & Edge Cases

Choosing the Right Algorithm

Start simple! For tabular (small) environments, Q-learning or SARSA are excellent choices. For environments with visual (pixel) inputs and discrete actions, use DQN. For continuous action spaces (robotics, control), use PPO or SAC. PPO is often the best default choice — it strikes a good balance between sample efficiency, stability, and implementation simplicity.

The Reward Hypothesis Trap

Sutton's reward hypothesis states that all goals can be described by maximization of expected cumulative reward. While powerful, reward specification is notoriously hard in practice. Poorly designed reward functions lead to unintended behavior — agents find 'shortcuts' that maximize reward without achieving the intended goal (e.g., a robot pausing indefinitely to avoid negative rewards). Always validate that your reward signal aligns with your true objective.
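The "pausing robot" failure can be reproduced with simple arithmetic on returns. In this hypothetical setup, each step toward the goal costs a penalty, arrival earns a bonus, and pausing earns zero forever. All numbers are assumptions chosen to show the effect.

```python
def discounted_return(rewards, gamma):
    """Discounted sum of a finite reward sequence, computed backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def walk_to_goal(steps, step_penalty, goal_bonus):
    """Rewards for walking `steps` steps: a penalty each step plus a bonus on arrival."""
    rewards = [step_penalty] * steps
    rewards[-1] += goal_bonus
    return rewards

GAMMA = 0.99
pause_forever = 0.0   # pausing yields reward 0 at every step, so its return is 0
weak_bonus = discounted_return(walk_to_goal(10, -1.0, 5.0), GAMMA)
strong_bonus = discounted_return(walk_to_goal(10, -1.0, 20.0), GAMMA)
print(round(weak_bonus, 2), pause_forever, round(strong_bonus, 2))
```

With the weak bonus, walking to the goal has a negative return, so a return-maximizing agent correctly (from its point of view) prefers to pause. Checking candidate behaviors against the reward function in this way is one cheap form of the validation recommended above.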

[Figure: RL Algorithm Family Comparison. Qualitative assessment across key dimensions.]

Reinforcement Learning Key Concepts

What is an MDP?

A Markov Decision Process is a mathematical framework defined by $\langle \mathcal{S}, \mathcal{A}, P, R, \gamma \rangle$ that models sequential decision-making where outcomes depend on both chance and the agent's actions, satisfying the Markov property.

Knowledge Check


In a Markov Decision Process, what does the transition function $P(s' \mid s, a)$ represent?