Reinforcement Learning Fundamentals
Reinforcement Learning (RL) is one of the three paradigms of machine learning, alongside supervised and unsupervised learning, and arguably the one that most closely resembles how humans and animals actually learn. Rather than being told the correct answer, an RL agent discovers which actions yield the greatest reward by exploring its environment and learning from the consequences of its actions.
At its core, RL formalizes a simple idea: rewarding desired behaviors and punishing undesired ones causes an agent to shift its behavior toward the rewarded actions over time. This feedback loop between an agent and its environment creates a powerful learning framework that has produced breakthrough results in game playing (AlphaGo), robotics, autonomous vehicles, and resource management.
The fundamental challenge in RL is the exploration-exploitation tradeoff: the agent must exploit what it already knows to obtain reward, but it must also explore unknown actions to discover potentially better strategies in the future.
The diagram above illustrates the core RL loop: at each timestep, the agent observes a state, selects an action, and receives a reward and a new state from the environment.
Footnotes
- Reinforcement Learning Basics - SmythOS - Core RL concepts including agent-environment interaction and algorithm categories.
- An Introduction to Reinforcement Learning: Fundamental Concepts and Practical Applications - arXiv - Comprehensive survey of RL fundamentals, MDPs, policy iteration, and value iteration.
- On-Policy vs Off-Policy Reinforcement Learning - CORE Robotics Lab, Georgia Tech - Detailed comparison of on-policy and off-policy methods with SARSA vs Q-learning analysis.
Reinforcement Learning: Essential Concepts
The Agent–Environment Interface
The RL framework is built on the interaction between two entities:
- Agent: The learner or decision-maker. It observes the environment's state and selects actions according to a Policy.
- Environment: Everything external to the agent. It receives actions, transitions to new states, and emits reward signals.
This interface is modeled as a finite Markov Decision Process, which provides the formal foundation for nearly all RL problems.
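The interaction loop can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `Corridor` environment, its `reset`/`step` methods, and the `always_right` policy are hypothetical names invented for this example.

```python
class Corridor:
    """Hypothetical toy environment: walk from cell 0 to the goal cell."""

    def __init__(self, length=5):
        self.goal = length - 1
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        """Action 0 moves left, action 1 moves right; every step costs -1."""
        move = 1 if action == 1 else -1
        self.state = min(self.goal, max(0, self.state + move))
        done = self.state == self.goal
        return self.state, -1.0, done


def run_episode(env, policy, max_steps=100):
    """Run one episode of the agent-environment loop; return the total reward."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)                  # agent selects an action
        state, reward, done = env.step(action)  # environment responds
        total_reward += reward
        if done:
            break
    return total_reward


def always_right(state):
    """A fixed policy that ignores the state and always moves right."""
    return 1


total = run_episode(Corridor(), always_right)
print(total)  # -4.0: four steps from cell 0 to cell 4
```

The environment owns the state transition and the reward; the agent only sees what `step` returns, which is the separation the interface describes.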
Key Components of an MDP
An MDP is defined by the tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$:
| Component | Symbol | Description |
|---|---|---|
| State space | $\mathcal{S}$ | Set of all possible states |
| Action space | $\mathcal{A}$ | Set of all possible actions |
| Transition probability | $P(s' \mid s, a)$ | Probability of transitioning to $s'$ given $s$ and $a$ |
| Reward function | $R(s, a, s')$ | Immediate reward received after transition |
| Discount factor | $\gamma \in [0, 1]$ | Determines present value of future rewards |
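One way to hold these five components in code is a small container class. The sketch below is an assumption about representation, not a standard API: the `MDP` dataclass, its field layout, and the two-state "cool/hot" example are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class MDP:
    states: list
    actions: list
    P: dict       # P[(s, a)] -> {next_state: probability}
    R: dict       # R[(s, a, next_state)] -> immediate reward
    gamma: float  # discount factor

    def transition_prob(self, s, a, s_next):
        return self.P[(s, a)].get(s_next, 0.0)

    def reward(self, s, a, s_next):
        return self.R.get((s, a, s_next), 0.0)

    def is_valid(self):
        """Each (state, action) row of P must sum to 1."""
        return all(
            abs(sum(self.P[(s, a)].values()) - 1.0) < 1e-9
            for s in self.states
            for a in self.actions
        )


mdp = MDP(
    states=["cool", "hot"],
    actions=["slow", "fast"],
    P={
        ("cool", "slow"): {"cool": 1.0},
        ("cool", "fast"): {"cool": 0.5, "hot": 0.5},
        ("hot", "slow"): {"cool": 0.5, "hot": 0.5},
        ("hot", "fast"): {"hot": 1.0},
    },
    R={
        ("cool", "slow", "cool"): 1.0,
        ("cool", "fast", "cool"): 2.0,
        ("cool", "fast", "hot"): 2.0,
        ("hot", "slow", "cool"): 1.0,
        ("hot", "slow", "hot"): 1.0,
        ("hot", "fast", "hot"): -10.0,
    },
    gamma=0.9,
)
print(mdp.is_valid())  # True
```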
The Markov Property is the critical assumption: the future is independent of the past given the present. Formally:
$$P(S_{t+1} \mid S_t, A_t) = P(S_{t+1} \mid S_1, A_1, \ldots, S_t, A_t)$$
This "memorylessness" is what makes RL problems tractable: the agent need not remember the entire history of interactions.
Footnotes
- Reinforcement Learning Basics - SmythOS - Core RL concepts including agent-environment interaction and algorithm categories.
- Part 1: Key Concepts in RL - OpenAI Spinning Up - OpenAI's authoritative introduction to RL terminology, notation, and core mathematical framework.
Rewards, Returns, and Value Functions
The Reward Signal defines the goal of the RL problem. However, the agent's true objective is not to maximize immediate reward, but to maximize the expected cumulative return over time.
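The return is defined as the discounted sum of future rewards:
$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$
The Discount Factor $\gamma \in [0, 1]$ controls the agent's horizon: values near 0 make the agent myopic (short-sighted), while values near 1 make the agent far-sighted.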
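A short sketch makes the effect of the discount factor concrete. The function name `discounted_return` and the sample reward list are illustrative assumptions; the loop works backward through the rewards, applying the discount once per step.

```python
def discounted_return(rewards, gamma):
    """Discounted sum of a finite reward sequence, accumulated from the end."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


rewards = [1.0, 1.0, 1.0]
print(discounted_return(rewards, 0.0))  # 1.0  (myopic: only the first reward counts)
print(discounted_return(rewards, 0.5))  # 1.75 (1 + 0.5 + 0.25)
print(discounted_return(rewards, 1.0))  # 3.0  (far-sighted: all rewards count equally)
```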
Two types of Value Function are central to RL:
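- State-value function $V^\pi(s)$: Expected return starting from state $s$ and following policy $\pi$: $V^\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$
- Action-value function $Q^\pi(s, a)$: Expected return starting from state $s$, taking action $a$, and then following policy $\pi$: $Q^\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$
The relationship between them is:
$$V^\pi(s) = \sum_{a} \pi(a \mid s)\, Q^\pi(s, a)$$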
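Numerically, the state value is the policy-weighted average of the action values in that state. The sketch below assumes a single state with two actions; the function name and the numbers are made up for illustration.

```python
def state_value(policy_probs, q_values):
    """Average the action values under the policy's action probabilities."""
    return sum(policy_probs[a] * q_values[a] for a in policy_probs)


policy_probs = {"left": 0.25, "right": 0.75}  # pi(a | s) for one state s
q_values = {"left": 2.0, "right": 4.0}        # Q(s, a) for the same state
print(state_value(policy_probs, q_values))    # 3.5
```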
The Bellman Equations
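The Bellman Equation is arguably the most fundamental equation in RL. It decomposes the value function into two parts: the immediate reward and the discounted value of the successor state.
Bellman Expectation Equation (for $V^\pi$):
$$V^\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a)\left[R(s, a, s') + \gamma V^\pi(s')\right]$$
Bellman Expectation Equation (for $Q^\pi$):
$$Q^\pi(s, a) = \sum_{s'} P(s' \mid s, a)\left[R(s, a, s') + \gamma \sum_{a'} \pi(a' \mid s')\, Q^\pi(s', a')\right]$$
The Bellman Optimality Equations express the value under the optimal policy $\pi^*$:
$$V^*(s) = \max_{a} \sum_{s'} P(s' \mid s, a)\left[R(s, a, s') + \gamma V^*(s')\right]$$
$$Q^*(s, a) = \sum_{s'} P(s' \mid s, a)\left[R(s, a, s') + \gamma \max_{a'} Q^*(s', a')\right]$$
The key insight: under the optimal policy, the value of a state equals the expected return from taking the best action, so no averaging over a policy is needed.
This backup diagram represents the Bellman equation: the value at state $s$ is computed by looking ahead to all possible next states, weighted by their transition probabilities.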
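Applying the Bellman optimality backup repeatedly until the values stop changing gives value iteration. The sketch below is a minimal version under stated assumptions: the three-state chain, the `(probability, next_state, reward)` encoding, and the function names are all hypothetical, and the model is assumed to be fully known.

```python
def backup(outcomes, V, gamma):
    """Expected one-step return for a list of (prob, next_state, reward)."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)


def value_iteration(P, gamma, tol=1e-10):
    """P[s][a] is a list of (prob, next_state, reward); terminal states have no actions."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            if not P[s]:  # terminal state keeps value 0
                continue
            best = max(backup(outcomes, V, gamma) for outcomes in P[s].values())
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V


def greedy_policy(P, V, gamma):
    """Pick, in each non-terminal state, the action with the best backup."""
    return {
        s: max(P[s], key=lambda a: backup(P[s][a], V, gamma))
        for s in P
        if P[s]
    }


# Toy chain: 0 -> 1 -> 2 (terminal); reaching state 2 pays +1.
P = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(1.0, 1, 0.0)]},
    1: {"stay": [(1.0, 1, 0.0)], "go": [(1.0, 2, 1.0)]},
    2: {},
}
V = value_iteration(P, gamma=0.9)
policy = greedy_policy(P, V, gamma=0.9)
print(V)       # approximately {0: 0.9, 1: 1.0, 2: 0.0}
print(policy)  # {0: 'go', 1: 'go'}
```

State 0 is worth 0.9 because the +1 reward is one extra step away and is discounted once.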
Footnotes
- Part 1: Key Concepts in RL - OpenAI Spinning Up - OpenAI's authoritative introduction to RL terminology, notation, and core mathematical framework.
- Markov Decision Processes and Bellman Equations - Towards Data Science - Thorough derivation and explanation of the Bellman equations from first principles.
The Reinforcement Learning Solution Process
- Step 1: Define the state space $\mathcal{S}$, action space $\mathcal{A}$, transition probabilities $P$, reward function $R$, and discount factor $\gamma$. This determines whether the problem has discrete or continuous states/actions, which heavily influences algorithm choice.
- Step 2: Decide whether to learn value functions (tabular arrays, neural networks) or directly parameterize the policy. Value-based methods estimate $Q(s, a)$, while policy-based methods directly learn $\pi_\theta(a \mid s)$.
- Step 3: Based on the problem structure, choose between:
  - Dynamic Programming (if model is known)
  - Monte Carlo methods (if full episodes are available)
  - Temporal Difference learning (bootstrapping from incomplete episodes)
  - Policy Gradient methods (for continuous action spaces)
- Step 4: The agent selects actions (using its current policy, often with exploration via $\epsilon$-greedy or action noise), observes next states and rewards, and collects experience tuples $(s, a, r, s')$.
- Step 5: Apply the learning update rule specific to the chosen algorithm (e.g., Q-learning update, TD error, or policy gradient ascent). This step uses collected experience to improve the agent's estimates.
- Step 6: Evaluate the learned policy's performance, adjust hyperparameters (learning rate, exploration rate), and repeat the interaction-update loop until convergence or satisfactory performance is achieved.
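The exploration rule mentioned in Step 4 can be sketched as follows. This is one common way to write epsilon-greedy selection, with assumptions stated: action values arrive as a plain list, ties go to the lowest index, and the function name is illustrative.

```python
import random


def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore
    return q_values.index(max(q_values))     # exploit (first maximum on ties)


rng = random.Random(0)
q_values = [0.1, 0.9, 0.3]
greedy_action = epsilon_greedy(q_values, 0.0, rng)  # epsilon = 0 never explores
explored = {epsilon_greedy(q_values, 1.0, rng) for _ in range(200)}
print(greedy_action)  # 1
```

Many practical setups decay epsilon over training so that the agent explores broadly at first and exploits more as its estimates improve.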
Temporal Difference Learning
Temporal Difference Learning is the central idea behind the most widely used RL algorithms. TD learning combines the sampling of Monte Carlo methods with the bootstrapping of dynamic programming: it updates estimates using other learned estimates without waiting for the final outcome.
The simplest TD update rule (TD(0)) for state values is:
$$V(S_t) \leftarrow V(S_t) + \alpha\left[R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\right]$$
The quantity $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$ is called the TD Error, and it drives all TD-based learning.
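A single TD(0) step can be written directly from the update rule. In this sketch the dictionary-based value table, the state names, and the `terminal` flag are illustrative assumptions; the function returns the TD error so the caller can inspect it.

```python
def td0_update(V, s, r, s_next, alpha, gamma, terminal=False):
    """Apply one TD(0) update to V[s] in place and return the TD error."""
    bootstrap = 0.0 if terminal else gamma * V[s_next]
    td_error = r + bootstrap - V[s]
    V[s] += alpha * td_error
    return td_error


V = {"A": 0.0, "B": 1.0}
delta = td0_update(V, "A", r=0.5, s_next="B", alpha=0.1, gamma=0.9)
print(round(delta, 6))   # 1.4  (0.5 + 0.9 * 1.0 - 0.0)
print(round(V["A"], 6))  # 0.14 (moved 10% of the way toward the target)
```

The update uses the current estimate of the next state rather than a full episode return, which is the bootstrapping the text describes.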
Footnotes
- Part 1: Key Concepts in RL - OpenAI Spinning Up - OpenAI's authoritative introduction to RL terminology, notation, and core mathematical framework.
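Q-Learning is a model-free, off-policy algorithm that directly approximates the optimal Q-function regardless of the policy being followed.
Update rule:
$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\left[R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a') - Q(S_t, A_t)\right]$$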
Key characteristics:
- Uses the maximum Q-value of the next state for updates (off-policy)
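- Converges to the optimal Q-function in the tabular case under any exploration strategy that keeps visiting every state-action pair, given a suitably decaying learning rate
- Can learn from old experience (experience replay)
- Tends to be more optimistic than on-policy methods such as SARSA, because its target assumes optimal future behavior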
- Works best in discrete action spaces
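The off-policy property can be seen in a small tabular sketch: the agent below behaves uniformly at random, yet the max in the update target lets it learn the values of the greedy policy. The five-cell corridor, the constants, and the function names are assumptions made for this example.

```python
import random

N_STATES, GOAL = 5, 4
LEFT, RIGHT = 0, 1


def step(state, action):
    """Deterministic corridor: reward +1 only on reaching the goal cell."""
    nxt = min(GOAL, state + 1) if action == RIGHT else max(0, state - 1)
    return nxt, (1.0 if nxt == GOAL else 0.0), nxt == GOAL


def q_learning(episodes=500, alpha=0.5, gamma=0.9, seed=0, max_steps=1000):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _episode in range(episodes):
        s = 0
        for _t in range(max_steps):
            a = rng.randrange(2)  # behaviour policy: uniformly random
            s2, r, done = step(s, a)
            # Target uses the best next action, not the action actually taken next.
            target = r if done else r + gamma * max(Q[s2])
            Q[s][a] += alpha * (target - Q[s][a])
            s = s2
            if done:
                break
    return Q


Q = q_learning()
greedy = [max((LEFT, RIGHT), key=lambda a: Q[s][a]) for s in range(GOAL)]
print(greedy)  # [1, 1, 1, 1]: move right in every non-terminal cell
```

A fixed learning rate is enough here because the toy environment is deterministic; stochastic environments generally need the decaying schedule that the convergence guarantee assumes.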
Footnotes
- Reinforcement Learning part 2: SARSA vs Q-learning - studywolf - Intuitive and practical comparison of SARSA and Q-learning with the cliff-walking example.
Figure: Q-Learning vs SARSA, key characteristics compared across several dimensions.
Policy Gradient Methods
While value-based methods like Q-learning learn an action-value function and derive a policy from it, Policy Gradient methods directly parameterize and optimize the policy $\pi_\theta(a \mid s)$.
The objective is to maximize the expected cumulative reward:
$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T-1} \gamma^t R_{t+1}\right]$$
Using the Policy Gradient Theorem (Sutton et al., 1999), the gradient can be expressed as:
$$\nabla_\theta J(\theta) \propto \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a)\right]$$
This means we can update the policy parameters by increasing the probability of actions that led to high returns:
$$\theta \leftarrow \theta + \alpha\, G_t\, \nabla_\theta \log \pi_\theta(A_t \mid S_t)$$
The REINFORCE algorithm (Williams, 1992) is the simplest policy gradient method: it uses Monte Carlo returns after each complete episode. While conceptually elegant, it suffers from high variance. Actor-Critic methods address this by using a learned value function (the critic) to reduce variance in the gradient estimate.
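The REINFORCE update can be illustrated on the smallest possible problem: a two-armed bandit, where every episode is a single step and the return is just the reward. This is a simplified sketch under those assumptions; the softmax parameterization, the arm rewards, and the function names are chosen for illustration only.

```python
import math
import random


def softmax(prefs):
    m = max(prefs)  # subtract the max for numerical stability
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]


def reinforce_bandit(arm_rewards, episodes=500, alpha=0.1, seed=0):
    """REINFORCE with a softmax policy over arms; one-step episodes."""
    rng = random.Random(seed)
    theta = [0.0] * len(arm_rewards)
    for _ in range(episodes):
        probs = softmax(theta)
        action = rng.choices(range(len(theta)), weights=probs)[0]
        G = arm_rewards[action]  # Monte Carlo return of this episode
        for k in range(len(theta)):
            # Gradient of log softmax: indicator(k == action) - pi(k)
            grad_log = (1.0 if k == action else 0.0) - probs[k]
            theta[k] += alpha * G * grad_log
    return softmax(theta)


probs = reinforce_bandit([0.0, 1.0])
print(probs[1] > 0.9)  # True: the policy concentrates on the rewarding arm
```

With multi-step episodes the same update applies at every timestep, using the return from that timestep onward; subtracting a baseline from the return is the usual first step toward reducing variance.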
Footnotes
- Policy Gradient Algorithms - Lil'Log (Lilian Weng) - Extensive catalog of policy gradient methods including REINFORCE, Actor-Critic, DDPG, PPO, and SAC.
- What is Deep Reinforcement Learning? - IBM - Overview of deep RL algorithms, actor-critic architecture, and the exploration-exploitation balance.
Evolution of Reinforcement Learning
Dynamic Programming
1957: Richard Bellman introduces the Bellman equation and dynamic programming, establishing the mathematical foundations for sequential decision-making under uncertainty.
Q-Learning
1989: Watkins introduces Q-learning, an off-policy temporal difference control algorithm and a landmark in model-free RL; Watkins and Dayan publish its convergence proof in 1992.
REINFORCE
1992: Ronald Williams publishes the REINFORCE algorithm, establishing the foundation for policy gradient methods in reinforcement learning.
Sutton & Barto Textbook
1998: The first edition of the seminal textbook 'Reinforcement Learning: An Introduction' is published, unifying the field and becoming the standard reference for decades.
Deep Q-Network (DQN)
2013: Mnih et al. combine deep neural networks with Q-learning to learn Atari games directly from pixels; the 2015 Nature follow-up reports human-level performance across a large suite of games, launching the deep RL era.
AlphaGo
2016: DeepMind's AlphaGo defeats world champion Lee Sedol at Go, combining deep RL with Monte Carlo tree search in a historic AI milestone.
PPO & SAC
2017-2018: Proximal Policy Optimization (PPO, 2017) and Soft Actor-Critic (SAC, 2018) are introduced, becoming two of the most widely used modern RL algorithms, valued for training stability (PPO) and sample efficiency (SAC).
Deep Reinforcement Learning
Deep Reinforcement Learning emerged when researchers realized that tabular methods cannot scale to problems with large or continuous state spaces (e.g., raw pixel inputs, robotics), and replaced the table with a neural network that approximates the value function or policy.
The Deep Q-Network (DQN) introduced several key innovations to stabilize training:
- Experience Replay: Store transitions in a replay buffer and sample random mini-batches to break temporal correlations
- Target Network: Use a separate, slowly-updated network for computing target Q-values to reduce moving-target instability
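- Gradient Clipping: Clip the TD error (to $[-1, 1]$ in the original DQN) to prevent gradient explosion
The DQN update minimizes the loss:
$$L(\theta) = \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}}\left[\left(r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta)\right)^2\right]$$
where $\theta^-$ are the frozen target network parameters and $\mathcal{D}$ is the replay buffer.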
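The two data-handling pieces of DQN, the replay buffer and the target computation, can be sketched without any deep learning library. The sketch below makes several assumptions: `ReplayBuffer` and `dqn_targets` are hypothetical names, transitions carry a `done` flag, and `target_q` is a stand-in callable for the frozen target network that returns a list of action values.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of transitions; the oldest entries are evicted first."""

    def __init__(self, capacity, seed=0):
        self.data = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def push(self, s, a, r, s_next, done):
        self.data.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Uniform random sampling breaks the temporal correlation of consecutive steps.
        return self.rng.sample(list(self.data), batch_size)

    def __len__(self):
        return len(self.data)


def dqn_targets(batch, target_q, gamma):
    """Regression targets: r + gamma * max over a' of target_q(s')[a'], no bootstrap at terminals."""
    return [
        r if done else r + gamma * max(target_q(s_next))
        for (_s, _a, r, s_next, done) in batch
    ]


buffer = ReplayBuffer(capacity=3)
for t in range(5):
    buffer.push(s=t, a=0, r=1.0, s_next=t + 1, done=(t == 4))

frozen_q = lambda state: [0.0, 2.0]  # stand-in for the frozen target network
batch = buffer.sample(2)
targets = dqn_targets(batch, frozen_q, gamma=0.5)
print(len(buffer))  # 3: only the three most recent transitions are kept
```

In a full implementation, the online network would be trained to move its prediction for the taken action toward these targets, and the target network's parameters would be refreshed from the online network only periodically.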
Footnotes
- Sutton & Barto Summary: Finite MDPs - Summary of Chapter 3 from the definitive RL textbook covering policies, value functions, and the Bellman equations.
Common Questions & Edge Cases
Choosing the Right Algorithm
Start simple! For tabular (small) environments, Q-learning or SARSA are excellent choices. For environments with visual (pixel) inputs and discrete actions, use DQN. For continuous action spaces (robotics, control), use PPO or SAC. PPO is often the best default choice — it strikes a good balance between sample efficiency, stability, and implementation simplicity.
The Reward Hypothesis Trap
Sutton's reward hypothesis states that all goals can be described by maximization of expected cumulative reward. While powerful, reward specification is notoriously hard in practice. Poorly designed reward functions lead to unintended behavior — agents find 'shortcuts' that maximize reward without achieving the intended goal (e.g., a robot pausing indefinitely to avoid negative rewards). Always validate that your reward signal aligns with your true objective.
Figure: RL Algorithm Family Comparison, a qualitative assessment across key dimensions.