Detection and Recovery
Learning Goals
- Construct a Wait-for Graph from a Resource-Allocation Graph for single-instance resource systems.
- Apply the Banker's Detection Algorithm to detect deadlocks in multi-instance resource systems.
- Analyze the trade-offs of different detection frequencies (periodic vs. on-demand).
- Evaluate recovery strategies: process termination (abort all vs. abort one-by-one) and resource preemption (victim selection, rollback, starvation prevention).
Deadlock Detection: An Overview
Deadlock detection is the process of determining whether a deadlock currently exists in the system. Unlike prevention and avoidance, detection allows the system to enter deadlock states — but then detects them and takes corrective action.
When to use detection:
- Prevention is too costly or restrictive.
- Avoidance is impossible because maximum resource needs are unknown.
- The system is a general-purpose OS (like Linux or Windows) with unpredictable workloads.
Detection methods depend on resource type:
| Resource Type | Detection Method | Complexity |
|---|---|---|
| Single-instance (each resource type has exactly 1 instance) | Wait-for Graph (WFG) — check for cycles | O(n²) using DFS |
| Multi-instance (each resource type can have multiple instances) | Banker's Detection Algorithm — matrix reduction | O(m × n²) |
Detection for Single-Instance Resources: Wait-for Graph
For systems where each resource type has exactly one instance, a deadlock can be detected by constructing a Wait-for Graph (WFG) from the Resource-Allocation Graph (RAG).
Transforming RAG to WFG
The WFG is derived from the RAG by removing resource nodes and connecting processes directly:
- For each pair (Pᵢ, Rₖ, Pⱼ) where Pᵢ is waiting for Rₖ (request edge Pᵢ → Rₖ) and Rₖ is allocated to Pⱼ (assignment edge Rₖ → Pⱼ), draw a wait-for edge from Pᵢ → Pⱼ.
- Remove all resource nodes from the graph.
- The resulting graph has only process nodes and directed edges representing "is waiting for".
Before (RAG): The Resource-Allocation Graph shows processes connected through resources.
After (WFG): The Wait-for Graph removes resource nodes and connects processes directly.
The cycle P₁ → P₂ → P₁ in the WFG indicates a deadlock.
Cycle Detection
| Graph property | Meaning |
|---|---|
| WFG has no cycles | No deadlock exists. |
| WFG contains one or more cycles | Deadlock exists. Every process in the cycle is deadlocked. |
Example Transformation:
| RAG Edges | WFG Edge | Meaning |
|---|---|---|
| P₁ → R₂ (request) + R₂ → P₂ (assignment) | P₁ → P₂ | P₁ is waiting for P₂ |
| P₂ → R₁ (request) + R₁ → P₁ (assignment) | P₂ → P₁ | P₂ is waiting for P₁ |
| Result: Cycle P₁ → P₂ → P₁ — Deadlock detected. |
Detection Algorithm (DFS-Based)
- For each process Pᵢ in the WFG, start a Depth-First Search (DFS).
- Mark nodes as: White (unvisited), Gray (in current DFS path), Black (fully explored).
- If during DFS from Pᵢ we encounter a node that is already Gray, a cycle (deadlock) has been detected.
- If no Gray node is encountered during any DFS, no deadlock exists.
Deadlock Detection and Recovery — Animated Explanation
Detection for Multi-Instance Resources: Banker's Detection Algorithm
When resource types have multiple instances, the Wait-for Graph approach is insufficient (a cycle does not guarantee deadlock). Instead, we use the Banker's Detection Algorithm, which is similar to the Safety Algorithm but uses current request information rather than maximum need.
Data Structures
| Structure | Description |
|---|---|
| Available[m] | Number of available instances of each resource type. |
| Allocation[n][m] | Number of instances currently allocated to each process. |
| Request[n][m] | Current outstanding request of each process (what each process is currently waiting for). |
The Detection Algorithm
1. Let Work[m] = Available[m] // Temporary copy 2. Let Finish[n] = false for all processes 3. For each process Pᵢ where Finish[i] == false: if Request[i] ≤ Work, then: Work = Work + Allocation[i] // Pretend Pᵢ finishes and releases Finish[i] = true Go to step 3 and restart 4. After scanning all processes: if Finish[i] == false for any process, then Pᵢ is deadlocked.
Result Interpretation
| Condition | Meaning |
|---|---|
| All Finish[i] == true | No deadlock exists. The system can recover. |
| Some Finish[i] == false | Those processes are deadlocked. They will not be able to complete. |
Example: 3 Processes, 2 Resources
Given:
- Available = (0, 0)
- Allocation = P₀: (0, 1), P₁: (2, 0), P₂: (3, 0)
- Request = P₀: (0, 0), P₁: (2, 0), P₂: (0, 1)
Execution:
- Work = (0, 0). P₀: Request (0,0) ≤ (0,0) ✓ → P₀ finishes → Work = (0+0, 0+1) = (0, 1). Finish[0] = true.
- P₁: Request (2,0) ≤ (0,1) ✗ → P₁ blocked.
- P₂: Request (0,1) ≤ (0,1) ✓ → P₂ finishes → Work = (0+3, 1+0) = (3, 1). Finish[2] = true.
- P₁: Request (2,0) ≤ (3,1) ✓ → P₁ finishes → Work = (3+2, 1+0) = (5, 1). Finish[1] = true.
- All Finish[i] == true → No deadlock. Sequence: <P₀, P₂, P₁>.
PYQ Variation: Detection Using Allocation + Request Only (No Max) [2022 Q4b]
Some exam problems provide Allocation and Request matrices directly without a Max matrix. In this case, you use the same detection algorithm but treat Request as the current outstanding needs. Let's solve the 2022 PYQ.
Problem [2022 Q4b]: Single processor system, 3 resource types (X, Y, Z) with 5 units each, 3 processes. Find which process finishes first and last.
| Process | Allocation (X Y Z) | Request (X Y Z) |
|---|---|---|
| P₀ | 1 2 1 | 1 0 3 |
| P₁ | 2 0 1 | 0 1 2 |
| P₂ | 2 2 1 | 1 2 0 |
Step 1: Calculate Available = Total − Sum(Allocation) = (5,5,5) − (1+2+2, 2+0+2, 1+1+1) = (5,5,5) − (5,4,3) = (0, 1, 2).
Step 2: Apply Detection Algorithm. Work = (0, 1, 2).
- P₁: Request (0,1,2) ≤ Work (0,1,2) ✓ → P₁ finishes → Work = (0+2, 1+0, 2+1) = (2, 1, 3). P₁ finishes first.
- P₀: Request (1,0,3) ≤ Work (2,1,3) ✓ → P₀ finishes → Work = (2+1, 1+2, 3+1) = (3, 3, 4).
- P₂: Request (1,2,0) ≤ Work (3,3,4) ✓ → P₂ finishes → Work = (3+2, 3+2, 4+1) = (5, 5, 5). P₂ finishes last.
Result: All processes complete. Safe sequence: <P₁, P₀, P₂>. P₁ finishes first, P₂ finishes last.
Constructing a Wait-for Graph and Detecting Deadlocks
- 1Step 1
Given a system with 4 processes (P₁, P₂, P₃, P₄) and 4 resource types (R₁, R₂, R₃, R₄) each with one instance. RAG edges: P₁ requests R₁, R₁ assigned to P₂. P₂ requests R₂, R₂ assigned to P₃. P₃ requests R₃, R₃ assigned to P₄. P₄ requests R₄, R₄ assigned to P₁. P₂ requests R₁, R₁ assigned to P₂ (self-loop — P₂ wants R₁ which it already has? Actually this is redundant).
- 2Step 2
For each pair (Pᵢ requests Rₖ, Rₖ assigned to Pⱼ), draw Pᵢ → Pⱼ. Eliminate all resource nodes. The WFG now has only process nodes and directed edges.
- 3Step 3
P₁ requests R₁ (assigned to P₂) → P₁ → P₂. P₂ requests R₂ (assigned to P₃) → P₂ → P₃. P₃ requests R₃ (assigned to P₄) → P₃ → P₄. P₄ requests R₄ (assigned to P₁) → P₄ → P₁. So edges: P₁ → P₂, P₂ → P₃, P₃ → P₄, P₄ → P₁.
- 4Step 4
Start DFS from P₁: visit P₁ (gray), then P₂ (gray), then P₃ (gray), then P₄ (gray), then P₄ → P₁ — but P₁ is already gray! Cycle detected. A cycle in a single-instance WFG means deadlock. Processes P₁, P₂, P₃, P₄ are all deadlocked.
- 5Step 5
Now consider a different scenario: P₁ → P₂, P₂ → P₃, P₃ → P₄ (no P₄ → P₁). DFS from P₁: P₁ → P₂ → P₃ → P₄. No gray node encountered. No cycle. No deadlock.
- 6Step 6
Remember: The WFG method works ONLY for single-instance resources. For multi-instance resources, you must use the Banker's Detection Algorithm (matrix reduction method) described above.
Detection Frequency: When to Run Detection
Running deadlock detection is not free — it consumes CPU time and can degrade system performance. The frequency of detection involves a trade-off.
| Approach | Description | Pros | Cons |
|---|---|---|---|
| Periodic | Run detection at fixed time intervals (e.g., every 30 seconds, every hour). | Simple to implement; predictable overhead. | Deadlock may go undetected for the entire interval; resources tied up longer. |
| On-demand (event-driven) | Run detection only when a resource request fails (i.e., a process is blocked waiting for a resource). | Deadlocks are detected immediately when they occur. | Overhead spikes under contention; may run detection many times unnecessarily. |
| Hybrid | Run detection periodically, but also trigger it if too many processes become blocked within a short window. | Balances overhead and responsiveness. | More complex to implement. |
Guidelines:
- If deadlocks are rare → run detection infrequently (e.g., every hour or on admin request).
- If deadlocks are frequent → run detection more often (e.g., every few seconds or on each block).
- The cost of detection must be weighed against the cost of the deadlock (tied-up resources, frustrated users).
Deadlock Recovery: What to Do After Detection
Once a deadlock is detected, the system must recover — break the deadlock and allow processes to continue. There are two main approaches:
Option 1: Process Termination
| Strategy | Description | Pros | Cons |
|---|---|---|---|
| Abort all deadlocked processes | Kill every process involved in the deadlock simultaneously. | Simple, guaranteed to break the deadlock. | Wasted computation — all processes lose their work. High restart cost. |
| Abort one process at a time | Kill one process, check if deadlock is resolved; repeat if not. | Minimizes number of processes killed. | High overhead — re-run detection after each abort. Starvation possible if the same process is repeatedly selected. |
Selection factors for choosing which process to abort:
- Priority of the process (lower priority → kill first).
- Runtime — how long the process has been running (more runtime → more work lost).
- Resources held — how many and what type of resources the process holds (more resources → more expensive to restart).
- Resources still needed — how many more resources the process needs to complete.
- Number of processes that will need to be terminated.
Option 2: Resource Preemption
| Strategy | Description | Pros | Cons |
|---|---|---|---|
| Select a victim | Choose a process to preempt (take resources from). | No processes killed — work is not lost entirely. | Rollback/restart required; overhead of saving state. |
| Rollback | The preempted process must be rolled back to a safe checkpoint and restarted from there. | Minimizes reprocessing compared to full abort. | Requires checkpointing infrastructure; may need to reacquire locks. |
| Starvation prevention | Ensure the same process is not repeatedly chosen as the victim (e.g., include number of rollbacks in the selection cost factor). | Fairness — all processes eventually make progress. | Adds complexity to victim selection. |
Comparison: Termination vs. Preemption
| Factor | Process Termination | Resource Preemption |
|---|---|---|
| Work lost | Entire process execution up to the deadlock. | Only execution since the last checkpoint. |
| Complexity | Simple — just kill the process(es). | Complex — requires checkpoints, rollback, resource tracking. |
| Overhead | Restart cost (re-create process, reload data). | Checkpointing overhead (periodic state saving). |
| Fairness | May starve low-priority processes. | Can include rollback count in selection to ensure fairness. |
| When to use | When processes are short-lived or work can be easily re-created. | When processes are long-running and restarting is expensive (e.g., database transactions). |
Knowledge Check
What is a Wait-for Graph (WFG) used for?