Prometheus in Production

Jun 23, 2026

Running Prometheus in production is fundamentally different from running it in development. At scale, you face challenges around high availability, cardinality management, long-term storage, and operational resilience that require deliberate architectural decisions.

Prometheus was designed with a philosophy of locality and simplicity: each Prometheus server scrapes metrics from targets it can reach directly, stores data locally in a TSDB, and evaluates alerts and recording rules independently. This design trades distributed consensus for operational simplicity — but it also means that scaling beyond a single instance requires you to extend the architecture thoughtfully.

The diagram above illustrates a typical production topology: shard by function, send data to long-term storage via remote write, and optionally federate for cross-cluster visibility. Each decision in this architecture (sharding strategy, storage backend, HA approach) comes with trade-offs that the following sections explore in depth [2].

Footnotes

  1. How to Build a Scalable Prometheus Architecture - Logz.io guide on scaling Prometheus with federation and remote write

  2. Prometheus Federation Explained: Architecture & Pitfalls - Groundcover deep dive on federation architecture and challenges

Scaling Stages: Where Are You?

Not every organization needs the same Prometheus architecture. Your approach should match your scale:

| Active Series | Typical Architecture | Key Challenges |
| --- | --- | --- |
| < 100K | Single Prometheus instance | Minimal; focus on alerting quality |
| 100K – 1M | Single instance + vertical scaling | Memory pressure, storage growth |
| 1M – 5M | Functional sharding + remote write | Cardinality explosions, query performance |
| 5M – 50M | Multiple shards + LTS backend (Thanos/Mimir) | Cross-shard queries, deduplication |
| 50M+ | Full distributed stack (Mimir/Thanos/Cortex) | Multi-tenancy, cost management, ingestion reliability |

At the 1–5 million active series mark, a single instance remains viable only with aggressive optimization. The usual answer is functional sharding (one instance for infrastructure, one for application metrics, one for business metrics) combined with remote write to a long-term storage backend, which typically persists data to object storage [1].

Footnotes

  1. Prometheus Scalability: High Cardinality And How To Fix It - Scaling stages from 100K to 50M+ active series

Setting Up a Production-Grade Prometheus Architecture

  1. Step 1

    Decompose your monitoring workload into independent Prometheus instances. Common sharding strategies include:

    • Infrastructure metrics: node_exporter, kube-state-metrics, cAdvisor
    • Application metrics: custom application metrics from instrumented services
    • Business metrics: SLI/SLO calculations, business KPIs

    This provides horizontal scaling via functional decomposition and limits the blast radius of any single instance failing.
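
    As a rough sketch of functional decomposition, the infrastructure shard below scrapes only infrastructure exporters and tags everything it sends onward with a shard label. The shard label name, the job names, and the target addresses are assumptions for illustration, not values from a real deployment.

    ```yaml
    # prometheus-infra.yml: configuration for the infrastructure shard only.
    global:
      scrape_interval: 15s
      external_labels:
        shard: infrastructure   # assumed label name identifying this shard

    scrape_configs:
      - job_name: node
        static_configs:
          - targets: ['node-exporter-1:9100', 'node-exporter-2:9100']  # hypothetical hosts
      - job_name: kube-state-metrics
        static_configs:
          - targets: ['kube-state-metrics:8080']  # hypothetical address
    ```

    The application and business shards would follow the same shape with their own scrape_configs, so that no target is scraped by more than one shard.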

  2. Step 2

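    Run two identical Prometheus replicas scraping the same targets, and use external_labels to give each replica a distinct label. If one replica fails, the other continues collecting metrics and evaluating alerts. The key challenge is deduplicating the two copies of every series, which the long-term storage layer handles:

    • Thanos: the Querier deduplicates at query time using the label named by --query.replica-label (the Compactor's --deduplication.replica-label flag covers offline deduplication of stored blocks)
    • Mimir: the distributor's HA tracker deduplicates at ingestion, accepting samples from one elected replica per cluster
    • Cortex: the same HA tracker mechanism, enabled through configuration

    Note: both replicas will send the same alerts. Alertmanager deduplicates alerts whose label sets are identical, so remove the replica label from outgoing alerts with alert_relabel_configs; group_by and inhibition rules then reduce notification noise further.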
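
    One common way to keep two replicas from producing distinct alerts is to strip the replica label before alerts leave Prometheus, so that Alertmanager sees identical label sets and deduplicates them. The sketch below assumes the external label is named replica and uses a hypothetical Alertmanager address.

    ```yaml
    # Applied on both replicas; "replica" is the assumed external label name.
    global:
      external_labels:
        cluster: us-east-1
        replica: prom-1        # prom-2 on the second replica

    alerting:
      alert_relabel_configs:
        # Remove the replica label from alerts only; stored series keep it.
        - action: labeldrop
          regex: replica
      alertmanagers:
        - static_configs:
            - targets: ['alertmanager:9093']  # hypothetical address
    ```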

  3. Step 3

    Configure remote_write to forward metrics from each Prometheus shard to a long-term storage backend. This decouples local retention (short, fast) from long-term retention (cheap, deep).

        remote_write:
          - url: "https://thanos-receive.example.com/api/v1/receive"
            queue_config:
              capacity: 100000
              max_samples_per_send: 10000
              batch_send_deadline: 5s
              min_shards: 1
              max_shards: 10

    Tune queue parameters based on your ingestion rate. To detect backpressure, compare prometheus_remote_storage_queue_highest_sent_timestamp_seconds with the current time: a growing gap means samples are queuing faster than they are being sent [1].

    Footnotes

    1. Optimizing Prometheus Remote Write Performance - Last9 guide on queue tuning, cardinality management, and relabeling

  4. Step 4

    Use the /federate endpoint to pull pre-aggregated metrics from leaf Prometheus instances into a global view. This is pull-based and selective — only federate recording rule outputs, not raw high-cardinality metrics.

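        scrape_configs:
          - job_name: 'federate'
            scrape_interval: 30s
            honor_labels: true   # keep the job/instance labels set by the leaf servers
            metrics_path: '/federate'
            params:
              'match[]':
                # Pull only recording-rule outputs (names starting with "job:").
                # A selector such as '{job=~".+"}' would federate every raw series.
                - '{__name__=~"job:.+"}'
            static_configs:
              - targets:
                  - 'prometheus-leaf-1:9090'
                  - 'prometheus-leaf-2:9090'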

    Federation is best suited for hierarchical aggregation, not as a replacement for remote write [1].

    Footnotes

    1. Prometheus Federation Explained: Architecture & Pitfalls - Groundcover deep dive on federation architecture and challenges

  5. Step 5

    Configure retention based on environment needs. Production typically uses 30–60 days locally, with long-term storage handling anything beyond that.

    • Development: 3–7 days (~10GB)
    • Staging: ~14 days (~25GB)
    • Production: 30–60 days (100GB+)
    • Compliance: 1+ years (external storage required)

    Use the --storage.tsdb.retention.time and --storage.tsdb.retention.size flags; when both are set, whichever limit is reached first applies. Changing retention requires a restart, and reducing it cannot be undone: data older than the new period is deleted and cannot be recovered [1].
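
    A minimal sketch of setting both flags, assuming Prometheus runs from the prom/prometheus image under Docker Compose. The file paths, volume name, and the 30d/100GB values are illustrative assumptions taken from the production row above.

    ```yaml
    # docker-compose.yml (sketch)
    services:
      prometheus:
        image: prom/prometheus
        command:
          # Overriding command replaces the image defaults, so restate them.
          - --config.file=/etc/prometheus/prometheus.yml
          - --storage.tsdb.path=/prometheus
          - --storage.tsdb.retention.time=30d
          - --storage.tsdb.retention.size=100GB
        volumes:
          - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
          - prometheus-data:/prometheus

    volumes:
      prometheus-data:
    ```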

    Footnotes

    1. How to Configure and Optimize Prometheus Data Retention - Retention settings per environment with storage guidance

Cardinality: The Silent Killer

Cardinality is the single most impactful factor on Prometheus performance in production. Each unique combination of label values creates a separate time series, requiring its own chunk in memory and on disk.

Consider a metric http_requests_total with labels instance (3 values), method (5 values), and endpoint (1,000 values):

$$\text{Cardinality} = 3 \times 5 \times 1000 = 15{,}000 \text{ series}$$

That's 15,000 separate series in the TSDB, each needing memory for its in-memory head chunk, index entries in postings and symbols, and disk I/O for compaction and querying. At a scrape interval of 15s, each series receives 240 samples per hour, so this one metric generates about 3.6 million samples per hour. A real-world optimization at one company reduced active series from 10M to 877K, cutting memory by 92% (from ~60GB to <5GB), simply by removing unnecessary labels and metrics [2].

Detecting High Cardinality

Use Prometheus's built-in introspection tools:

| Metric / Tool | Purpose |
| --- | --- |
| prometheus_tsdb_head_series | Current number of active series in the head block |
| prometheus_tsdb_head_chunks_created_total | Total chunks created in the head; apply rate() to see the creation rate |
| topk(10, count by (__name__)({...})) | Top 10 metric names by series count |
| topk(10, count by (job)({...})) | Top 10 jobs by series count |
| count(count by (<label>)(<metric>)) | Number of distinct values a label takes on a metric, which shows which labels contribute most to cardinality |

Footnotes

  1. Understanding and Optimizing Resource Consumption in Prometheus - Palark case study: 92% memory reduction through cardinality optimization

  2. How to Manage High Cardinality Metrics in Prometheus and Kubernetes - Grafana Labs guide on cardinality management strategies

[Chart: Prometheus Memory Usage by Cardinality Level. Approximate memory consumption at different active series counts.]

    # Drop high-cardinality series and unused metrics at scrape time
    scrape_configs:
      - job_name: 'my-app'
        metrics_path: /metrics
        static_configs:
          - targets: ['app:8080']
        metric_relabel_configs:
          # Drop every series whose endpoint label is a per-user path.
          # Note: "drop" removes the whole series, not just the label.
          - source_labels: [endpoint]
            regex: '/api/v[0-9]+/users/\d+'
            action: drop
          # Drop unused metrics entirely
          - source_labels: [__name__]
            regex: 'go_gc_duration_seconds.*'
            action: drop

Cardinality Explosion Warning

Never add labels with unbounded values (user IDs, request IDs, IP addresses) to Prometheus metrics. A single metric with a user_id label and 100,000 users creates 100,000 time series. Instead, aggregate at the application level and expose pre-computed metrics, or use exemplars to attach trace IDs without increasing series count.
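
As a guardrail against accidental explosions, per-job scrape limits can reject a target that suddenly exposes far more data than expected. The sketch below uses the scrape_config options sample_limit, label_limit, and label_value_length_limit; the specific numbers are assumptions and should be sized from your own baseline.

```yaml
scrape_configs:
  - job_name: my-app
    static_configs:
      - targets: ['app:8080']
    # If a scrape returns more samples than this (after metric relabeling),
    # the whole scrape is treated as failed.
    sample_limit: 50000
    # Maximum number of labels allowed on any one sample.
    label_limit: 30
    # Maximum length of any label value.
    label_value_length_limit: 200
```

These limits do not reduce cardinality by themselves. They make a runaway target fail loudly (its up series drops to 0) instead of silently inflating the head block.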

Long-Term Storage Solutions: Deep Comparison

Recording Rules: Pre-Computing for Performance

Recording rules are the most underutilized tool in the Prometheus toolbox at scale. They allow you to:

  1. Reduce query latency: Dashboards query a pre-computed recording rule instead of running expensive PromQL at render time
  2. Reduce cardinality: Aggregate away high-cardinality dimensions (e.g., drop endpoint while keeping job and method)
  3. Compose complex expressions: Break multi-stage calculations into named, debuggable intermediates

Naming convention is critical for maintainability. The Prometheus community recommends the pattern:

level:metric:operations

For example: job:http_requests:rate5m, service:availability:ratio_rate5m, cluster:node_cpu:avg_5m

This convention encodes the aggregation level, the source metric, and the operations applied, making it immediately clear what a recording rule produces and how dashboards and alerts can consume it [1].
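
A short rule file following this naming pattern might look like the sketch below. It assumes the http_requests_total counter with job, method, and endpoint labels used earlier; the group name and the intermediate rule name are illustrative.

```yaml
groups:
  - name: http_recording_rules
    interval: 30s
    rules:
      # Per-job, per-method request rate with the endpoint label aggregated away.
      - record: job_method:http_requests:rate5m
        expr: sum by (job, method) (rate(http_requests_total[5m]))
      # Per-job request rate, built from the rule above.
      - record: job:http_requests:rate5m
        expr: sum by (job) (job_method:http_requests:rate5m)
```

Rules within a group are evaluated in order, so the second rule can build on the output of the first.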

Rule Organization Best Practices

Structure your rule files logically to support team ownership and independent deployment:

/etc/prometheus/rules/
├── recording_rules/
│   ├── infrastructure.yml
│   ├── application.yml
│   └── sli_slo.yml
└── alerting_rules/
    ├── critical_alerts.yml
    └── warning_alerts.yml

Always validate rules before deploying with promtool check rules <file>. In Kubernetes environments, use the PrometheusRule custom resource provided by the Prometheus Operator for declarative rule management [1].
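
For the Kubernetes case, a PrometheusRule object wraps the same rule groups in a custom resource. In this sketch the resource name, namespace, and release label are assumptions; the label in particular has to match whatever ruleSelector your Prometheus resource is configured with.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: application-recording-rules
  namespace: monitoring          # assumed namespace
  labels:
    release: prometheus          # assumed; must match the Prometheus ruleSelector
spec:
  groups:
    - name: application
      rules:
        - record: job:http_requests:rate5m
          expr: sum by (job) (rate(http_requests_total[5m]))
```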

Footnotes

  1. Prometheus Recording Rules Documentation - Official Prometheus docs on recording and alerting rule syntax

Prometheus Production Maturity Lifecycle

Stage 1: Single Instance

Deploy a single Prometheus server with default config. Focus on scrape coverage, basic alerting, and Grafana dashboards. Suitable for <100K active series.

Stage 2: Vertical Scaling & Optimization

Increase resources, tune scrape intervals, add recording rules, manage cardinality. Introduce metric_relabel_configs to drop unneeded metrics. Suitable for 100K–1M active series.

Stage 3: Functional Sharding

Split into multiple Prometheus instances by domain (infra, app, business). Add remote_write for long-term storage. Suitable for 1M–5M active series.

Stage 4: HA Replicas & Deduplication

Deploy replica Prometheus instances for high availability. Configure external_labels for replica identification. Set up deduplication at the LTS layer (Thanos/Mimir).

Stage 5: Distributed Observability Stack

Full Thanos/Mimir/Cortex deployment with query frontends, caching, multi-tenancy. Federation for global views. Suitable for 5M–50M+ active series. Requires dedicated SRE effort for observability infrastructure.

Remote Write Queue Tuning

Monitor prometheus_remote_storage_queue_highest_sent_timestamp_seconds: if it falls increasingly behind time(), your queues are backing up. Increase max_shards or max_samples_per_send incrementally. Avoid raising max_shards sharply without monitoring network utilization and receiver load, as this can overwhelm the receiver and cause cascading failures [1].

Footnotes

  1. Optimizing Prometheus Remote Write Performance - Last9 guide on queue tuning, cardinality management, and relabeling

High Availability: Making Prometheus Resilient

Prometheus has no built-in clustering — HA is achieved by running redundant, independent instances. The key design principle is: each replica scrapes the same targets independently and evaluates the same rules independently. This means:

  • No shared state: Each replica maintains its own TSDB
  • No split-brain risk: There's no consensus protocol to worry about
  • Duplicate data is expected: Deduplication happens at the query layer (Thanos/Mimir/Cortex)

External labels are essential for HA deduplication. Configure unique replica labels:

    global:
      external_labels:
        cluster: 'us-east-1'
        replica: 'prom-1'

The LTS backend uses the replica label to identify and deduplicate data from both instances [1].

Footnotes

  1. Prometheus High Availability (HA) - New Relic documentation on HA configuration with external labels

Monitoring Prometheus Itself

A production Prometheus that isn't monitored is a ticking time bomb. Key self-monitoring metrics:

| Metric | What It Tells You | Action Threshold |
| --- | --- | --- |
| prometheus_tsdb_head_series | Active series count | > 80% of historical peak |
| process_resident_memory_bytes | Memory consumption | > 80% of available RAM |
| prometheus_remote_storage_queue_highest_sent_timestamp_seconds | Remote write lag | > 60s behind time() |
| prometheus_target_sync_length_seconds_sum | SD processing overhead | Increasing trend |
| prometheus_tsdb_compactions_total | Compaction workload | Sustained high rate |
| prometheus_rule_evaluation_duration_seconds | Rule evaluation time | p99 > 1s |
| prometheus_sd_refresh_duration_seconds | Service discovery latency | Increasing trend |

Set alerts on:

  • Memory pressure: process_resident_memory_bytes > 0.8 * available_memory
  • Remote write backlog: time() - prometheus_remote_storage_queue_highest_sent_timestamp_seconds > 300
  • Scrape failures: up == 0 for critical targets for > 5 minutes
  • Rule evaluation delays: prometheus_rule_evaluation_duration_seconds{quantile="0.99"} > 10
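
Three of the alerts above translate directly into a rule file, sketched below. The alert names, severity labels, for durations, and the critical-service job name are assumptions. The memory-pressure alert is omitted because it depends on how available memory is exposed in your environment.

```yaml
groups:
  - name: prometheus_self_monitoring
    rules:
      - alert: PrometheusRemoteWriteBehind
        expr: time() - prometheus_remote_storage_queue_highest_sent_timestamp_seconds > 300
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Remote write from {{ $labels.instance }} is more than 5 minutes behind"
      - alert: CriticalTargetDown
        expr: up{job="critical-service"} == 0   # hypothetical job name
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Target {{ $labels.instance }} has been down for 5 minutes"
      - alert: PrometheusSlowRuleEvaluation
        expr: prometheus_rule_evaluation_duration_seconds{quantile="0.99"} > 10
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "p99 rule evaluation on {{ $labels.instance }} exceeds 10 seconds"
```

Because a Prometheus server cannot alert on its own outage, rules like these are usually evaluated by a second instance or by the HA peer.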

Prometheus in Production — Key Concepts

Question: What is the primary mechanism for scaling Prometheus beyond a single instance?

Answer: Functional sharding, that is, splitting metrics by domain (infra, app, business) across separate Prometheus instances, combined with remote_write for long-term storage and federation for cross-cluster views.

[Chart: Long-Term Storage Solutions Comparison, evaluated across 6 dimensions for production deployment.]

Knowledge Check


You notice that a single Prometheus instance is consuming 50GB of memory with 10 million active series. What is the most effective first step to reduce resource consumption?
