Prometheus in Production
Running Prometheus in production is fundamentally different from running it in development. At scale, you face challenges around high availability, cardinality management, long-term storage, and operational resilience that require deliberate architectural decisions.
Prometheus was designed with a philosophy of locality and simplicity: each Prometheus server scrapes metrics from targets it can reach directly, stores data locally in a TSDB, and evaluates alerts and recording rules independently. This design trades distributed consensus for operational simplicity — but it also means that scaling beyond a single instance requires you to extend the architecture thoughtfully.
The diagram above illustrates a typical production topology: shard by function, send to long-term storage via remote write, and optionally federate for cross-cluster visibility. Each decision in this architecture (sharding strategy, storage backend, HA approach) comes with trade-offs that we'll explore in depth [2].
Footnotes
1. How to Build a Scalable Prometheus Architecture - Logz.io guide on scaling Prometheus with federation and remote write
2. Prometheus Federation Explained: Architecture & Pitfalls - Groundcover deep dive on federation architecture and challenges
Scaling Stages: Where Are You?
Not every organization needs the same Prometheus architecture. Your approach should match your scale:
| Active Series | Typical Architecture | Key Challenges |
|---|---|---|
| < 100K | Single Prometheus instance | Minimal — focus on alerting quality |
| 100K – 1M | Single instance + vertical scaling | Memory pressure, storage growth |
| 1M – 5M | Functional sharding + remote write | Cardinality explosions, query performance |
| 5M – 50M | Multiple shards + LTS backend (Thanos/Mimir) | Cross-shard queries, deduplication |
| 50M+ | Full distributed stack (Mimir/Thanos/Cortex) | Multi-tenancy, cost management, ingestion reliability |
At the 1–5 million active series mark, a single instance stays viable only with aggressive optimization. Most teams move to functional sharding at this point (one instance for infrastructure, one for application metrics, one for business metrics) and use remote write to a long-term storage backend, typically backed by object storage, for retention beyond the local window.
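The table above can be read as a simple lookup. The sketch below encodes its thresholds; `scaling_stage` is a hypothetical helper written for illustration, not part of any Prometheus tooling, and the boundaries are treated as half-open ranges by assumption.

```python
# Upper bounds (exclusive) on active series, taken from the table above.
STAGES = [
    (100_000, "Single Prometheus instance"),
    (1_000_000, "Single instance + vertical scaling"),
    (5_000_000, "Functional sharding + remote write"),
    (50_000_000, "Multiple shards + LTS backend (Thanos/Mimir)"),
]
TOP_STAGE = "Full distributed stack (Mimir/Thanos/Cortex)"


def scaling_stage(active_series: int) -> str:
    """Return the typical architecture for a given active series count."""
    for upper_bound, architecture in STAGES:
        if active_series < upper_bound:
            return architecture
    return TOP_STAGE


if __name__ == "__main__":
    for count in (50_000, 750_000, 3_000_000, 20_000_000, 80_000_000):
        print(f"{count:>12,} active series -> {scaling_stage(count)}")
```

Feeding it the current value of prometheus_tsdb_head_series gives a quick indication of which row of the table applies.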
Footnotes
1. Prometheus Scalability: High Cardinality And How To Fix It - Scaling stages from 100K to 50M+ active series
Setting Up a Production-Grade Prometheus Architecture
Step 1
Decompose your monitoring workload into independent Prometheus instances. Common sharding strategies include:
- Infrastructure metrics: node_exporter, kube-state-metrics, cadvisor
- Application metrics: custom application metrics from instrumented services
- Business metrics: SLI/SLO calculations, business KPIs
This provides horizontal scaling via functional decomposition and limits the blast radius of any single instance failing.
Step 2
Run two identical Prometheus replicas scraping the same targets. Use `external_labels` to identify each replica. This ensures that if one replica fails, the other continues collecting and evaluating alerts. The key challenge is deduplication at query time; handle this via:
- Thanos: built-in deduplication via `--query.replica-label` on the Querier (the Compactor's `--deduplication.replica-label` covers offline deduplication)
- Mimir: automatic deduplication based on replica labels
- Cortex: configurable deduplication
Note: both replicas will send alerts. Alertmanager deduplicates alerts whose label sets are identical, so point both replicas at the same Alertmanager and drop the replica label with `alert_relabel_configs`. Then use `group_by` and inhibition rules to keep notifications concise.
Step 3
Configure `remote_write` to forward metrics from each Prometheus shard to a long-term storage backend. This decouples local retention (short, fast) from long-term retention (cheap, deep).

```yaml
remote_write:
  - url: "https://thanos-receive.example.com/api/v1/receive"
    queue_config:
      capacity: 100000            # samples buffered per shard
      max_samples_per_send: 10000 # batch size per request
      batch_send_deadline: 5s     # flush a partial batch after this long
      min_shards: 1
      max_shards: 10              # upper bound on parallel senders
```

Tune queue parameters based on your ingestion rate. Monitor `prometheus_remote_storage_queue_highest_sent_timestamp_seconds` to detect backpressure.

Footnotes
1. Optimizing Prometheus Remote Write Performance - Last9 guide on queue tuning, cardinality management, and relabeling
Step 4
Use the `/federate` endpoint to pull pre-aggregated metrics from leaf Prometheus instances into a global view. This is pull-based and selective: only federate recording rule outputs, not raw high-cardinality metrics.

```yaml
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 30s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        # Pull only recording rule outputs (names like job:...), not raw series
        - '{__name__=~"job:.+"}'
    static_configs:
      - targets:
          - 'prometheus-leaf-1:9090'
          - 'prometheus-leaf-2:9090'
```

Federation is best suited for hierarchical aggregation, not as a replacement for remote write.

Footnotes
1. Prometheus Federation Explained: Architecture & Pitfalls - Groundcover deep dive on federation architecture and challenges
Step 5
Configure retention based on environment needs. Production typically uses 30–60 days locally, with long-term storage handling anything beyond that.
- Development: 3–7 days (~10GB)
- Staging: ~14 days (~25GB)
- Production: 30–60 days (100GB+)
- Compliance: 1+ years (external storage required)
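The sizes above depend heavily on ingestion rate. The Prometheus storage documentation gives a rule of thumb: needed disk space ≈ retention time in seconds × ingested samples per second × bytes per sample, with roughly 1 to 2 bytes per sample after compression. The sketch below applies it; the function name and the example workload (15,000 series scraped every 15s) are assumptions chosen for illustration.

```python
def estimate_disk_bytes(
    retention_days: float,
    active_series: int,
    scrape_interval_s: float,
    bytes_per_sample: float = 2.0,  # assumption: upper end of the 1-2 byte range
) -> float:
    """Rough local TSDB size: retention_seconds * samples_per_second * bytes_per_sample."""
    samples_per_second = active_series / scrape_interval_s
    retention_seconds = retention_days * 24 * 60 * 60
    return retention_seconds * samples_per_second * bytes_per_sample


if __name__ == "__main__":
    size = estimate_disk_bytes(retention_days=30, active_series=15_000, scrape_interval_s=15)
    print(f"~{size / 1e9:.2f} GB for 30 days")
```

Treat the result as a lower bound for planning: the WAL, compaction headroom, and series churn all add to real disk usage.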
Use the `--storage.tsdb.retention.size` and `--storage.tsdb.retention.time` flags. Changing retention requires a restart, and reducing it cannot be undone: data older than the new period is purged.
Footnotes
1. How to Configure and Optimize Prometheus Data Retention - Retention settings per environment with storage guidance
Cardinality: The Silent Killer
Cardinality is the single most impactful factor on Prometheus performance in production. Each unique combination of label values creates a separate time series, requiring its own chunk in memory and on disk.
Consider a metric http_requests_total with labels instance (3 values), method (5 values), and endpoint (1,000 values). In the worst case, every combination occurs: 3 × 5 × 1,000 = 15,000 time series.
That's 15,000 separate series in the TSDB, each needing memory for in-memory head chunks, index entries in postings and symbols, and disk I/O for compaction and querying. At a scrape interval of 15s, this generates 3.6 million samples per hour across all series (15,000 series × 240 scrapes per hour). A real-world optimization at one company reduced their active series from 10M to 877K, a 92% memory reduction (from ~60GB to <5GB), simply by removing unnecessary labels and metrics [2].
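The arithmetic for the http_requests_total example can be reproduced in a few lines. This is a worst-case sketch that assumes every label combination actually occurs; the helper names are illustrative.

```python
from math import prod
from typing import Dict


def worst_case_series(label_value_counts: Dict[str, int]) -> int:
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(label_value_counts.values())


def samples_per_hour(series: int, scrape_interval_s: int) -> int:
    """Each series yields one sample per scrape."""
    return series * (3600 // scrape_interval_s)


if __name__ == "__main__":
    series = worst_case_series({"instance": 3, "method": 5, "endpoint": 1_000})
    print(series)                        # 15000
    print(samples_per_hour(series, 15))  # 3600000
```

Because the bound is a product, adding one more label with 10 values multiplies the series count by 10 rather than adding 10.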
Detecting High Cardinality
Use Prometheus's built-in introspection tools:
| Metric / Tool | Purpose |
|---|---|
| `prometheus_tsdb_head_series` | Current number of active series in the head block |
| `prometheus_tsdb_head_chunks_created_total` | Rate of new chunk creation |
| `topk(10, count by (__name__)({...}))` | Top 10 metric names by series count |
| `topk(10, count by (job)({...}))` | Top 10 jobs by series count |
| `label_join` / `label_replace` | Reshape labels in ad hoc queries when investigating which labels contribute most to cardinality |
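The same questions can be asked offline against a list of series label sets. The sketch below mirrors the "top metric names by series count" query and also counts distinct values per label; the input format (one dict of labels per series) and the function names are assumptions for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

LabelSet = Dict[str, str]


def series_per_metric(series: List[LabelSet], k: int = 10) -> List[Tuple[str, int]]:
    """Top-k metric names by number of series."""
    counts = Counter(labels["__name__"] for labels in series)
    return counts.most_common(k)


def distinct_values_per_label(series: List[LabelSet]) -> Dict[str, int]:
    """Number of distinct values seen for each label (excluding __name__)."""
    values = defaultdict(set)
    for labels in series:
        for name, value in labels.items():
            if name != "__name__":
                values[name].add(value)
    return {name: len(seen) for name, seen in values.items()}


if __name__ == "__main__":
    sample = [
        {"__name__": "http_requests_total", "method": "GET", "endpoint": "/a"},
        {"__name__": "http_requests_total", "method": "GET", "endpoint": "/b"},
        {"__name__": "http_requests_total", "method": "POST", "endpoint": "/c"},
        {"__name__": "up", "job": "node"},
    ]
    print(series_per_metric(sample, k=1))
    print(distinct_values_per_label(sample))
```

Labels with the most distinct values are usually the first candidates for removal or aggregation.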
Footnotes
1. Understanding and Optimizing Resource Consumption in Prometheus - Palark case study: 92% memory reduction through cardinality optimization
2. How to Manage High Cardinality Metrics in Prometheus and Kubernetes - Grafana Labs guide on cardinality management strategies
Figure: Prometheus memory usage by cardinality level (approximate memory consumption at different active series counts).
Cardinality Explosion Warning
Never add labels with unbounded values (user IDs, request IDs, IP addresses) to Prometheus metrics. A single metric with a user_id label and 100,000 users creates 100,000 time series. Instead, aggregate at the application level and expose pre-computed metrics, or use exemplars to attach trace IDs without increasing series count.
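One common way to keep a label bounded is to normalize its values before they reach the metric. The sketch below collapses numeric and UUID-like path segments into a placeholder so that an endpoint label takes one value per route rather than one per resource. The function name, the `:id` placeholder, and the choice of patterns are assumptions for illustration; a real service would typically use its router's own route templates.

```python
import re

# Matches a path segment that is all digits or shaped like a UUID (8-4-4-4-12 hex).
_ID_SEGMENT = re.compile(r"/(\d+|[0-9a-fA-F]{8}-[0-9a-fA-F-]{27})(?=/|$)")


def normalize_endpoint(path: str) -> str:
    """Replace ID-like path segments with ':id' to bound label cardinality."""
    return _ID_SEGMENT.sub("/:id", path)


if __name__ == "__main__":
    print(normalize_endpoint("/users/12345/orders/987"))
    print(normalize_endpoint("/items/123e4567-e89b-12d3-a456-426614174000"))
    print(normalize_endpoint("/v2/users"))
```

With this in place, 100,000 users hitting /users/<id> produce one endpoint value instead of 100,000.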
Long-Term Storage Solutions: Deep Comparison
Recording Rules: Pre-Computing for Performance
Recording rules are the most underutilized tool in the Prometheus toolbox at scale. They allow you to:
- Reduce query latency: Dashboards query a pre-computed recording rule instead of running expensive PromQL at render time
- Reduce cardinality: Aggregate away high-cardinality dimensions (e.g., drop `endpoint` while keeping `job` and `method`)
- Compose complex expressions: Break multi-stage calculations into named, debuggable intermediates
Naming convention is critical for maintainability. The Prometheus community recommends the pattern level:metric:operations.
For example: job:http_requests:rate5m, service:availability:ratio_rate5m, cluster:node_cpu:avg_5m
This convention encodes the aggregation level, the source metric, and the operation applied, making it immediately clear what a recording rule produces and how it can be consumed by dashboards and alerts.
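A naming convention is easier to keep when it is checked mechanically. The sketch below splits a recording rule name into its three colon-separated parts and rejects names that do not fit; `parse_rule_name` and `RuleName` are hypothetical helpers, and the character set allowed in each part is an assumption.

```python
import re
from typing import NamedTuple


class RuleName(NamedTuple):
    level: str       # aggregation level, e.g. "job"
    metric: str      # source metric, e.g. "http_requests"
    operations: str  # operations applied, e.g. "rate5m"


_PART = r"[a-zA-Z_][a-zA-Z0-9_]*"
_RULE_NAME = re.compile(rf"^({_PART}):({_PART}):({_PART})$")


def parse_rule_name(name: str) -> RuleName:
    """Split 'level:metric:operations' or raise ValueError if it does not match."""
    match = _RULE_NAME.match(name)
    if match is None:
        raise ValueError(f"not a level:metric:operations name: {name!r}")
    return RuleName(*match.groups())


if __name__ == "__main__":
    print(parse_rule_name("job:http_requests:rate5m"))
```

A check like this could run in CI alongside promtool to catch rule names that drift from the convention.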
Rule Organization Best Practices
Structure your rule files logically to support team ownership and independent deployment:
```
/etc/prometheus/rules/
├── recording_rules/
│   ├── infrastructure.yml
│   ├── application.yml
│   └── sli_slo.yml
└── alerting_rules/
    ├── critical_alerts.yml
    └── warning_alerts.yml
```

Always validate rules before deploying using `promtool check rules <file>`. In Kubernetes environments, use the PrometheusRule custom resource provided by the Prometheus Operator for declarative rule management.
Footnotes
1. Prometheus Recording Rules Documentation - Official Prometheus docs on recording and alerting rule syntax
Prometheus Production Maturity Lifecycle
Stage 1: Single Instance
Deploy a single Prometheus server with default config. Focus on scrape coverage, basic alerting, and Grafana dashboards. Suitable for <100K active series.
Stage 2: Vertical Scaling & Optimization
Increase resources, tune scrape intervals, add recording rules, manage cardinality. Introduce metric_relabel_configs to drop unneeded metrics. Suitable for 100K–1M active series.
Stage 3: Functional Sharding
Split into multiple Prometheus instances by domain (infra, app, business). Add remote_write for long-term storage. Suitable for 1M–5M active series.
Stage 4: HA Replicas & Deduplication
Deploy replica Prometheus instances for high availability. Configure external_labels for replica identification. Set up deduplication at the LTS layer (Thanos/Mimir).
Stage 5: Distributed Observability Stack
Full Thanos/Mimir/Cortex deployment with query frontends, caching, multi-tenancy. Federation for global views. 5M–50M+ active series. Dedicated SRE effort for observability infrastructure.
Remote Write Queue Tuning
Monitor prometheus_remote_storage_queue_highest_sent_timestamp_seconds. If it falls behind time(), your queues are backing up. Increase max_shards or max_samples_per_send incrementally, and watch network utilization as you do: too many shards can overwhelm the receiver and cause cascading failures.
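For a first guess at queue settings, a back-of-envelope model can help. The sketch below assumes each shard sends one batch at a time and that every request takes a fixed latency; this is a deliberate simplification and not Prometheus's actual resharding logic. The function names and example numbers are illustrative.

```python
import math


def remote_write_lag_s(now_s: float, highest_sent_timestamp_s: float) -> float:
    """Lag between wall-clock time and the newest sample successfully sent."""
    return now_s - highest_sent_timestamp_s


def required_shards(
    samples_per_second: float,
    max_samples_per_send: int,
    send_latency_s: float,
) -> int:
    """Shards needed if each shard sends one full batch per round trip."""
    per_shard_throughput = max_samples_per_send / send_latency_s
    return max(1, math.ceil(samples_per_second / per_shard_throughput))


if __name__ == "__main__":
    # Assumed workload: 200k samples/s, 10k-sample batches, 250 ms per request.
    print(required_shards(200_000, 10_000, 0.25))
    print(remote_write_lag_s(1_700_000_300, 1_700_000_000))
```

If the estimate sits close to max_shards, the queue has little headroom and any rise in receiver latency will show up as growing lag.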
Footnotes
1. Optimizing Prometheus Remote Write Performance - Last9 guide on queue tuning, cardinality management, and relabeling
High Availability: Making Prometheus Resilient
Prometheus has no built-in clustering — HA is achieved by running redundant, independent instances. The key design principle is: each replica scrapes the same targets independently and evaluates the same rules independently. This means:
- No shared state: Each replica maintains its own TSDB
- No split-brain risk: There's no consensus protocol to worry about
- Duplicate data is expected: Deduplication happens in the long-term storage layer (at query time in Thanos, at ingestion in Mimir and Cortex)
External labels are essential for HA deduplication. Configure unique replica labels:

```yaml
global:
  external_labels:
    cluster: 'us-east-1'
    replica: 'prom-1'   # unique per replica, e.g. 'prom-2' on the second instance
```

The LTS backend uses the replica label to identify and deduplicate data from both instances.
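To make the role of the replica label concrete, here is a toy deduplication pass. It drops the replica label, treats what remains as the series identity, and keeps one sample per identity and timestamp. This is an illustration of the idea only: it assumes both replicas report identical timestamps, which real replicas do not, and it is not the algorithm Thanos, Mimir, or Cortex actually use.

```python
from typing import Dict, List, Tuple

# (labels, timestamp in ms, value)
Sample = Tuple[Dict[str, str], int, float]


def deduplicate(samples: List[Sample], replica_label: str = "replica") -> List[Sample]:
    """Merge samples from HA replicas by ignoring the replica label."""
    merged = {}
    for labels, timestamp_ms, value in samples:
        # Series identity is every label except the replica label.
        identity = tuple(sorted((k, v) for k, v in labels.items() if k != replica_label))
        # Keep the first sample seen for each (series, timestamp) pair.
        merged.setdefault((identity, timestamp_ms), value)
    return [(dict(identity), ts, value) for (identity, ts), value in sorted(merged.items())]


if __name__ == "__main__":
    data = [
        ({"__name__": "up", "cluster": "us-east-1", "replica": "prom-1"}, 1000, 1.0),
        ({"__name__": "up", "cluster": "us-east-1", "replica": "prom-2"}, 1000, 1.0),
        ({"__name__": "up", "cluster": "us-east-1", "replica": "prom-2"}, 2000, 1.0),
    ]
    for sample in deduplicate(data):
        print(sample)
```

The third sample shows why HA helps: prom-1 missed the scrape at 2000 ms, and the merged series still has a point there from prom-2.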
Footnotes
1. Prometheus High Availability (HA) - New Relic documentation on HA configuration with external labels
Monitoring Prometheus Itself
A production Prometheus that isn't monitored is a ticking time bomb. Key self-monitoring metrics:
| Metric | What It Tells You | Action Threshold |
|---|---|---|
| `prometheus_tsdb_head_series` | Active series count | > 80% of historical peak |
| `process_resident_memory_bytes` | Memory consumption | > 80% of available RAM |
| `prometheus_remote_storage_queue_highest_sent_timestamp_seconds` | Remote write lag | > 60s behind time() |
| `prometheus_target_sync_length_seconds_sum` | SD processing overhead | Increasing trend |
| `prometheus_tsdb_compactions_total` | Compaction workload | Sustained high rate |
| `prometheus_rule_evaluation_duration_seconds` | Rule evaluation time | p99 > 1s |
| `prometheus_sd_refresh_duration_seconds` | Service discovery latency | Increasing trend |
Set alerts on:
- Memory pressure: `process_resident_memory_bytes > 0.8 * available_memory`
- Remote write backlog: `time() - prometheus_remote_storage_queue_highest_sent_timestamp_seconds > 300`
- Scrape failures: `up == 0` for critical targets for > 5 minutes
- Rule evaluation delays: `prometheus_rule_evaluation_duration_seconds{quantile="0.99"} > 10`
Figure: Long-term storage solutions comparison, evaluated across 6 dimensions for production deployment.
Knowledge Check
You notice that a single Prometheus instance is consuming 50GB of memory with 10 million active series. What is the most effective first step to reduce resource consumption?