unsubbed.co

Prometheus

An open-source monitoring system with a dimensional data model, flexible query language, efficient time series database and modern alerting approach.

Open-source metrics and alerting, honestly reviewed. Written for founders who’ve just discovered their cloud monitoring bill.

TL;DR

  • What it is: Open-source (Apache-2.0) monitoring system and time-series database — the de facto standard for metrics collection in cloud-native infrastructure [README].
  • Who it’s for: Engineering teams running Kubernetes, containers, or distributed services who need deep visibility into system health. Less suited to non-technical founders who want a dashboard without hiring a DevOps person [2][4].
  • Cost savings: AWS Managed Service for Prometheus costs ~$0.03 per million samples — at 50,000 samples/sec that’s roughly $3,900/month. Self-managed on your own servers runs ~$2,500/month including labor at the same scale [3]. For small workloads, self-hosted Prometheus costs little more than the $20/mo VPS it runs on.
  • Key strength: The undisputed standard for cloud-native metrics. 63,198 GitHub stars. Second project to graduate from CNCF after Kubernetes. Grafana, Kubernetes, and every major cloud provider treat it as the default [README][1].
  • Key weakness: Long-term storage is not built in — default retention is 15 days [4]. Horizontal scaling is genuinely painful without Thanos, Cortex, or Mimir bolted on. PromQL has a real learning curve [2][4].

What is Prometheus

Prometheus is a monitoring system and time-series database. You point it at your applications and infrastructure, it scrapes numeric metrics on a schedule, stores them with timestamps and labels, and lets you query, alert, and dashboard against that data.

It was built at SoundCloud starting in 2012, open-sourced in 2015, and donated to the Cloud Native Computing Foundation in 2016 as its second hosted project; in 2018 it became the second project to graduate, after Kubernetes. Graduation means vendor-neutral governance, not a single company’s interests driving the roadmap [docs].

The technical model is distinctive. Rather than requiring your applications to push data to a central collector, Prometheus pulls — it reaches out to your services at configured intervals and scrapes a /metrics endpoint. This pull model means Prometheus controls the collection cadence, can detect when a target goes down, and doesn’t require you to open inbound firewall ports on your monitoring server [README][1].

Metrics are stored as time series: a metric name (like http_requests_total) plus a set of key-value labels (like {method="POST", status="200"}) plus a sequence of timestamped float64 values. The label model is what makes it “dimensional” — you can slice and aggregate across any combination of labels at query time using PromQL [docs][README].
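As a concrete illustration (the metric values here are made up), a scrape response in the Prometheus exposition format, and a PromQL query over the resulting series, might look like:

```text
# What Prometheus sees when it scrapes /metrics:
http_requests_total{method="POST", status="200"} 1027
http_requests_total{method="POST", status="500"} 3
http_requests_total{method="GET",  status="200"} 8413

# PromQL at query time: per-second request rate over the last
# 5 minutes, aggregated across all labels except status
sum by (status) (rate(http_requests_total[5m]))
```

Each unique label combination is its own time series, which is why high label cardinality is the thing to watch for in memory usage.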

The project has 63,198 GitHub stars. Every major observability platform — Grafana, Datadog, New Relic, Dynatrace — supports Prometheus metrics format as an import or export target. It’s the lingua franca of cloud-native metrics.


Why people choose it

The reviews and community discussion converge on the same story: Prometheus wins because it became the standard before most alternatives existed, and the standard is hard to dislodge.

It’s what Kubernetes expects. Kubernetes exposes metrics in Prometheus format by default. The entire cloud-native tooling ecosystem — service meshes, API gateways, ingress controllers, databases — ships with /metrics endpoints that Prometheus can scrape with zero configuration [1]. One TrustRadius reviewer summarizes it flatly: “it helps cover all the useful metrics which have to be monitored and enables complex alerting rules with customized notification channels” [2].

The pull model solves real problems. In a dynamic environment where containers start and stop, requiring every service to know where to push metrics creates configuration drift. Prometheus flips it: services just expose a /metrics endpoint, and Prometheus discovers them via Kubernetes API, Consul, or static config [README][1]. This also means if a target disappears, Prometheus knows — it’s not waiting for a push that never arrives.
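A minimal scrape configuration sketches how this works in practice; the job names, targets, and port below are hypothetical, not defaults:

```yaml
# prometheus.yml -- minimal sketch; targets and job names are examples
global:
  scrape_interval: 15s          # how often Prometheus pulls each target

scrape_configs:
  - job_name: "my-api"          # a hypothetical service exposing /metrics
    static_configs:
      - targets: ["api.internal:8080"]

  - job_name: "kubernetes-pods" # dynamic discovery via the Kubernetes API
    kubernetes_sd_configs:
      - role: pod
```

Static targets and dynamically discovered pods coexist in one config, which is what makes the pull model workable as containers churn.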

The cost case is clear for engineers who can run it. A TrustRadius reviewer notes directly: “being open-source eliminates licensing fees, allowing organizations to solve monitoring problems without cost” [2]. For teams with DevOps capacity, self-hosted Prometheus at small to medium scale is free infrastructure versus four-figure monthly bills for managed alternatives [3].

The ecosystem is genuinely vast. Hundreds of exporters exist for hardware, databases, message queues, HTTP endpoints, and cloud services. The community has built exporters for almost every system you’d want to monitor, and Grafana has become the standard visualization layer on top [1][2].

Where Prometheus loses the argument: PromQL. TrustRadius reviewers specifically call it out — “somewhere complex while building advanced dashboards” [2]. It’s expressive, but it’s not intuitive if you’re not already thinking in terms of rate functions and label selectors. Non-engineers don’t use Prometheus directly; they use Grafana dashboards built by engineers.


Features

Core metrics engine:

  • Multi-dimensional time series model with metric names and key/value label sets [README]
  • HTTP pull model — Prometheus scrapes /metrics endpoints at configured intervals [README][1]
  • Pushgateway for batch jobs and ephemeral targets that can’t be scraped [README]
  • Service discovery: Kubernetes, Consul, EC2, DNS, file-based, and more [README][1]
  • PromQL: flexible query language for filtering, aggregation, rate calculations, and joins across label dimensions [docs]
  • Built-in TSDB with configurable retention (default 15 days — see Deployment section for why this matters) [4]
  • Alerting rules written in PromQL, firing to Alertmanager which handles deduplication, silencing, and routing to PagerDuty, Slack, email [README]
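A hedged sketch of what an alerting rule looks like in practice (the metric name follows common conventions and the threshold is illustrative, not a recommendation):

```yaml
# Example rule file; Alertmanager then handles routing and dedup
groups:
  - name: example
    rules:
      - alert: HighErrorRate
        # error ratio: 5xx responses as a fraction of all requests
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 10m                  # must hold for 10 minutes before firing
        labels:
          severity: page
        annotations:
          summary: "5xx error ratio above 5% for 10 minutes"
```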

Alertmanager (separate component):

  • Groups related alerts to reduce notification noise
  • Routes to different receivers based on labels (e.g., database alerts → DBA on-call, frontend alerts → frontend team)
  • Silencing and inhibition rules
  • High-availability clustering for alert delivery [README]

Visualization:

  • Basic expression browser and console templates built in — not production-grade dashboards [README]
  • Grafana is the standard pairing — essentially every Prometheus deployment ends up with Grafana in front of it [1][2]

Integrations:

  • Client libraries for Go, Java, Python, Ruby, Rust, and others for instrumenting your own applications [README][website]
  • Hundreds of exporters for databases (PostgreSQL, MySQL, Redis), infrastructure (node_exporter for Linux host metrics, blackbox_exporter for endpoint probing), Kubernetes components, and more [1][website]
  • Native support in Kubernetes, Istio, Envoy, and most CNCF projects
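The exposition format is plain text over HTTP, which is why exporters are so easy to write. As an illustration only, a toy endpoint in Python's standard library (real applications should use the official client libraries, which handle types, concurrency, and registries for you):

```python
# Toy /metrics endpoint -- illustration of the exposition format only.
# Real applications should use an official Prometheus client library.
from http.server import BaseHTTPRequestHandler, HTTPServer

REQUEST_COUNT = {"GET": 0}  # toy counter; a real app would track real work


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        REQUEST_COUNT["GET"] += 1
        body = (
            "# HELP app_requests_total Total requests served.\n"
            "# TYPE app_requests_total counter\n"
            f'app_requests_total{{method="GET"}} {REQUEST_COUNT["GET"]}\n'
        ).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


# To run standalone (then point a scrape job at localhost:9100):
#   HTTPServer(("127.0.0.1", 9100), MetricsHandler).serve_forever()
```

Anything that can emit those few lines of text can be scraped, which is most of the reason the exporter ecosystem got so large.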

What’s NOT included:

  • Long-term storage beyond the local TSDB retention window — you need Thanos, Cortex, or Mimir for that [4]
  • A polished UI — the built-in UI is for debugging queries, not for showing stakeholders [README]
  • Multi-tenancy — a Prometheus server is single-tenant by default [4]
  • Traces or logs — Prometheus is metrics-only. Combine with Jaeger/Tempo (traces) and Loki/Elasticsearch (logs) for full observability [1]

Pricing: SaaS vs self-hosted math

Self-hosted Prometheus:

  • Software: $0, Apache-2.0 [README]
  • Infrastructure: depends entirely on your metrics volume and retention window

Managed Prometheus services:

  • AWS Managed Service for Prometheus (AMP): ~$0.03 per million samples ingested, plus $0.03/GB/month for storage beyond default [3]
  • Google Cloud Managed Service for Prometheus: ~$0.03–$0.06 per million samples by region [3]
  • Grafana Cloud: starts at $29/month for 15,000 samples/sec, includes Grafana dashboards [3]

Concrete cost math from the DEV Community analysis [3]:

Scenario: 50,000 samples/sec (moderate production workload)

  Option                                       Monthly cost
  AWS AMP (managed)                            ~$3,918/month
  Self-managed (servers + storage + labor)     ~$2,480/month

The self-managed number breaks down as: 3 × r5.large EC2 instances ($750), 10TB storage ($230), and 15 hours/month of DevOps time at $100/hour (~$1,500) [3].
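The arithmetic is easy to sanity-check yourself. A quick sketch using the prices cited above (the small gap to the $3,918 figure is presumably storage and other line items):

```python
# Back-of-envelope reproduction of the cost scenario above.
# Prices are the approximate figures cited in the article, not quotes.

SAMPLES_PER_SEC = 50_000
SECONDS_PER_MONTH = 60 * 60 * 24 * 30            # ~30-day month

ingested_millions = SAMPLES_PER_SEC * SECONDS_PER_MONTH / 1e6
amp_ingestion = ingested_millions * 0.03          # ~$0.03 per million samples

self_managed = 750 + 230 + 15 * 100               # EC2 + storage + DevOps hours

print(f"AMP ingestion cost: ${amp_ingestion:,.0f}/month")  # ~$3,888 before storage
print(f"Self-managed:       ${self_managed:,.0f}/month")   # $2,480
```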

The gap inverts at smaller scales. For a startup scraping a handful of services at maybe 500 samples/sec, managed Prometheus (or Grafana Cloud’s free tier) costs nothing, while self-managed still demands the same minimum labor overhead. The crossover point where self-managed wins is roughly when your DevOps team is already there and the managed bill starts climbing past $500–1,000/month [3][4].

The honest math for a small engineering team:

  • 0–5,000 samples/sec: Grafana Cloud free tier or AWS AMP at negligible cost. Don’t self-manage yet — the operational overhead isn’t worth it.
  • 5,000–50,000 samples/sec: evaluate managed vs. self-managed based on whether you have a dedicated DevOps/SRE hire.
  • 50,000+ samples/sec with steady traffic: self-managed with Thanos or Mimir is almost certainly cheaper [3].

Deployment reality check

Getting a basic Prometheus instance running is genuinely simple. The Docker one-liner from the README works:

docker run --name prometheus -d -p 127.0.0.1:9090:9090 prom/prometheus

Precompiled binaries, Docker images on Quay.io and Docker Hub, and Helm charts for Kubernetes are all available [README]. For a single-server setup monitoring a handful of services, a technical founder can be running in under an hour.

Where it gets complicated:

Long-term storage. The default local TSDB only retains 15 days of data [4]. If you want historical metrics for capacity planning, compliance, or incident post-mortems, you need to add Thanos, Cortex, or Mimir — each of which is its own distributed system with its own operational complexity. Last9’s analysis is direct: “these solutions introduce additional complexity” and require teams to deploy and manage additional components [4].

Horizontal scaling. A single Prometheus server has a memory and storage ceiling. When you exceed it — more targets, higher scrape frequency, longer retention — you need to shard. Sharding means running multiple Prometheus servers and either federating them or fronting them with a query layer. The Last9 analysis identifies this as the primary pain point at scale: “sharding makes the infrastructure more complex, introducing much of that management overhead” [4].

Management overhead. A production Prometheus stack isn’t just Prometheus. It’s Prometheus + Alertmanager + Pushgateway (for batch jobs) + exporters per-service + Grafana + dashboards + alert rules. In a multi-environment setup (dev, staging, prod), you multiply that by three. Last9 estimates DevOps/SRE teams can spend “several hours every day managing and maintaining multiple servers and instances” instead of shipping features [4].

PromQL. TrustRadius reviewers consistently flag the learning curve [2]. Writing a correct rate alert that accounts for counter resets, histogram buckets, and label cardinality is a skill. Budget time for this.
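A few lines of PromQL show where the learning curve lives; the metric names below follow common conventions but are assumptions here:

```text
# Naive: the raw counter climbs forever and drops to zero on restart --
# alerting on its value directly is almost never what you want
http_requests_total{status="500"}

# Correct: rate() computes per-second increase over a window and
# compensates for counter resets automatically
rate(http_requests_total{status="500"}[5m])

# Histograms add another layer: 95th-percentile latency is derived
# from per-bucket counters, aggregated by the "le" bucket label
histogram_quantile(0.95,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
```

None of this is hard once learned, but none of it is guessable either.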

Realistic time estimates:

  • Basic instance monitoring a single server: 30 minutes for a technical person
  • Production setup with alerting, Grafana dashboards, and Kubernetes integration: 1–3 days
  • HA setup with long-term storage (Thanos/Mimir): 1–2 weeks of engineering time, ongoing maintenance

Pros and cons

Pros

  • The industry standard. Every tool in cloud-native infrastructure supports Prometheus natively. Switching costs for alternatives are high; switching to Prometheus from anything else is usually straightforward [1][2].
  • Apache-2.0 license. Genuinely open source — no “fair-code” restrictions, no commercial use limits [README]. You can embed it, redistribute it, build products on top of it.
  • Pull model is operationally cleaner. Prometheus knows when targets are down. You don’t need to configure every service with a push destination. Service discovery handles dynamic environments automatically [1][README].
  • Cost-effective at medium scale. Self-hosted on existing infrastructure is effectively free software. At 50,000 samples/sec, self-managed saves roughly $1,400/month vs. managed services — and that gap widens at higher volumes [3].
  • Grafana integration. The Prometheus + Grafana pairing is the most widely documented, most template-rich monitoring stack in existence. Whatever you want to dashboard, someone has already built the Grafana template [1][2].
  • 63,198 GitHub stars. A community this large means exporters for everything, tutorials for every problem, and enough StackOverflow answers that you’re rarely debugging blindly.
  • CNCF governance. Vendor-neutral. No single company controls the roadmap or can change the licensing [docs][README].

Cons

  • 15-day default retention. Not a footnote — this bites teams who don’t plan for it. If you want six months of metrics, you’re adding Thanos or Mimir from day one [4].
  • Horizontal scaling is painful. A single Prometheus server has a ceiling. Sharding introduces complexity that managed services abstract away. The Last9 piece describes this as a fundamental challenge: “you’ll need to scale out to balance the workload” with sharding that “makes querying and troubleshooting more difficult” [4].
  • PromQL learning curve. It’s powerful. It’s also not SQL. TrustRadius reviewers specifically flag it as a barrier [2]. Non-engineers don’t interact with it — they interact with Grafana dashboards that an engineer built.
  • Metrics only. Prometheus solves one problem. Traces require Jaeger or Tempo. Logs require Loki or Elasticsearch. Full observability is a multi-tool stack [1].
  • No built-in multi-tenancy. One Prometheus installation is one tenant. If you’re running a platform with multiple customers, you need a more complex architecture [4].
  • Community support only. Open-source means no vendor support SLA. You’re relying on documentation, GitHub issues, and community forums [2]. For some organizations, this is a blocker.
  • High-volume managed costs scale sharply. At 50,000 samples/sec, AWS AMP runs nearly $4,000/month [3]. If you’re not ready to self-manage, monitoring costs can grow faster than application costs.

Who should use this / who shouldn’t

Use Prometheus if:

  • You’re running Kubernetes or container workloads and want the tool the entire ecosystem was built around.
  • Your team has at least one engineer comfortable with Linux, Docker, and YAML — someone who won’t be stopped by configuring a scrape job.
  • You’re at medium-to-large scale (50,000+ samples/sec) and the managed service math looks painful.
  • You need Apache-2.0 licensing for compliance or redistribution reasons.
  • You want the widest possible exporter coverage — if there’s a system you need to monitor, there’s almost certainly a Prometheus exporter for it [1].

Skip self-hosted, use managed Prometheus (AWS AMP or Grafana Cloud) if:

  • You’re a small team at early stage — the operational overhead of self-managing Prometheus isn’t worth it until you have a dedicated DevOps hire.
  • Your workload is unpredictable (viral traffic spikes, event-driven systems) — managed services scale automatically [3].
  • You need guaranteed uptime for your monitoring without building HA infrastructure yourself.

Skip Prometheus entirely if:

  • You’re a non-technical founder who needs a dashboard without writing PromQL or configuring scrape jobs. Look at Datadog, Better Uptime, or a managed APM tool instead.
  • You need unified metrics + traces + logs in one query interface without stitching tools together. SigNoz (open-source) or Datadog cover this [1].
  • Your compliance requirements restrict data residency in ways that self-hosted Prometheus complicates and managed Prometheus on your chosen cloud doesn’t solve.

Alternatives worth considering

  • Grafana Mimir — horizontally scalable, multi-tenant Prometheus-compatible backend. If you’re hitting Prometheus scaling limits, Mimir is the self-hosted answer. It’s what Grafana Cloud runs under the hood.
  • Thanos — adds long-term storage, global query view, and HA to existing Prometheus deployments. Lower overhead than Mimir for smaller teams. More popular for “we need retention beyond 15 days” use cases [4].
  • VictoriaMetrics — drop-in Prometheus-compatible storage that’s more resource-efficient than Prometheus’s own TSDB. Worth evaluating if you’re storing high cardinality metrics and hitting memory pressure.
  • SigNoz — open-source alternative that combines metrics (Prometheus-compatible), traces, and logs in one UI [1]. Relevant if you want unified observability without running separate stacks.
  • Datadog — full-stack managed observability. Expensive at scale (pricing based on hosts + metrics + logs), but genuinely unified and non-technical-user-friendly. The benchmark for what managed observability looks like when money isn’t the constraint.
  • Grafana Cloud — managed Prometheus + Loki + Tempo + Grafana in one product. Free tier is generous. Worth considering before self-hosting for teams under 15,000 samples/sec [3].
  • AWS CloudWatch / Google Cloud Monitoring — if your infrastructure is entirely on one cloud, the native monitoring may be good enough and operationally simpler. Becomes expensive and limiting fast once you cross cloud boundaries.

Bottom line

Prometheus is not a startup gamble — it’s the monitoring standard that half the cloud-native world runs on. The Apache-2.0 license, the pull architecture, and the CNCF governance are real advantages. For teams already running Kubernetes, adopting Prometheus is less a choice than an inevitability; the ecosystem assumes it. The trade-offs are also real: 15-day default retention, PromQL’s learning curve, and horizontal scaling complexity that sends teams to Thanos or Mimir before they’re ready. Self-hosting saves money at scale — meaningfully so past 50,000 samples/sec — but it costs DevOps time that not every team has. For small teams, start with Grafana Cloud’s free tier before committing to the operational weight of self-managed infrastructure. For teams with engineering capacity and growing monitoring bills, self-hosted Prometheus with Thanos for retention is among the most defensible infrastructure bets you can make.

If configuring scrape jobs and retention policies is the blocker, that’s exactly what upready.dev sets up for clients. One engagement, done, you own the stack.


Sources

  1. SigNoz, “What is the Advantage of Prometheus - Top 5 Advantages in 2026”. https://signoz.io/guides/what-is-the-advantage-of-prometheus/
  2. TrustRadius, “Prometheus Reviews & Ratings 2026” (112 reviews, 8.0/10). https://www.trustradius.com/products/prometheus/reviews
  3. DEV Community (Binyam), “Hosted Prometheus vs. Self-Managed: A Neutral Guide to Costs, Control, and Trade-offs”. https://dev.to/binyam/hosted-prometheus-vs-self-managed-a-neutral-guide-to-costs-control-and-trade-offs-3j6n
  4. Last9, “Self-managed Prometheus vs Managed Prometheus” (Jan 4, 2023). https://last9.io/blog/self-managed-prometheus-vs-managed-prometheus/
