Skip to main content

Command Palette

Search for a command to run...

Production-Ready Is Not a Feeling

The Baseline Every Service Needs

Updated
9 min read
Production-Ready Is Not a Feeling

"It works in staging" is not a deployment strategy.

Almost every engineering team has shipped something that technically passed CI, survived unit tests with over 80% coverage, and promptly became a 3am incident within its first week of real traffic. Not because anyone was careless, but because "working" and "production-ready" are genuinely different bars, and most teams never write down what the second one actually means.

Let's be honest, AI can help you implement most of what's in this post. What it can't do is tell you what order it matters in, or which gaps will hurt you first at 3am. That's the judgment this post is for.

This post is a concrete baseline of what every service needs before it earns the right to take production traffic — regardless of language, framework, or what it does. Sections are ordered by priority — the first four are non-negotiable before go-live, the last two are critical but can follow shortly after.


1. Observability: You Can't Fix What You Can't See

Observability is generally built on three pillars: logs, metrics, and traces (Charity Majors (CTO of Honeycomb) considers this traditional "three pillars" model to be outdated and destructive, but let's save that for another post).

Structured Logging

Logs tell you what happened and when, giving you the full event detail you need to understand a specific failure. But to make it context rich, every log line should be machine-readable JSON with a consistent schema. At minimum:

{
  "timestamp": "2025-01-15T09:23:11Z",
  "level": "error",
  "service": "payment-service",
  "trace_id": "abc123",
  "span_id": "def456",
  "event": "charge_failed",
  "reason": "upstream_timeout",
  "duration_ms": 3041
}

Log at meaningful boundaries: request ingress, egress, external call completions, and every handled error or fallback. The rule of thumb: if an on-call engineer would want to know it happened, log it.

Avoid freeform strings. "something went wrong with the payment" is useless under pressure. "charge_failed" with structured context is searchable, alertable, and aggregatable.

Metrics: The Four Golden Signals

Metrics tell you the shape of your system over time: trends, thresholds, and anomalies at a glance. A metric spike tells you something is wrong.

Track these four for every service, per endpoint:

Signal What it tells you
Latency How long requests take (p50, p95, p99 — not just average)
Error Rate % of requests returning errors
Traffic Request volume (requests/sec)
Saturation How close you are to capacity (CPU, memory, connection pools)

Average latency lies. A p99 of 4000ms hidden inside a p50 of 80ms means 1 in 100 users is having a terrible experience. Always track percentiles.

Distributed Tracing

Traces tell you where time was spent across a request's journey through your system.

If your service calls anything - a database, a cache, another service - instrument distributed tracing with W3C traceparent header (widely used example) propagation. Each external call becomes a child span. When latency spikes, you open one trace and immediately see which call is the culprit instead of correlating logs across three systems.

This is the single observability investment with the highest on-call ROI.


2. Alerting: Pages Should Mean Something

The most common alerting failure is alerting on the wrong thing, e.g. raw error counts, CPU thresholds, or anything that fires so frequently it gets tuned out.

Alert on SLO burn rates, not raw metrics.

Define a Service Level Objective (e.g., 99.5% of requests succeed within 500ms over a 30-day window) that reflects what actually matters to your customers — not an arbitrary 99.99% because it sounds right. It should not alert you even if your service is completely down while no real customers are hitting it.

Then alert on how fast you're consuming that error budget:

  • Fast burn (e.g., burning 5% of your monthly budget in an hour) → page immediately

  • Slow burn (e.g., elevated error rate that will breach SLO in 3 days) → create a ticket

This approach means every page represents a genuine threat to your SLO. Engineers stop ignoring alerts. On-call becomes survivable.

Every alert must link to a runbook. The runbook answers exactly:

  1. How do I confirm what's actually broken?

  2. How do I mitigate it right now?

  3. Who do I escalate to if I can't?

If your runbook doesn't answer those three, it's not a runbook — it's a note to yourself.


3. Security: Least Privilege by Default

Scoped credentials: each service authenticates to each dependency with dedicated credentials, not a shared master key.

Input validation: validate and sanitize all inputs at your service boundary before they touch any downstream system.

PII in logs: if your service handles personal data, ensure it's redacted or excluded before reaching your log pipeline.


4. Deployment: Ship Safely, Roll Back Fast

Health Checks

Expose at least two health check endpoints:

  • Process health → is the service running? (a /health returning 200 is enough)

  • Dependency health → can the service actually reach what it needs? (checks DB connectivity, cache reachability, critical upstream availability)

The distinction matters. A service that's running but can't reach its database should not receive traffic. Most health check implementations only cover the first case, which means your load balancer happily routes traffic to a service that will fail every request.

Whether you're behind a load balancer, a service mesh, or a reverse proxy, wire the dependency health check into whatever controls traffic routing.

Progressive Rollouts

Deploy via canary or blue/green. Tie automated rollback to your service's error rate, not just infrastructure health. A deployment that introduces a logic bug will look perfectly healthy to CPU and memory monitors while users are getting errors.

Disaster Recovery

Incidents will happen. The question is whether your team has a documented, tested recovery plan before they do. Know how to redeploy your service from scratch, where your backups are, and what the recovery sequence looks like. If your DR process has never been rehearsed, it doesn't exist.

5. Gatekeeper Testing

Without automated baseline validation on Day 1, the deployment strategy is just guessing. Before a service takes production traffic, it must pass two gates:

  • Core Unit Tests: You don't need a vanity metric like 100% total coverage, but you must have total coverage over core business logic, state transitions, and critical risk areas (like financial mutations or data transformations).

  • Smoke Tests: A lightweight post-deployment suite that executes in the environment immediately after a build drops, confirming network line-of-sight and basic handshakes with every integrated dependency.

  • Integration Tests: Test the full flow end-to-end in a realistic environment.


The above five are your go-live gates: if any are missing, you're not ready for production traffic. The following two are equally critical to long-term stability, but can realistically follow in the sprint after launch.


6. Resilience: Assume the Network Will Lie to You

Timeouts

Every external call needs an explicit timeout. Never rely on library or upstream defaults — they are almost always too long. Set timeouts based on your p99 baseline for that call, not on optimism.

A call that hangs indefinitely doesn't just affect that request. It ties up a thread or connection, which accumulates, which eventually exhausts your pool and takes down requests that have nothing to do with the slow dependency.

Retries with Jitter

Retries are only safe on idempotent operations. When you do retry, use exponential backoff with randomized jitter:

wait = base_delay * 2^attempt + random(0, max_jitter)

Without jitter, every instance of your service retries in lockstep — creating a synchronized spike that hits the recovering upstream all at once. Imagine your upstream just recovered from a crash and 50 instances of your service all retry at exactly the same moment. That wall of traffic can knock it back down before it stabilises. Jitter staggers the retries across time so the upstream sees a trickle instead.

Fallback Strategies

For every external dependency, decide explicitly: what happens when it's unavailable? Options include serving stale cached data, returning a degraded response with a flag, or failing the request entirely.

The worst answer is implicit behavior. Undefined fallbacks mean your service does something unpredictable under failure — which is the hardest kind of incident to debug at 3am.


7. Advanced Validation: Breaking the System to Prove It Scale

Load Tests & Resource Sizing

These only tell you something useful once you have a baseline: known peak traffic, a throughput SLA, or prior benchmark data. Without that, you're running numbers against a vacuum.

Once you have a baseline, load test against it and size your resources accordingly: connection pool limits, memory headroom for request buffering, thread counts, and horizontal scaling triggers. The common mistake is sizing for average traffic. Size for your realistic peak with headroom, and define your scaling trigger before you need it, not during an incident.


Self-Assessment

Go-live gates — all must be yes before shipping:

  • [ ] Are logs structured JSON with consistent trace_id and span_id?

  • [ ] Are you tracking p95 and p99 latency, not just averages?

  • [ ] Are you alerting on SLO burn rates, not raw error counts?

  • [ ] Does every alert link to a runbook with a mitigation step?

  • [ ] Is PII excluded from logs by design?

  • [ ] Are service credentials scoped per dependency?

  • [ ] Does your dependency health check gate traffic routing?

  • [ ] Can you roll back within 5 minutes if error rates spike post-deploy?

  • [ ] Has your disaster recovery plan been rehearsed, not just documented?

  • [ ] Do smoke tests run automatically after every deployment?

Ship these sprint two:

  • [ ] Does every external call have an explicit timeout?

  • [ ] Do retries use exponential backoff with jitter?

  • [ ] Is every fallback behavior explicitly defined?

  • [ ] Have you load tested against a realistic traffic baseline?

  • [ ] Have you verified resilience patterns trigger under chaos conditions?

Hopefully, this baseline helps shape a clear roadmap for your next service launch and saves your team from that next 3am page.