Observability for Startups: Logs, Metrics, and Traces Without Breaking the Bank

By Marc Molas·June 4, 2024·10 min read

If you're finding out about production issues from your users, you don't have observability — you have a support channel.

This is more common than you'd think. The app goes down, a customer fires off an angry email, the team starts digging through logs in the AWS console, someone says "works on my machine," and two hours later somebody finally finds the problem. Sound familiar?

Observability isn't a luxury reserved for large companies with dedicated SRE teams. It's the ability to understand what's happening in your system without guessing. And at a startup, where every incident can cost you users (and your next funding round), it's an investment that pays for itself from day one.

The three pillars of observability

There's a classic framework that breaks observability into three pillars: logs, metrics, and traces. They're not interchangeable — each one answers a different question.

  • Logs: What happened? They're records of discrete events. A user tried to log in, a transaction failed, a service restarted.
  • Metrics: How much and how fast? They're numerical data aggregated over time. Average latency, error rate, CPU usage.
  • Traces: Where in the system? They show the path a request takes through multiple services. The request entered through the API gateway, hit the auth service, queried the database, called the payment service.

You don't need all three from day one. But you do need to understand when to introduce each one.

Logs: the foundation of everything

Every system generates logs. The difference is whether you generate them in a useful way or just console.log("error here") and hope for the best.

Structured logging is the first thing you should implement. Instead of loose strings, generate logs in JSON format with consistent fields: timestamp, level, service, message, request ID, user. This lets you search, filter, and aggregate automatically.

{
  "timestamp": "2024-06-04T10:23:45Z",
  "level": "error",
  "service": "payment-api",
  "message": "Payment processing failed",
  "requestId": "abc-123",
  "userId": "user-456",
  "errorCode": "GATEWAY_TIMEOUT"
}
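
One way to produce entries shaped like this is a small formatter on top of Python's standard logging module. This is a minimal sketch, not a prescribed setup: the JsonFormatter class, the build_logger helper, and the choice of which extra fields to copy are assumptions made for illustration.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service  # assumed service name, e.g. "payment-api"

    def format(self, record: logging.LogRecord) -> str:
        timestamp = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": timestamp.isoformat(timespec="seconds").replace("+00:00", "Z"),
            "level": record.levelname.lower(),
            "service": self.service,
            "message": record.getMessage(),
        }
        # Fields passed through `extra=` become attributes on the record.
        for key in ("requestId", "userId", "errorCode"):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)


def build_logger(service: str) -> logging.Logger:
    """Hypothetical helper: one logger per service, JSON lines on stderr."""
    logger = logging.getLogger(service)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter(service))
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


logger = build_logger("payment-api")
logger.error(
    "Payment processing failed",
    extra={"requestId": "abc-123", "userId": "user-456", "errorCode": "GATEWAY_TIMEOUT"},
)
```

Because every entry carries the same keys, a log backend can filter on level, service, or requestId without parsing free text.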

Log levels should mean something. If everything is INFO or everything is ERROR, you don't have levels — you have noise. Define a clear convention: DEBUG for development, INFO for normal flows, WARN for recoverable situations, ERROR for failures that need attention.
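
As a rough illustration of that convention, here is a sketch with Python's standard logging module. The logger name, the messages, and the ListHandler class are made up for the example; the point is only that a threshold of INFO silently drops DEBUG records.

```python
import logging


class ListHandler(logging.Handler):
    """Hypothetical handler that keeps records in memory so we can inspect them."""

    def __init__(self) -> None:
        super().__init__()
        self.records = []

    def emit(self, record: logging.LogRecord) -> None:
        self.records.append(record)


logger = logging.getLogger("checkout")  # assumed service name
logger.setLevel(logging.INFO)           # threshold: INFO and above
logger.propagate = False
handler = ListHandler()
logger.addHandler(handler)

logger.debug("cart contents: %s", ["sku-1"])    # DEBUG: development detail, dropped
logger.info("order placed")                     # INFO: normal flow
logger.warning("payment retry 1 of 3")          # WARN: recoverable situation
logger.error("payment failed after 3 retries")  # ERROR: failure that needs attention

kept_levels = [record.levelname for record in handler.records]
print(kept_levels)
```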

Centralize your logs. If you have to SSH into three different servers to investigate a problem, you've already lost. Options by budget:

  • Minimal budget: Loki + Grafana (open source, lightweight, perfect to start with)
  • Cloud native: CloudWatch Logs (AWS), Cloud Logging (GCP)
  • All-in-one: Datadog, New Relic (pricier, but more complete)
  • Classic: ELK Stack (Elasticsearch, Logstash, Kibana) — powerful but requires maintenance

Metrics: the 4 golden signals

Google's Site Reliability Engineering book defines four golden signals of monitoring, and they're still the best starting point:

  1. Latency: How long does your service take to respond? Not just the average — the p95 and p99 percentiles are what matter. If your average latency is 200ms but the p99 is 5 seconds, you have a problem the average is hiding.

  2. Traffic: How many requests are you receiving? This helps you understand usage patterns and detect anomalies (unexpected spikes or sudden drops).

  3. Errors: What percentage of requests are failing? Count both explicit errors (HTTP 5xx) and implicit ones (responses that report success but carry wrong data, such as an HTTP 200 with the wrong content).

  4. Saturation: How full is your system? CPU, memory, disk, database connections. When something saturates, everything else starts degrading.
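
To see how an average can hide a slow tail, here is a small worked example. The latency sample is invented, and the percentile function uses the nearest-rank definition, which is one of several common ways to compute percentiles; monitoring tools may interpolate differently.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical sample: 98 fast requests and 2 very slow ones, in milliseconds.
latencies_ms = [100] * 98 + [5000] * 2

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)

print(f"mean={mean_ms:.0f} ms  p50={p50} ms  p95={p95} ms  p99={p99} ms")
```

With this sample the mean comes out at 198 ms, which looks healthy, while the p99 is 5000 ms. Note that even the p95 is still 100 ms here, which is why it helps to watch more than one percentile.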

There's an important distinction between application metrics and infrastructure metrics. Infrastructure metrics (CPU, memory, disk) tell you that something is wrong. Application metrics (requests per second, error rate, latency per endpoint) tell you what is wrong. You need both.

Tools:

  • Prometheus + Grafana: The open source standard. Prometheus collects and stores metrics, Grafana visualizes them. Free, powerful, and backed by a massive ecosystem.
  • Datadog: Metrics, logs, and traces in a single platform. Excellent but expensive — watch the bill.
  • New Relic: Similar to Datadog. Has a decent free tier to get started.

Traces: when one service isn't enough

If you have a simple monolith, you probably don't need distributed tracing yet. But as soon as you have two or more services communicating with each other, traces become essential.

A distributed trace shows you the complete path of a request through your system. You see exactly where time is being spent, where it fails, and which service is causing the bottleneck.

OpenTelemetry has become the standard for instrumentation. It's open source, vendor-neutral, and supports logs, metrics, and traces. If you're going to invest time instrumenting your code, do it with OpenTelemetry — that way you're not locked into any specific vendor.

Tools:

  • Jaeger: Open source, created by Uber. Perfect for getting started with distributed tracing.
  • Zipkin: Another open source option, simpler than Jaeger.
  • Datadog APM / New Relic: If you're already using their platform for metrics, adding traces is a natural next step.

The startup-friendly stack

Don't try to implement everything at once. Here's a sensible progression:

Phase 1 — The minimum viable setup (from day 1):

  • Structured logging in JSON
  • Centralized logs (Loki + Grafana or CloudWatch)
  • Basic infrastructure metrics (whatever your cloud provider gives you for free)
  • Alerts on HTTP 5xx errors and latency

Phase 2 — Maturation (once you have real users):

  • Application metrics with Prometheus + Grafana
  • Per-service dashboards with the 4 golden signals
  • More sophisticated alerts with severity levels

Phase 3 — Distribution (once you have multiple services):

  • OpenTelemetry for instrumentation
  • Distributed traces with Jaeger or your preferred APM
  • Correlation between logs, metrics, and traces (the request ID is the key)
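
For the correlation step, the usual trick is to attach the request ID to every log line automatically instead of passing it around by hand. Here is a minimal sketch with Python's contextvars and a logging filter. The logger names, the ListHandler class, and the lines_for helper are hypothetical, and a real setup would ship the records to your log backend rather than keep them in a list.

```python
import logging
from contextvars import ContextVar

request_id_var = ContextVar("request_id", default="-")


class RequestIdFilter(logging.Filter):
    """Stamp every record with the request ID of the current context."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.requestId = request_id_var.get()
        return True


class ListHandler(logging.Handler):
    """Hypothetical in-memory sink standing in for a log backend."""

    def __init__(self) -> None:
        super().__init__()
        self.records = []

    def emit(self, record: logging.LogRecord) -> None:
        self.records.append(record)


app_log = logging.getLogger("app")
app_log.setLevel(logging.INFO)
app_log.propagate = False
handler = ListHandler()
handler.addFilter(RequestIdFilter())
app_log.addHandler(handler)

api_log = logging.getLogger("app.api")  # child loggers propagate to "app"
db_log = logging.getLogger("app.db")


def query_orders() -> None:
    db_log.info("orders query took 42 ms")  # no request ID passed explicitly


def handle_request(request_id: str) -> None:
    token = request_id_var.set(request_id)
    try:
        api_log.info("GET /orders")
        query_orders()
    finally:
        request_id_var.reset(token)


def lines_for(request_id: str):
    """Everything logged while serving one request, across all loggers."""
    return [f"{r.name}: {r.getMessage()}" for r in handler.records if r.requestId == request_id]


handle_request("abc-123")
handle_request("def-456")
print(lines_for("abc-123"))
```

If the same ID is also recorded on the trace, one search by request ID pulls up the log lines and the trace together.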

Cost optimization: don't go broke monitoring

Observability can get expensive fast. Datadog is excellent until you see the invoice. Here are some strategies:

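  • Trace sampling: You don't need to trace 100% of requests. Sampling 10-20% is usually enough to detect problems. Always keep traces for requests that end in errors, which means making the keep-or-drop decision after the response is known (tail-based sampling) rather than when the request starts.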
  • Retention policies: Do you need logs from 6 months ago? Probably not for day-to-day debugging. Set short retention for detailed data and long retention for aggregated metrics.
  • Log levels in production: Don't ship DEBUG-level logs from production. That's expensive noise. INFO and above is usually sufficient.
  • Smart alerting: Every alert that isn't actionable is a cost — not just financially, but in team attention.
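
The sampling rule from the first bullet can be sketched in a few lines. This is a simplified illustration, not how any particular tracing backend implements it: the 10% rate, the should_keep_trace function, and the traffic mix are assumptions, and the decision is made once the status code is known so that failed requests can always be kept.

```python
import random

SAMPLE_RATE = 0.10  # assumed rate for successful requests


def should_keep_trace(status_code: int, rng: random.Random) -> bool:
    """Decide whether a finished request's trace gets stored."""
    if status_code >= 500:
        return True  # always keep traces for failed requests
    return rng.random() < SAMPLE_RATE


rng = random.Random(42)  # seeded only to make the demo repeatable
statuses = [200] * 9_900 + [500] * 100  # hypothetical traffic mix
kept = [s for s in statuses if should_keep_trace(s, rng)]
kept_errors = sum(1 for s in kept if s >= 500)
kept_ok = len(kept) - kept_errors

print(f"kept {kept_ok} of 9900 successful traces and {kept_errors} of 100 failed ones")
```

Storing roughly a tenth of the successful traces cuts trace volume by about 90% while every failure stays available for debugging.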

Alerting: the art of avoiding fatigue

A bad alerting strategy is worse than having no alerts at all. If your team gets 50 notifications a day, they'll learn to ignore all of them — including the one that matters.

Alert on symptoms, not causes. Alert when the error rate climbs above 1%, not when CPU hits 80%. CPU at 80% might be normal under load; a 5% error rate never is.
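
A symptom-based check like this is easy to prototype. The sketch below keeps a sliding window of recent request outcomes and fires when the error rate passes 1%. The ErrorRateAlert class, the window size, and the minimum request count are illustrative assumptions; in practice you would usually express the same rule in your monitoring tool's alert configuration.

```python
from collections import deque


class ErrorRateAlert:
    """Fire when the share of failed requests in a sliding window exceeds a threshold."""

    def __init__(self, threshold: float = 0.01, window: int = 1000, min_requests: int = 100) -> None:
        self.threshold = threshold
        self.min_requests = min_requests  # avoid firing on a handful of requests
        self.outcomes = deque(maxlen=window)  # True means the request failed

    def record(self, status_code: int) -> None:
        self.outcomes.append(status_code >= 500)

    def error_rate(self) -> float:
        if not self.outcomes:
            return 0.0
        return sum(self.outcomes) / len(self.outcomes)

    def firing(self) -> bool:
        enough_data = len(self.outcomes) >= self.min_requests
        return enough_data and self.error_rate() > self.threshold


alert = ErrorRateAlert()
for _ in range(995):
    alert.record(200)
for _ in range(5):
    alert.record(500)
print(f"error rate {alert.error_rate():.1%}, firing={alert.firing()}")

for _ in range(20):
    alert.record(503)  # a burst of failures pushes older successes out of the window
print(f"error rate {alert.error_rate():.1%}, firing={alert.firing()}")
```

The min_requests guard is a design choice: without it, one failure among the first three requests after a deploy would count as a 33% error rate and wake somebody up.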

Every alert needs a runbook. If someone gets an alert at 3 AM, they should know exactly what to look at first. Without a runbook, it's panic + SSH + "what is this."

Classify by severity. Not everything is urgent. An intermittent error on a rarely used endpoint can wait until morning. A payment service outage cannot.
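
One lightweight way to enforce both rules, a runbook on every alert and routing by severity, is to make them part of the alert definition itself. The sketch below is only an illustration: the Severity levels, the destination names, and the example.com runbook URLs are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    CRITICAL = "critical"  # wake someone up now
    WARNING = "warning"    # can wait until morning


@dataclass(frozen=True)
class Alert:
    name: str
    severity: Severity
    runbook_url: str


def route(alert: Alert) -> str:
    """Return the (hypothetical) destination for an alert, refusing alerts without a runbook."""
    if not alert.runbook_url:
        raise ValueError(f"alert {alert.name!r} has no runbook")
    if alert.severity is Severity.CRITICAL:
        return "page-on-call"
    return "ticket-queue"


payment_outage = Alert(
    name="payment-service-down",
    severity=Severity.CRITICAL,
    runbook_url="https://runbooks.example.com/payment-service-down",  # placeholder URL
)
flaky_endpoint = Alert(
    name="export-endpoint-intermittent-errors",
    severity=Severity.WARNING,
    runbook_url="https://runbooks.example.com/export-endpoint",  # placeholder URL
)

print(route(payment_outage))
print(route(flaky_endpoint))
```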

Observability as a culture

Observability isn't just tools — it's an engineering mindset. Teams that adopt it from the start debug faster, deploy with more confidence, and sleep better.

At Conectia, the senior engineers we embed in your team build observable systems by default. Not as an add-on after the third production outage, but as a fundamental part of the architecture. Because when your system grows and complexity increases, the difference between a team that knows what's happening in production and one that's guessing is the difference between scaling and surviving.

