
How Do You Structure Error Handling Across Service Boundaries?
Why Does Error Handling Fall Apart When Services Talk to Each Other?
You have try-catch blocks in place. Your unit tests pass. Everything looks solid until Service A calls Service B—and suddenly you're staring at a 500 error with no context, a swallowed exception that never reached your logs, or worse, a cascade failure that took down three microservices because one database connection timed out. Distributed error handling isn't just about catching exceptions; it's about preserving context, communicating intent, and preventing failures from spreading like a virus through your architecture. When services don't share a process—or even a programming language—your standard stack traces become useless and your usual assumptions about control flow break down completely.
The real problem isn't that developers don't know how to write try-catch blocks. It's that error handling strategies designed for monolithic applications don't translate to distributed systems. In a single codebase, you can throw an exception and trust it'll bubble up to a global handler. Across HTTP boundaries, that exception becomes a JSON payload with a status code—and what you include (or exclude) in that payload determines whether the calling service can recover gracefully or flails blindly into retry loops that make everything worse.
What Information Should Travel with an Error Across Service Boundaries?
HTTP status codes are a start, but they're woefully inadequate for operational debugging. A 500 tells you something broke; it doesn't tell you what broke or whether retrying makes sense. Your error responses need to carry structured data that downstream services can act on programmatically while still providing human-readable context for your on-call engineer.
Here's what belongs in every cross-service error response: a machine-readable error code (something like PAYMENT_GATEWAY_TIMEOUT rather than just 500), a correlation ID that ties this error back to the original request across your entire call chain, and a severity indicator that tells the caller whether this is a transient glitch worth retrying or a permanent failure that needs manual intervention. Services like Stripe's API demonstrate this pattern well—their error objects include a type, a specific code, a human message, and often a parameter path when validation fails. This structure lets calling code make intelligent decisions without parsing strings.
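As a concrete sketch, here is what assembling such a payload might look like. The function name, field names, and the PAYMENT_GATEWAY_TIMEOUT code are illustrative assumptions, not a prescribed schema:

```python
import json
import uuid

def build_error_response(code, message, severity, correlation_id=None, param=None):
    """Assemble a structured cross-service error payload.

    `severity` is "transient" (safe to retry) or "permanent"
    (needs manual intervention) — this drives the caller's retry decision.
    """
    payload = {
        "error": {
            "code": code,                # machine-readable, e.g. PAYMENT_GATEWAY_TIMEOUT
            "message": message,          # human-readable context for the on-call engineer
            "severity": severity,
            # Reuse the inbound correlation ID when one exists; mint one otherwise.
            "correlation_id": correlation_id or str(uuid.uuid4()),
        }
    }
    if param is not None:
        payload["error"]["param"] = param  # which field failed, for validation errors
    return payload

resp = build_error_response(
    "PAYMENT_GATEWAY_TIMEOUT",
    "Upstream gateway did not respond within 5s",
    severity="transient",
    correlation_id="req-8f2a",
)
print(json.dumps(resp, indent=2))
```

The point is that every field exists for a consumer: the code for branching logic, the severity for retry decisions, the correlation ID for tracing, and the message for humans.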
Your error payloads should also distinguish between client errors (4xx) where the caller needs to fix something and infrastructure errors (5xx) where the service itself is struggling. This distinction matters because it determines your retry strategy. A 400 Bad Request won't magically fix itself on the third attempt, but a 503 Service Unavailable might resolve once your load balancer shifts traffic. Consider implementing the RFC 7807 Problem Details standard (since updated as RFC 9457) for HTTP APIs—it provides a consistent, extensible format for error responses that plays nicely with existing HTTP infrastructure.
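A minimal sketch of that retry decision, alongside an example Problem Details document (the `type` URI is a hypothetical placeholder, and treating 429 as retryable is an assumption you'd tune to your APIs):

```python
# An RFC 7807 / RFC 9457 Problem Details document looks like this:
problem = {
    "type": "https://example.com/errors/service-unavailable",  # hypothetical URI
    "title": "Service Unavailable",
    "status": 503,
    "detail": "Connection pool exhausted; retry after backoff",
}

def should_retry(status: int) -> bool:
    """Decide retry eligibility from the HTTP status class."""
    if 400 <= status < 500:
        return status == 429      # client errors don't fix themselves, except rate limits
    return status in (502, 503, 504)  # transient infrastructure trouble is worth a retry

print(should_retry(problem["status"]))  # a 503 is a retry candidate
```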
How Do You Prevent Error Cascades from Taking Everything Down?
A single slow database query shouldn't turn into a distributed denial of service against your own services. Yet that's exactly what happens when every service in a call chain waits synchronously for timeouts to expire, each layer adding its own delay while holding connections open. The result is a traffic jam that starts at one bottleneck and ripples outward, turning a localized problem into a system-wide outage.
Circuit breakers are your first line of defense. After a service detects repeated failures from a downstream dependency, it stops trying—returning a cached response, a degraded experience, or a clear error immediately rather than waiting for timeouts that will never succeed. This gives the struggling service room to recover without drowning in queued requests. Libraries like Netflix's Hystrix (now in maintenance mode, with resilience4j as its recommended successor) or Polly for .NET make circuit breakers straightforward to implement. The key is setting appropriate thresholds—too sensitive and you'll degrade service unnecessarily; too lenient and you'll let failures cascade.
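To make the mechanism concrete, here is a minimal circuit breaker sketch—consecutive-failure counting with a time-based half-open probe. Production libraries add rolling windows, metrics, and thread safety; the class name and thresholds here are illustrative:

```python
import time

class CircuitBreaker:
    """Opens after `threshold` consecutive failures, then fails fast
    until `reset_after` seconds pass, when one probe call is allowed."""

    def __init__(self, threshold=5, reset_after=30.0):
        self.threshold = threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None          # half-open: let one probe through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0                  # any success resets the count
        return result
```

Note how an open circuit raises immediately instead of invoking the dependency at all—that's what relieves pressure on the struggling service.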
Bulkheads provide isolation by limiting how many concurrent requests any single dependency can consume. Think of them like watertight compartments on a ship: if one service starts leaking memory or hanging connections, the bulkhead prevents it from draining resources the rest of your system needs. Combine this with timeouts set aggressively low—if your 99th percentile response time is 200ms, don't wait 30 seconds hoping the 1% case improves—and you've built resilient boundaries that fail fast instead of failing slowly and expensively.
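A bulkhead can be as simple as a bounded semaphore per dependency. This sketch rejects excess calls immediately rather than queueing them—one reasonable policy among several (queueing with a short timeout is another); the class name is an assumption:

```python
import threading

class Bulkhead:
    """Cap concurrent calls to one dependency; shed load when full
    instead of queueing requests that hold resources open."""

    def __init__(self, max_concurrent=10):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn, *args, **kwargs):
        if not self._slots.acquire(blocking=False):
            raise RuntimeError("bulkhead full: shedding load")
        try:
            return fn(*args, **kwargs)
        finally:
            self._slots.release()
```

With one bulkhead per downstream dependency, a hanging service can exhaust its own compartment's slots without starving calls to everything else.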
Where Should You Log Errors for Maximum Observability?
Every service logging the same exception creates noise without clarity. The calling service should log that it received an error and what it decided to do about it. The service where the error originated should log the full context—stack traces, request details, database query plans—everything needed for root cause analysis. Middle services in a chain? They mostly need correlation IDs and pass-through information unless they transform the error or make recovery decisions.
Structured logging is non-negotiable. JSON logs with consistent field names let you aggregate and query across services. Include the same correlation ID in every log line from a single request, add service identifiers so you know which component generated each line, and tag errors with the same machine-readable codes you're sending in API responses. When your pager goes off at 3 AM, you need to trace a user's action through five services without reading raw text logs on five different dashboards.
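A sketch of what that looks like with Python's standard logging module—the field names (`service`, `correlation_id`, `error_code`) are example conventions, not a standard:

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line with consistent field names."""

    def format(self, record):
        entry = {
            "level": record.levelname,
            "service": getattr(record, "service", "unknown"),
            "correlation_id": getattr(record, "correlation_id", None),
            "error_code": getattr(record, "error_code", None),
            "message": record.getMessage(),
        }
        return json.dumps(entry)

logger = logging.getLogger("checkout")
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Emits one JSON line carrying the same correlation ID and error code
# that went out in the API response.
logger.error(
    "payment authorization failed",
    extra={"service": "checkout", "correlation_id": "req-8f2a",
           "error_code": "PAYMENT_GATEWAY_TIMEOUT"},
)
```

Because every line is JSON with stable keys, your log aggregator can filter by `correlation_id` and reconstruct the whole call chain in one query.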
Consider your retry behavior carefully in your logging strategy. Retries are good for reliability but dangerous for debugging if you don't track them. Log when you're retrying, why (which error triggered it), and how many attempts you've made. This visibility prevents the confusion of seeing the same error timestamp dozens of times and wondering if you're in a loop or just handling legitimate transient failures.
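A minimal retry wrapper that makes each attempt visible, assuming exponential backoff (the function name and `log` callback are illustrative):

```python
import time

def retry_with_logging(fn, attempts=3, base_delay=0.1, log=print):
    """Retry a callable with exponential backoff, logging every failure
    so repeated errors in the logs read as deliberate retries, not a loop."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception as exc:
            log(f"attempt {attempt}/{attempts} failed: {exc!r}")
            if attempt == attempts:
                raise                      # exhausted: surface the final error
            time.sleep(base_delay * 2 ** (attempt - 1))
```

Each log line names the attempt number and the triggering error, which is exactly the context you need when the same timestamp shows up a dozen times.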
How Do You Handle Errors When Services Use Different Technologies?
Your Python service throws an exception with rich traceback information. Your calling JavaScript service receives an HTTP response and has to reverse-engineer what went wrong. Language boundaries strip away the rich error types you're used to working with locally. You can't catch a Python ValueError in TypeScript—you can only parse the response body and make educated guesses.
The solution is a shared error taxonomy that both sides understand. Define error categories in your API contracts—validation errors, not-found errors, authorization errors, transient failures—and map language-specific exceptions to these categories at service boundaries. A Go service might return sql.ErrNoRows while a Ruby service returns ActiveRecord::RecordNotFound, but both should translate to the same HTTP 404 with a consistent error code that calling services can handle uniformly.
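One way to implement that translation layer at a service boundary—the exception classes here are hypothetical stand-ins for whatever your language's libraries actually throw:

```python
# Stand-ins for language-specific errors (sql.ErrNoRows,
# ActiveRecord::RecordNotFound, etc.)
class RecordNotFound(Exception): pass
class ValidationFailed(Exception): pass
class UpstreamTimeout(Exception): pass

# Shared taxonomy from the API contract: (HTTP status, machine-readable code)
ERROR_TAXONOMY = {
    RecordNotFound:   (404, "RESOURCE_NOT_FOUND"),
    ValidationFailed: (400, "VALIDATION_ERROR"),
    UpstreamTimeout:  (503, "UPSTREAM_TIMEOUT"),
}

def to_boundary_error(exc):
    """Translate a language-specific exception into the shared contract.
    Unknown exceptions default to a generic 500 rather than leaking internals."""
    status, code = ERROR_TAXONOMY.get(type(exc), (500, "INTERNAL_ERROR"))
    return {"status": status, "error": {"code": code, "message": str(exc)}}
```

Every service applies its own mapping at the edge, so callers only ever see the shared vocabulary.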
Async communication—message queues, event streams—adds another layer of complexity. There's no waiting HTTP client to return an error code to, so failed message processing needs dead-letter queues and monitoring that surfaces problems without burying them in logs nobody reads. Design your consumers to be idempotent when possible; if processing fails midway and the message retries, you don't want partially applied state changes creating data corruption that hides the original error.
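A sketch of an idempotent consumer with a dead-letter hook. In production the processed-ID set and attempt counts would live in a shared store, not process memory; the function names and message shape are assumptions:

```python
def make_consumer(handler, dead_letter, max_attempts=3):
    """Wrap a message handler with idempotency (by message id) and a
    dead-letter hook for messages that keep failing."""
    processed = set()    # production: a shared store, not process memory
    attempts = {}

    def consume(message):
        msg_id = message["id"]
        if msg_id in processed:
            return "duplicate"              # already applied; safe to ack and skip
        try:
            handler(message)
        except Exception as exc:
            attempts[msg_id] = attempts.get(msg_id, 0) + 1
            if attempts[msg_id] >= max_attempts:
                dead_letter(message, exc)   # park it where monitoring can see it
                return "dead-lettered"
            raise                           # let the queue redeliver
        processed.add(msg_id)
        return "processed"

    return consume
```

Deduplicating by message ID means a redelivered message after a mid-flight crash is applied once, and the dead-letter hook keeps poison messages out of the retry loop without silently dropping them.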
The best error handling doesn't just prevent crashes—it creates visibility. Every error should answer: what happened, where, why, and what should happen next.
Building reliable distributed systems means accepting that failures are normal, not exceptional. Networks partition. Services restart. Databases hit connection limits. Your error handling strategy should assume these things will happen and design for graceful degradation rather than perfect availability. The services that survive production traffic aren't the ones that never fail—they're the ones that fail predictably, communicate clearly about what went wrong, and contain damage before it spreads. Start by auditing your current error responses: do they include enough information for automated recovery? Can you trace a single user action across all your services? If a dependency fails right now, will you degrade gracefully or cascade into an outage? The answers to those questions matter more than perfect uptime metrics ever will.
