Tracing a Request Through a Distributed System
When something goes wrong in a distributed system, the hardest part is often not fixing it but understanding what happened. This post walks through a technique for tracing a single request across multiple services.
The problem
You have a user-facing API that fans out to several internal services. A request comes in, something fails, and the logs across five different services are a jumble of timestamps and IDs that don’t obviously connect.
The request flow looks like this:
flowchart LR
Client -->|HTTP| API["API Gateway"]
API -->|gRPC| Auth["Auth Service"]
API -->|gRPC| Catalog["Catalog Service"]
Catalog -->|SQL| DB[(Database)]
API -->|event| Queue[("Message Queue")]
Queue --> Worker["Background Worker"]
Propagating a trace ID
The simplest thing that works: generate a UUID at the edge and thread it through every hop.
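package middleware

import (
	"context"
	"net/http"

	"github.com/google/uuid"
)

// contextKey is a distinct type so the key cannot collide with
// plain string keys set by other packages.
type contextKey string

// TraceIDKey is the context key under which the trace ID is stored.
const TraceIDKey contextKey = "trace_id"

// TraceMiddleware reuses the incoming X-Trace-ID header or, if it is
// missing, generates a new UUID, then stores the ID in the request context.
func TraceMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		traceID := r.Header.Get("X-Trace-ID")
		if traceID == "" {
			traceID = uuid.New().String()
		}
		ctx := context.WithValue(r.Context(), TraceIDKey, traceID)
		// Echo the ID in the response so the caller can see it.
		w.Header().Set("X-Trace-ID", traceID)
		next.ServeHTTP(w, r.WithContext(ctx))
	})
}

// TraceIDFromContext returns the trace ID stored in ctx, or "" if none is set.
func TraceIDFromContext(ctx context.Context) string {
	if id, ok := ctx.Value(TraceIDKey).(string); ok {
		return id
	}
	return ""
}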
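The middleware above covers the inbound side only. To thread the ID through every hop, outgoing calls have to carry it as well. One way to do that for HTTP calls is a wrapping `http.RoundTripper`, sketched below. This is a minimal illustration, not the post's own implementation: `traceTransport`, `roundTripFunc`, `headerSeenDownstream`, and the URL `http://catalog.internal/products/42` are hypothetical names, the downstream service is replaced by an in-process stub, and `traceIDKey` mirrors `TraceIDKey` locally so the example is self-contained.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
)

type contextKey string

// traceIDKey mirrors middleware.TraceIDKey so this sketch compiles on its own.
const traceIDKey contextKey = "trace_id"

// traceTransport (hypothetical name) copies the trace ID from the request
// context into the X-Trace-ID header before delegating to base.
type traceTransport struct {
	base http.RoundTripper
}

func (t traceTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	if id, ok := req.Context().Value(traceIDKey).(string); ok && id != "" {
		// A RoundTripper should not modify the caller's request, so clone it first.
		req = req.Clone(req.Context())
		req.Header.Set("X-Trace-ID", id)
	}
	return t.base.RoundTrip(req)
}

// roundTripFunc adapts a function to http.RoundTripper. It stands in for
// the real network in this example.
type roundTripFunc func(*http.Request) (*http.Response, error)

func (f roundTripFunc) RoundTrip(r *http.Request) (*http.Response, error) { return f(r) }

// headerSeenDownstream issues a request whose context carries traceID and
// returns the X-Trace-ID header value that reached the stubbed downstream.
func headerSeenDownstream(traceID string) (string, error) {
	var seen string
	stub := roundTripFunc(func(r *http.Request) (*http.Response, error) {
		seen = r.Header.Get("X-Trace-ID")
		return &http.Response{
			StatusCode: http.StatusOK,
			Header:     http.Header{},
			Body:       http.NoBody,
			Request:    r,
		}, nil
	})

	client := &http.Client{Transport: traceTransport{base: stub}}
	ctx := context.WithValue(context.Background(), traceIDKey, traceID)
	// The URL is made up; the stub answers without touching the network.
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, "http://catalog.internal/products/42", nil)
	if err != nil {
		return "", err
	}
	resp, err := client.Do(req)
	if err != nil {
		return "", err
	}
	resp.Body.Close()
	return seen, nil
}

func main() {
	seen, err := headerSeenDownstream("abc-123")
	if err != nil {
		panic(err)
	}
	fmt.Println("downstream saw X-Trace-ID:", seen)
}
```

In a real service, `base` would typically be `http.DefaultTransport`, and the same idea applies to the gRPC and queue hops using request metadata or message headers instead of an HTTP header.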
Every service logs with this ID:
log.Info("request received",
	"trace_id", middleware.TraceIDFromContext(ctx),
	"service", "catalog",
	"method", "GetProduct",
)
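Passing `trace_id` by hand on every call is easy to forget. One option, assuming Go 1.21+ and the standard `log/slog` package (the post does not say which logger it uses), is to derive a per-request logger once and reuse it. The helper names `loggerFor` and `logOnce` are hypothetical, and `traceIDKey` again mirrors `TraceIDKey` locally.

```go
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"log/slog"
)

type contextKey string

// traceIDKey mirrors middleware.TraceIDKey so this sketch compiles on its own.
const traceIDKey contextKey = "trace_id"

// loggerFor (hypothetical helper) returns a logger that stamps every record
// with the service name and the trace ID found in ctx.
func loggerFor(ctx context.Context, base *slog.Logger, service string) *slog.Logger {
	id, _ := ctx.Value(traceIDKey).(string)
	return base.With("trace_id", id, "service", service)
}

// logOnce writes a single JSON record to a buffer and returns its decoded
// fields, which makes the output easy to inspect.
func logOnce(traceID string) (map[string]any, error) {
	var buf bytes.Buffer
	base := slog.New(slog.NewJSONHandler(&buf, nil))

	ctx := context.WithValue(context.Background(), traceIDKey, traceID)
	logger := loggerFor(ctx, base, "catalog")
	logger.Info("request received", "method", "GetProduct")

	fields := map[string]any{}
	err := json.Unmarshal(buf.Bytes(), &fields)
	return fields, err
}

func main() {
	fields, err := logOnce("abc-123")
	if err != nil {
		panic(err)
	}
	fmt.Println(fields["trace_id"], fields["service"], fields["method"])
}
```

Because the attributes are attached with `With`, every later call on the derived logger carries the same `trace_id` and `service` fields without repeating them.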
Querying across services
With structured logs and a consistent trace_id field, you can pull the full story:
# Loki / LogQL
{app=~"api|auth|catalog|worker"} | json | trace_id = "abc-123"
# or with grep if you're old-fashioned
grep "abc-123" /var/log/services/*.log | sort -t'T' -k2
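If the logs are newline-delimited JSON, the same filter-and-sort step can be sketched in Go. This is an illustration under assumptions: the field names `time`, `msg`, `trace_id`, and `service` match what `log/slog`'s JSON handler would emit with the attributes used earlier, the timestamps are RFC 3339, and the sample log lines are made up for the example.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
	"strings"
	"time"
)

// entry holds the fields this sketch needs from each log line.
type entry struct {
	Time    time.Time `json:"time"`
	TraceID string    `json:"trace_id"`
	Service string    `json:"service"`
	Msg     string    `json:"msg"`
}

// sampleLogs is invented data standing in for lines merged from several services.
const sampleLogs = `{"time":"2024-05-01T10:00:00.300Z","trace_id":"abc-123","service":"catalog","msg":"query done"}
{"time":"2024-05-01T10:00:00.100Z","trace_id":"abc-123","service":"api","msg":"request received"}
{"time":"2024-05-01T10:00:00.150Z","trace_id":"zzz-999","service":"auth","msg":"token checked"}
{"time":"2024-05-01T10:00:00.200Z","trace_id":"abc-123","service":"auth","msg":"token checked"}`

// timeline keeps the lines that belong to traceID and orders them by timestamp.
func timeline(logs, traceID string) []entry {
	var out []entry
	for _, line := range strings.Split(logs, "\n") {
		var e entry
		if err := json.Unmarshal([]byte(line), &e); err != nil {
			continue // skip blank or non-JSON lines
		}
		if e.TraceID == traceID {
			out = append(out, e)
		}
	}
	sort.SliceStable(out, func(i, j int) bool {
		return out[i].Time.Before(out[j].Time)
	})
	return out
}

func main() {
	for _, e := range timeline(sampleLogs, "abc-123") {
		fmt.Println(e.Time.Format(time.RFC3339Nano), e.Service, e.Msg)
	}
}
```

Sorting on parsed timestamps only gives a faithful order if the services' clocks are reasonably well synchronized, which is one of the gaps that proper tracing tools address with explicit parent and child spans.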
What comes next
Once you have trace IDs flowing, you’re one step away from proper distributed tracing with OpenTelemetry. The manual approach above is a good way to understand what tracing actually does before adding the framework overhead.
The state machine of a typical request looks like this:
stateDiagram-v2
[*] --> Received
Received --> Authenticated: auth ok
Received --> Failed: auth error
Authenticated --> Processing
Processing --> Queued: async path
Processing --> Responded: sync path
Queued --> Completed
Responded --> [*]
Completed --> [*]
Failed --> [*]
Start simple. Add structure. Then add tooling.