Tracing a Request Through a Distributed System

When something goes wrong in a distributed system, the hardest part is often not fixing it but understanding what happened. This post walks through a technique for tracing a single request across multiple services.

The problem

You have a user-facing API that fans out to several internal services. A request comes in, something fails, and the logs across five different services are a jumble of timestamps and IDs that don’t obviously connect.

The request flow looks like this:

flowchart LR
    Client -->|HTTP| API["API Gateway"]
    API -->|gRPC| Auth["Auth Service"]
    API -->|gRPC| Catalog["Catalog Service"]
    Catalog -->|SQL| DB[(Database)]
    API -->|event| Queue[("Message Queue")]
    Queue --> Worker["Background Worker"]

Propagating a trace ID

The simplest thing that works: generate a UUID at the edge and thread it through every hop.

package middleware

import (
	"context"
	"net/http"

	"github.com/google/uuid"
)

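// contextKey is a private type so the key cannot collide with context keys
// defined in other packages.
type contextKey string

// TraceIDKey is the context key under which the trace ID is stored.
const TraceIDKey contextKey = "trace_id"

// TraceMiddleware reuses the caller's X-Trace-ID header if present, generates
// a new UUID otherwise, and stores the ID in the request context.
func TraceMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		traceID := r.Header.Get("X-Trace-ID")
		if traceID == "" {
			traceID = uuid.New().String()
		}

		ctx := context.WithValue(r.Context(), TraceIDKey, traceID)
		// Echo the ID back so the caller can quote it when reporting a problem.
		w.Header().Set("X-Trace-ID", traceID)
		next.ServeHTTP(w, r.WithContext(ctx))
	})
}

// TraceIDFromContext returns the trace ID stored in ctx, or "" if none is set.
func TraceIDFromContext(ctx context.Context) string {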
	if id, ok := ctx.Value(TraceIDKey).(string); ok {
		return id
	}
	return ""
}

Every service logs with this ID:

// log is assumed to be a structured key/value logger, e.g. a *slog.Logger.
log.Info("request received",
	"trace_id", middleware.TraceIDFromContext(ctx),
	"service", "catalog",
	"method", "GetProduct",
)

Querying across services

With structured logs and a consistent trace_id field, you can pull the full story:

# Loki / LogQL
{app=~"api|auth|catalog|worker"} | json | trace_id = "abc-123"

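# or with grep if you're old-fashioned
# (-t'T' -k2 sorts on the text after the first 'T', i.e. the time part of an
# ISO 8601 timestamp, so this assumes all matching lines are from the same day)
grep "abc-123" /var/log/services/*.log | sort -t'T' -k2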

What comes next

Once you have trace IDs flowing, you’re one step away from proper distributed tracing with OpenTelemetry. The manual approach above is a good way to understand what tracing actually does before adding the framework overhead.
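
To make that "one step" concrete: the W3C Trace Context standard, which OpenTelemetry can use for propagation, replaces the single X-Trace-ID value with a traceparent header of the form version-traceid-parentid-flags. Compared with the manual approach it adds a per-hop span ID and a sampling flag. The sketch below builds and parses such a value with the standard library only. It is an illustration of the header format, not OpenTelemetry's own code, and the function names are made up. The parser is also lenient in that it accepts uppercase hex, which the standard does not.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"errors"
	"fmt"
	"strings"
)

// randomHex returns n random bytes encoded as 2n lowercase hex characters.
func randomHex(n int) (string, error) {
	b := make([]byte, n)
	if _, err := rand.Read(b); err != nil {
		return "", err
	}
	return hex.EncodeToString(b), nil
}

// newTraceparent builds a version-00 traceparent value with a fresh
// 16-byte trace ID and 8-byte span (parent) ID.
func newTraceparent(sampled bool) (string, error) {
	traceID, err := randomHex(16)
	if err != nil {
		return "", err
	}
	spanID, err := randomHex(8)
	if err != nil {
		return "", err
	}
	flags := "00"
	if sampled {
		flags = "01"
	}
	return fmt.Sprintf("00-%s-%s-%s", traceID, spanID, flags), nil
}

// parseTraceparent splits a traceparent value into its trace ID, span ID,
// and sampled flag, checking field lengths and hex encoding.
func parseTraceparent(h string) (traceID, spanID string, sampled bool, err error) {
	parts := strings.Split(h, "-")
	if len(parts) != 4 || len(parts[0]) != 2 || len(parts[1]) != 32 ||
		len(parts[2]) != 16 || len(parts[3]) != 2 {
		return "", "", false, errors.New("malformed traceparent")
	}
	for _, p := range parts {
		if _, err := hex.DecodeString(p); err != nil {
			return "", "", false, errors.New("traceparent fields must be hex")
		}
	}
	flags, _ := hex.DecodeString(parts[3]) // already validated above
	return parts[1], parts[2], flags[0]&1 == 1, nil
}

func main() {
	tp, err := newTraceparent(true)
	if err != nil {
		panic(err)
	}
	traceID, spanID, sampled, err := parseTraceparent(tp)
	if err != nil {
		panic(err)
	}
	fmt.Println("traceparent:", tp)
	fmt.Println("trace ID:", traceID, "span ID:", spanID, "sampled:", sampled)
}
```

The trace ID plays the same role as the UUID in the middleware. The span ID is what lets a tracing backend reconstruct which hop called which, something a flat trace ID cannot express.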

The state machine of a typical request looks like this:

stateDiagram-v2
    [*] --> Received
    Received --> Authenticated: auth ok
    Received --> Failed: auth error
    Authenticated --> Processing
    Processing --> Queued: async path
    Processing --> Responded: sync path
    Queued --> Completed
    Responded --> [*]
    Completed --> [*]
    Failed --> [*]
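
One way to put the diagram to work is to encode it as a transition table and check that the states logged under a given trace ID form a legal path. This is a minimal sketch. The state names come from the diagram, while the State type and the checkPath helper are made up for illustration.

```go
package main

import "fmt"

// State is one node of the request state diagram.
type State string

const (
	Received      State = "Received"
	Authenticated State = "Authenticated"
	Failed        State = "Failed"
	Processing    State = "Processing"
	Queued        State = "Queued"
	Responded     State = "Responded"
	Completed     State = "Completed"
)

// transitions lists the legal next states for each state. Responded,
// Completed, and Failed are terminal, so they have no entry.
var transitions = map[State][]State{
	Received:      {Authenticated, Failed},
	Authenticated: {Processing},
	Processing:    {Queued, Responded},
	Queued:        {Completed},
}

// canTransition reports whether the diagram has an edge from one state to another.
func canTransition(from, to State) bool {
	for _, next := range transitions[from] {
		if next == to {
			return true
		}
	}
	return false
}

// checkPath verifies that path starts at Received and only follows edges
// that exist in the diagram.
func checkPath(path []State) error {
	if len(path) == 0 || path[0] != Received {
		return fmt.Errorf("path must start at %s", Received)
	}
	for i := 1; i < len(path); i++ {
		if !canTransition(path[i-1], path[i]) {
			return fmt.Errorf("illegal transition %s -> %s", path[i-1], path[i])
		}
	}
	return nil
}

func main() {
	// The async path from the diagram.
	fmt.Println(checkPath([]State{Received, Authenticated, Processing, Queued, Completed})) // prints <nil>
	// A request that skipped authentication.
	fmt.Println(checkPath([]State{Received, Processing})) // prints illegal transition Received -> Processing
}
```

A trace whose logged states fail this check points either at a missing log line or at a request that took a path the design does not allow, and both are worth knowing about.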

Start simple. Add structure. Then add tooling.