OpenTelemetry and service-mesh metrics close different observability gaps

OpenTelemetry has made application telemetry feel less like a vendor bet and more like ordinary engineering plumbing. Instrument the code, export metrics, traces, and logs, then send them through a Collector into whatever backend the team trusts. That is a solid operating model. It is also incomplete.

The CNCF post on OTel and mesh-derived metrics makes the gap concrete. If your applications emit good OpenTelemetry data, you can see what the code knows about itself: business counters, custom dimensions, spans, errors, and internal timing. A service mesh sees a different layer. Linkerd's proxy sits on the request path and observes east-west service traffic directly, including request counts, response classifications, latency, TCP activity, and mTLS identity. The app does not have to be changed for the proxy to see that traffic.

Application telemetry tells you what the code reports. Mesh telemetry tells you what the service boundary experienced.

Two Truths, One Incident

The useful point is not that mesh metrics are better than application metrics. The two measure different things. Application metrics carry domain meaning. Only the checkout service can know that a request represented a payment attempt, a cart update, or a recommendation event. Only the app can attach the labels that come from product logic, tenant behavior, and feature flags.

The proxy has the advantage at the boundary. It can count every meshed request without waiting for a developer to add instrumentation. It can attach caller identity from mTLS. It can observe the destination workload, response classification, HTTP status, gRPC status, and network-level timing. That makes it useful for questions that sit between teams: did this service actually talk to that one, did the call fail, which identity made it, and did latency appear before or inside the application?

application layer: business metric, span, error detail, custom labels
mesh layer: caller identity, east-west request rate, success rate, latency
trace layer: causal path through the services involved

The layers overlap, but they do not replace each other.

The gRPC Trap Is a Good Example

One of the more practical details in the CNCF article is a failure shape that many dashboards miss. A gRPC call can carry an HTTP status that looks successful, typically 200, while the grpc-status value in the trailers reports failure. If an alert only watches HTTP status codes, that failure can stay quiet. Linkerd's proxy can classify the response using gRPC status and mark the call as failed. The mesh can say the boundary failed; a distributed trace can then show the exact span and error that explain why.
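To make the trap concrete, here is a minimal sketch of the two classification rules side by side. It is not Linkerd's implementation. The function names are hypothetical, and the fallback rule (treat HTTP 5xx as failure) is an assumption chosen for illustration. The one protocol fact it relies on is that gRPC reports its result in a grpc-status field, where "0" means OK.

```python
from typing import Mapping, Optional

GRPC_OK = "0"  # grpc-status value for OK in gRPC over HTTP/2


def http_only_is_failure(http_status: int) -> bool:
    """The check an HTTP-status-only alert effectively performs (assumed: 5xx fails)."""
    return http_status >= 500


def is_failure(http_status: int, trailers: Optional[Mapping[str, str]] = None) -> bool:
    """Classify a response, preferring the gRPC status when one is present."""
    grpc_status = (trailers or {}).get("grpc-status")
    if grpc_status is not None:
        # A gRPC call can fail while the HTTP status is still 200.
        return grpc_status != GRPC_OK
    # No gRPC status available: fall back to plain HTTP semantics.
    return http_only_is_failure(http_status)


# HTTP says success, the gRPC trailer says UNAVAILABLE (code 14).
print(http_only_is_failure(200))               # False
print(is_failure(200, {"grpc-status": "14"}))  # True
```

The same response is healthy under one rule and failed under the other, which is why an alert built only on HTTP status can miss it.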

That is the right division of labor. The mesh should be good at broad, low-friction service-boundary visibility. Traces should be good at root cause. Application metrics should be good at business and code-level semantics. Teams get into trouble when they ask one layer to pretend it is all three.

The Collector Is the Control Point

The reference stack in the CNCF post uses the OpenTelemetry Collector to add mesh metrics without disturbing the existing application metrics path. A dedicated Prometheus receiver discovers Linkerd proxy containers, scrapes the proxy metrics port, filters the metric families, adds a layer=mesh resource attribute, enriches the data with Kubernetes metadata, and sends it to a metrics backend. Grafana can then compare mesh latency and app latency in the same view.

That pattern matters more than the specific backend. The article uses VictoriaMetrics and Grafana, but the important architectural move is keeping the mesh pipeline explicit. The team can see which data came from the proxy, which data came from app instrumentation, and which processor added Kubernetes context. That is easier to reason about than blending every series into one undifferentiated metrics soup.

linkerd-proxy scrape
  -> filter kept metric families
  -> add layer=mesh
  -> attach Kubernetes metadata
  -> write to metrics backend
  -> compare with app OTel series

A separate mesh pipeline makes the source of truth visible instead of magical.
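The stages above can be modeled in a few lines of plain Python. This is a toy model of the data flow, not Collector configuration: the Sample type, the allow-list, the pod label, and the metadata lookup are all assumptions made for illustration, and a real deployment would express the same steps as Collector receivers and processors.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Mapping

# Hypothetical allow-list; the families worth keeping depend on the workflow.
KEPT_FAMILIES = ("request_total", "response_total", "response_latency_ms")


@dataclass
class Sample:
    name: str
    value: float
    labels: Dict[str, str] = field(default_factory=dict)


def keep(sample: Sample) -> bool:
    # Histogram series carry _bucket/_sum/_count suffixes, so match by prefix.
    return sample.name.startswith(KEPT_FAMILIES)


def mesh_pipeline(
    samples: Iterable[Sample],
    pod_metadata: Mapping[str, Mapping[str, str]],
) -> List[Sample]:
    """Filter, enrich, and tag scraped proxy samples without mutating the input."""
    out: List[Sample] = []
    for s in samples:
        if not keep(s):
            continue  # dropped before it costs storage
        labels = dict(s.labels)
        # Attach Kubernetes context looked up by the (assumed) pod label.
        labels.update(pod_metadata.get(labels.get("pod", ""), {}))
        # Tag provenance last so enrichment cannot overwrite it.
        labels["layer"] = "mesh"
        out.append(Sample(s.name, s.value, labels))
    return out
```

Every surviving series carries layer=mesh, so a dashboard query can separate proxy-derived data from application series without guessing by metric name.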

Cardinality Is Still the Bill

The less glamorous warning is the one operators should read twice. Proxy metrics can carry a lot of labels. The CNCF lab found a single latency histogram series carrying dozens of labels after proxy output and Kubernetes enrichment were combined. Histograms multiply that cost again because every label combination gets its own series per bucket. Filtering metric names helps, but it does not reduce cardinality inside the metric families you keep.
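A rough back-of-the-envelope calculation shows how quickly this adds up. The sketch below assumes a Prometheus-style histogram, where each label combination produces one series per bucket (including +Inf) plus a _sum and a _count series. The label names and value counts are invented for illustration and are not figures from the CNCF lab.

```python
from math import prod
from typing import Mapping


def histogram_series(label_values: Mapping[str, int], buckets: int) -> int:
    """Upper bound on series for one histogram family.

    label_values maps each label name to its number of distinct values.
    buckets counts the bucket series, including the +Inf bucket.
    """
    combinations = prod(label_values.values())
    return combinations * (buckets + 2)  # buckets plus _sum and _count


# Hypothetical label counts for a single latency histogram.
labels = {"dst_deployment": 20, "client_id": 15, "status_code": 5}
print(histogram_series(labels, buckets=26))  # 42000

# Dropping one label the workflow does not need shrinks the bound sharply.
trimmed = {k: v for k, v in labels.items() if k != "client_id"}
print(histogram_series(trimmed, buckets=26))  # 2800
```

This is an upper bound, since not every label combination occurs in practice, but it shows why the labels that survive matter as much as the metric names.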

This is where observability architecture becomes cost control. Scraping every proxy metric because it is available is not a strategy. Keeping only the families that support the workflow is a strategy. The CNCF post focuses on request and response counts, response latency, and selected TCP metrics. It also calls out the difference between filtering at scrape time and filtering later with OTTL in the Collector. Those choices decide how many samples enter the pipeline and how many labels survive into storage.

Operational note: a mesh pipeline should be designed as a budgeted data product. Define the questions first, then keep the metric families and labels that answer them.

The Better Mental Model

The clean mental model is simple. OpenTelemetry tells you what the application chose, or was configured, to emit. A service mesh tells you what crossed the service boundary. Distributed traces connect the evidence into a path. None of those layers should be treated as optional if the system is large enough that teams debug across service ownership lines.

This is especially relevant for platforms trying to make observability a paved road. Developers should not have to rediscover every boundary metric by hand. Platform teams can provide a mesh metrics pipeline that gives every meshed workload baseline request rate, success rate, latency, caller identity, and Kubernetes context. Developers can then add application metrics where domain meaning matters. The result is not less instrumentation. It is better placement of instrumentation.

Use mesh metrics for east-west traffic, identity, coarse health, and service-boundary latency.
Use app metrics for business state, product events, and custom dimensions.
Use traces for causality, root cause, and the exact failing span.
Use the Collector to keep provenance, filtering, enrichment, and export policy explicit.

The Takeaway

The CNCF reference is useful because it turns a common observability slogan into an operational pattern. Full-stack visibility is not one giant dashboard. It is a set of measurement layers with clear ownership. The mesh watches the boundary. The app reports its intent and domain state. Traces connect the path. The Collector makes the data path auditable.

That is the practical lesson for cloud-native teams already running OpenTelemetry. If the pipeline sees only what applications emit, it can miss failures that happen at the service boundary or in protocol details like gRPC trailers. If the mesh sees only the boundary, it can flag failure without explaining the business or code-level cause. Put both in the same observability system, label them honestly, and the incident conversation gets shorter.

The mesh does not know everything. It knows something your app often does not: what actually happened between services.
