Jaeger ClickHouse backend makes trace storage a database design problem

Distributed tracing has always had a quiet storage problem. A trace looks like a debugging artifact when you are staring at one slow request. At fleet scale, it is an append-only event stream with awkward dimensions: service, operation, duration, timestamp, trace ID, span attributes, resource attributes, events, links, and enough repetition to make a columnar database start smiling.

That is why CNCF's write-up on Jaeger's new ClickHouse backend is worth more than the headline number. The headline is good: a benchmark with 10 million spans across 1 million traces reported an 8.6x compression ratio on the spans table, reducing roughly 5.99 GiB of uncompressed span data to about 722 MiB. The quieter story is better: Jaeger is treating trace storage as a database design problem again, not just as a place to throw telemetry until retention gets expensive.

Observability gets cheaper when storage understands the questions operators actually ask.

The useful number is not only compression

Compression matters because traces are repetitive. The same service names, operation names, status codes, attribute keys, and deployment labels appear over and over. A row-oriented layout can store that repetition honestly and expensively. A columnar layout can group similar values, compress them aggressively, and avoid reading columns that a query does not need.

But compression by itself is not the win. Cold data that nobody can search is just a smaller liability. The benchmark numbers are interesting because they pair storage reduction with interactive query behavior: around 52k spans per second during ingestion on the reported single-node setup, roughly 100 ms trace retrieval by ID, and most structural searches landing well below a second. That is the shape operators want. Keep enough traces to debug, search them without waiting forever, and avoid paying index tax on every field as if every attribute deserves first-class treatment.

Trace storage is not one query:

get trace by id
search service + operation + time
filter by duration
filter by attributes
list services and operations
compute service performance metrics

Each query pulls the schema in a different direction. The backend has to choose which path gets the fastest lane.

The primary key is the argument

The most instructive part of the design is the primary key decision. In ClickHouse, the primary key describes sort order and sparse indexing behavior. It is not a uniqueness rule. That means it becomes a statement about the queries you expect to protect.
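To make the sort-order point concrete, here is a minimal Python sketch of a sparse index, not ClickHouse's actual implementation. It assumes a hypothetical sort key of (service, operation, start_time), a toy granule size of 4 rows, and invented row data. Only the first key of each granule is kept as a "mark", and a search uses those marks to decide which granules to read.

```python
from bisect import bisect_left, bisect_right
from itertools import product

GRANULE = 4  # rows per index mark; ClickHouse's default is 8192, kept tiny here

# Hypothetical spans: (service, operation, start_time, trace_id),
# stored sorted by the first three fields (the assumed sort key).
rows = sorted(
    (svc, op, ts, f"trace-{i}")
    for i, (svc, op, ts) in enumerate(
        product(["cart", "checkout", "search"], ["GET", "POST"], range(10))
    )
)

# Sparse index: one entry per granule, holding only that granule's first key.
marks = [rows[i][:3] for i in range(0, len(rows), GRANULE)]

def search(service, operation, t_from, t_to):
    """Return (matching rows, rows read), using the marks to pick granules."""
    # The granule just before the first mark >= the lower bound may still
    # contain matching rows, so start one granule early.
    first = max(bisect_left(marks, (service, operation, t_from)) - 1, 0)
    # Granules whose first key is already past the upper bound cannot match.
    last = bisect_right(marks, (service, operation, t_to))  # exclusive
    candidates = rows[first * GRANULE : last * GRANULE]
    hits = [
        r for r in candidates
        if r[0] == service and r[1] == operation and t_from <= r[2] <= t_to
    ]
    return hits, len(candidates)

hits, read = search("checkout", "GET", 3, 6)
print(len(hits), "matches,", read, "of", len(rows), "rows read")  # 4 matches, 8 of 60 rows read
```

A lookup by trace_id gets no help from these marks, because trace IDs are not part of the sort key. Whichever columns lead the key are the ones whose filters become cheap.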

Jaeger's ClickHouse backend had an obvious temptation: sort by trace_id. That makes a full trace retrieval neat because spans from the same trace land near each other. The problem is that the Jaeger UI and most incident work do not begin with a known trace ID. They begin with a service, an operation, a time window, a duration threshold, or a suspicious attribute.

The backend instead sorts by service_name, name (the span's operation name), and start_time. That favors search. Trace retrieval gets help from a bloom filter skip index on trace_id and a materialized view that stores trace time bounds. In plain terms: the system optimizes for finding the trace, then adds enough auxiliary structure to make fetching the trace acceptable.
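The skip-index idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Jaeger's or ClickHouse's code: the Bloom class, the granule size, the one-filter-per-granule layout, and the row data are all hypothetical. Rows are sorted for search, so a trace's spans are scattered, and a small bloom filter per granule lets a trace ID lookup skip granules that cannot contain the ID.

```python
import hashlib

class Bloom:
    """Tiny bloom filter: may report false positives, never false negatives."""

    def __init__(self, bits=256, hashes=3):
        self.bits, self.hashes, self.field = bits, hashes, 0

    def _positions(self, key):
        for seed in range(self.hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.bits

    def add(self, key):
        for pos in self._positions(key):
            self.field |= 1 << pos

    def might_contain(self, key):
        return all((self.field >> pos) & 1 for pos in self._positions(key))

GRANULE = 4  # toy granule size

# Hypothetical spans sorted by (service, operation, start_time);
# the trace IDs end up scattered across granules.
rows = sorted(
    (svc, "GET", ts, f"trace-{ts % 5}")
    for svc in ("cart", "checkout", "search")
    for ts in range(8)
)
granules = [rows[i:i + GRANULE] for i in range(0, len(rows), GRANULE)]

# One bloom filter per granule, built from the trace IDs it holds.
blooms = []
for granule in granules:
    bloom = Bloom()
    for row in granule:
        bloom.add(row[3])
    blooms.append(bloom)

def fetch_trace(trace_id):
    """Return (spans of the trace, number of granules actually scanned)."""
    spans, scanned = [], 0
    for granule, bloom in zip(granules, blooms):
        if not bloom.might_contain(trace_id):
            continue  # filter says "definitely not here": skip the granule
        scanned += 1
        spans.extend(r for r in granule if r[3] == trace_id)
    return spans, scanned

spans, scanned = fetch_trace("trace-3")
print(len(spans), "spans found;", scanned, "of", len(granules), "granules scanned")
```

The filter can only rule granules out, so an occasional false positive costs an extra scan but never a missed span. That is why it works as an auxiliary structure rather than a replacement for the sort key.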

That is a healthy trade. A trace store that can instantly fetch a trace ID nobody has found yet is not very useful during an outage. Most debugging starts as a narrowing operation.

Typed attributes are where the bill hides

The other useful detail is typed attributes. Modern OpenTelemetry data is not just string tags attached to spans. Attributes can be booleans, integers, floats, strings, bytes, arrays, or maps, and they can live at different levels: resource, scope, span, event, or link.

That makes a simple-looking filter like http.status_code = 500 more annoying than it appears. Is the value a string or an integer? Was it recorded on the span or the resource? Did one service emit it one way and another service emit it differently? If the storage layer collapses everything into strings, it can simplify writes while making query semantics less precise. If it preserves type and level, it has to carry more metadata.

The ClickHouse implementation leans into that metadata. It stores values in type-specific structures and maintains attribute metadata so the reader can ask the right columns instead of blindly scanning every possible place an attribute might live.
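A rough Python sketch of that idea follows. It is an assumption-heavy illustration, not the Jaeger schema: the function names, the two levels shown (span and resource), and the bucket layout are all hypothetical. Values are written into per-level, per-type buckets, and a small metadata map records where each key has been seen so a reader probes only the relevant buckets.

```python
from collections import defaultdict

# Attribute metadata: key -> set of (level, type_name) pairs seen at write time.
metadata = defaultdict(set)
spans = []

def type_name(value):
    # bool must be checked before int: bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "bool"
    if isinstance(value, int):
        return "int"
    if isinstance(value, float):
        return "double"
    return "str"

def write_span(span_id, span_attrs, resource_attrs):
    """Store attributes in (level, type) buckets and record where each key lives."""
    typed = defaultdict(dict)  # (level, type_name) -> {key: value}
    for level, attrs in (("span", span_attrs), ("resource", resource_attrs)):
        for key, value in attrs.items():
            bucket = (level, type_name(value))
            typed[bucket][key] = value
            metadata[key].add(bucket)
    spans.append({"id": span_id, "attrs": typed})

def find(key, value):
    """Return ids of spans where key equals value, probing only the buckets
    that metadata says hold this key with the same type as value."""
    wanted = type_name(value)
    buckets = [b for b in metadata.get(key, ()) if b[1] == wanted]
    return [
        s["id"] for s in spans
        if any(s["attrs"].get(b, {}).get(key) == value for b in buckets)
    ]

write_span("a", {"http.status_code": 500}, {"service.name": "cart"})
write_span("b", {"http.status_code": "500"}, {"service.name": "cart"})
write_span("c", {}, {"http.status_code": 500, "service.name": "edge"})

print(find("http.status_code", 500))    # ['a', 'c']
print(find("http.status_code", "500"))  # ['b']
print(sorted(metadata["http.status_code"]))
```

The integer 500 and the string "500" stay distinct, and the metadata shows that the key was seen in three places. That is the extra bookkeeping the typed approach pays for more precise queries.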

Operational note: attribute-only searches remain expensive because they cannot fully use the primary index. The practical pattern is still to combine attributes with service, operation, or time filters. That is the database telling the operator which predicates are structural and which ones are late-stage refinements.

Materialized views are part of the product

Tracing systems are judged by workflow latency, not only database latency. The UI needs fast lists of services and operations. Search often needs trace time bounds. Service performance monitoring wants latency, call rate, and error-rate signals from the same stored spans. If those answers require wide scans every time, the system feels slow even when individual inserts look fine.

ClickHouse materialized views let Jaeger precompute some of those shapes at write time. That changes the cost model. Instead of pretending the spans table can answer every question equally well, the backend admits that some questions deserve derived tables because they are part of the product experience.
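The write-time precomputation can be sketched in plain Python. This is a toy model of the pattern, not the views Jaeger actually defines: the table names, the insert_span signature, and the sample spans are hypothetical. Every insert updates small derived structures, so list and bounds questions never touch the spans table.

```python
from collections import defaultdict

spans_table = []                           # the wide base table
operations_by_service = defaultdict(set)   # derived: feeds service/operation lists
trace_bounds = {}                          # derived: trace_id -> (min_start, max_end)

def insert_span(trace_id, service, operation, start, duration):
    """Append to the base table and update derived state in the same write."""
    spans_table.append((trace_id, service, operation, start, duration))
    operations_by_service[service].add(operation)
    end = start + duration
    lo, hi = trace_bounds.get(trace_id, (start, end))
    trace_bounds[trace_id] = (min(lo, start), max(hi, end))

def list_services():
    # Answered from derived state only; no scan of spans_table.
    return sorted(operations_by_service)

def list_operations(service):
    return sorted(operations_by_service.get(service, ()))

insert_span("t1", "cart", "GET /cart", start=100, duration=30)
insert_span("t1", "checkout", "POST /pay", start=110, duration=50)
insert_span("t2", "cart", "POST /cart", start=200, duration=5)

print(list_services())          # ['cart', 'checkout']
print(list_operations("cart"))  # ['GET /cart', 'POST /cart']
print(trace_bounds["t1"])       # (100, 160)
```

The trade is a little extra work on every write in exchange for lookups whose cost no longer grows with the size of the spans table.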

question                         storage pressure
services list                    precompute small lookup state
service + time search            favor structural search keys
trace ID fetch                   use skip indexes and time bounds
arbitrary attribute filter       constrain first, then refine

The practical design is not one universal index. It is a set of cheap paths for the workflows operators repeat.

Alpha means the design is still moving

There is one important caveat: this is alpha ClickHouse support in Jaeger v2.18.0. That matters. The benchmark is a useful signal, not a universal promise. A single-node benchmark on a dataset with a controlled shape is not your production fleet with uneven services, noisy attributes, retention rules, compaction pressure, and dashboards built by people who discover new cardinality accidents every Friday afternoon.

Still, alpha does not make the architecture uninteresting. It makes the architecture worth watching. The project is now encoding a specific view of what trace storage should optimize for: cheap repetition, fast structural search, careful typed attributes, and derived state for common UI paths.

That is a better conversation than whether observability is expensive in the abstract. Of course it is. The sharper question is whether your storage layer is spending money on the parts of observability that shorten incidents.

The takeaway

The ClickHouse backend is not just another Jaeger storage option. It is a reminder that telemetry is data, and data systems reward honest modeling. If traces are mostly searched by service, operation, and time, make that cheap. If trace IDs are needed after search, give them supporting indexes. If attributes are typed and scattered across levels, preserve enough metadata to query them correctly. If the UI asks the same list questions all day, precompute them.

Observability platforms often sell the feeling that every question can be free if the pipeline is clever enough. Databases are less sentimental. They make you choose. Jaeger's ClickHouse work is interesting because it chooses in public.

That is the real lesson hiding behind the 8.6x number: the storage layout is an operating model. It decides which questions are first-class, which questions are expensive, and how long your team has to wait while a bad deploy is still unfolding.
