Evidence That Holds
Building artifact pipelines where every record has to be traceable, timestamped, and reviewable by someone who wasn't there.
- Filed: 2026-04
- Jurisdiction: DE · Evidentiary
- Practice: legaltech
- Stack: Typed schemas · Audit trails · ETL
- Read: 7 min
The problem
Every piece of software that interacts with the legal system eventually runs into a quiet, devastating question. It is usually asked by a lawyer, in a calm voice, in an email with no exclamation marks:
Where did this record come from, and can you prove it?
If the answer is "it came out of a scraper six months ago and we've done some processing on it since," the record is not evidence. It is a rumor. A well-meaning, carefully-indexed, richly-joined rumor, but a rumor nonetheless. The courts have a word for this class of artifact — hearsay — and they have extremely developed opinions about it.
The problem is not that civic-tech systems lie. It is that they very rarely design for the moment, months after the fact, when someone asks them to testify about what they did. Most data pipelines are built for answering questions today, not for reconstructing what was true yesterday. The two are different products, and the gap between them is where most civic-tech platforms quietly disqualify themselves from being useful in the one setting where they could matter most.
The constraint
Three constraints define the problem:
- Originality is non-negotiable. For a record to be defensible, you have to be able to demonstrate, with evidence of your own, that the thing you're showing a reviewer is the same thing you captured originally. A cleaned, enriched, joined version is useful for product features. It is not useful for a legal review six months later unless the original is still somewhere.
- Time is not a column, it is a dimension. Knowing when a record was observed is different from knowing when it was created, which is different from knowing when it was last modified, which is different from knowing when your processing pipeline first touched it. All four matter. Collapsing them into a single created_at field is the single most common mistake I see.
- Provenance has to survive a rebuild. If your database fills up and you have to migrate, or your pipeline gets refactored, or you reprocess six months of history with a new schema, the provenance of every record has to be intact on the other side. This is a schema design problem, not an ops problem.
The approach
The working pattern — evidence-grade data, if you want a name — rests on four rules that sound simple and are unpleasant to enforce.
Rule 1: Capture is immutable. Everything else is a projection. Every record your pipeline touches has exactly one origin event: the moment it was first observed by your system. That event produces an immutable capture artifact — raw, untouched, written once — with a source identifier, a timestamp (UTC, with timezone offset preserved), and a monotonic sequence number. Everything downstream — parsing, enrichment, normalization, indexing, deduplication — is a projection of that capture, not a replacement for it. The original is not thrown away. Ever.
This rule is annoying. It means more storage. It means two writes instead of one. It means the "cleaned" table and the "raw" table have to stay in sync. Every team I've seen try to skip this rule has regretted it within eighteen months.
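A minimal sketch of what a capture artifact might look like. All names here are illustrative, not from the original; the point is the shape: write-once, hash-sealed, with the timestamp and sequence number baked in at the moment of observation.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)  # frozen: the capture artifact is immutable once created
class Capture:
    source_id: str     # identifier of the upstream source
    observed_at: str   # UTC timestamp, ISO 8601, offset preserved
    sequence: int      # monotonic sequence number per source
    payload: bytes     # the raw bytes, untouched
    content_hash: str  # SHA-256 of the payload, for later integrity checks

def capture(source_id: str, sequence: int, payload: bytes) -> Capture:
    """The one origin event. Everything downstream is a projection of this."""
    return Capture(
        source_id=source_id,
        observed_at=datetime.now(timezone.utc).isoformat(),
        sequence=sequence,
        payload=payload,
        content_hash=hashlib.sha256(payload).hexdigest(),
    )
```

The frozen dataclass makes the immutability mechanical rather than a matter of team discipline: any attempt to mutate a capture after the fact raises an error.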
Rule 2: Every record carries its pedigree. Every row in every table — not just the capture table — knows where it came from. That means a foreign key (or a stable content hash) back to the original capture, plus a version identifier for the schema and the pipeline code that produced it. When someone asks how did this record get this value, you can answer by walking the pedigree backwards through versioned transformations to the original event.
Rule 3: Time is a first-class tuple, not a timestamp. Every record carries at least: observed_at (when your system saw it), occurred_at (when the underlying event happened, to the best of your knowledge), and ingested_at (when it entered your database). For many domains you'll also want modified_at and deleted_at as separate fields rather than a flag. Collapsing these is a compression of truth, and compressions of truth are exactly what lawyers interrogate.
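The time tuple can be sketched as a small structure with a sanity check; the ordering invariant (a record cannot enter the database before it was observed, nor be observed before the event it describes) is exactly the kind of check that a single collapsed timestamp makes impossible. Field names follow the rule above; the helper is an illustrative addition.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class EventTimes:
    observed_at: datetime                   # when our system saw it
    ingested_at: datetime                   # when it entered our database
    occurred_at: Optional[datetime] = None  # when the event happened, if known
    modified_at: Optional[datetime] = None  # separate fields, not a flag
    deleted_at: Optional[datetime] = None

    def plausible(self) -> bool:
        # A record cannot be ingested before it was observed, and cannot
        # be observed before the underlying event it describes.
        if self.ingested_at < self.observed_at:
            return False
        if self.occurred_at is not None and self.observed_at < self.occurred_at:
            return False
        return True
```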
Rule 4: Exports are reproducible by construction. Every report or export your system generates carries its own metadata: the query that produced it, the schema version of the underlying data, the capture range it drew from, and a hash of the output. Six months later, someone handed the export should be able to regenerate it — bit-for-bit — by running the same query against the same capture range at the same schema version. If they can't, the export isn't evidence, it's decoration.
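The export metadata can be sketched as a manifest that travels with the output, plus a verification function for the person holding it six months later (names are illustrative):

```python
import hashlib
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class ExportManifest:
    query: str                       # the exact query that produced the export
    schema_version: int              # schema in effect at export time
    capture_range: Tuple[int, int]   # (first, last) capture sequence drawn from
    output_hash: str                 # SHA-256 of the exported bytes

def make_export(query: str, schema_version: int,
                capture_range: Tuple[int, int], output: bytes) -> ExportManifest:
    return ExportManifest(query, schema_version, capture_range,
                          hashlib.sha256(output).hexdigest())

def reproduces(manifest: ExportManifest, regenerated: bytes) -> bool:
    """Six months later: re-run the recorded query, compare bit-for-bit."""
    return hashlib.sha256(regenerated).hexdigest() == manifest.output_hash
```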
The build
The scaffolding this produces, in practice:
- A raw capture store. Append-only, never updated, never deleted without a formal retention policy. One row per original observation. Compressed cold storage is fine. Read latency is not a requirement.
- A normalized store. The working tables your product actually queries. Each row carries a foreign key or content hash to the raw capture plus a pipeline version. Can be rebuilt from the capture store at any time.
- A schema registry. A versioned record of every shape the normalized store has taken over time. New schemas are additive; destructive migrations are rare and accompanied by a note explaining the legal reasoning for the loss.
- An export generator. Reports, case packages, legal-desk submissions — all generated through the same pipeline, each one carrying the self-describing metadata that lets it be reproduced later. Never hand-assembled.
- An audit log. Who ran which export, against which capture range, producing which hash, at what time. The audit log is itself captured to the raw store, because audit logs become evidence too.
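The capture store's append-only property can be enforced at the database layer rather than by convention. A minimal sketch using SQLite triggers (table and trigger names are illustrative; a production system would likely use a proper warehouse, but the mechanism is the same):

```python
import hashlib
import sqlite3
from datetime import datetime, timezone

def open_capture_store(path: str = ":memory:") -> sqlite3.Connection:
    db = sqlite3.connect(path)
    db.execute("""
        CREATE TABLE IF NOT EXISTS capture (
            sequence     INTEGER PRIMARY KEY AUTOINCREMENT,
            source_id    TEXT NOT NULL,
            observed_at  TEXT NOT NULL,
            payload      BLOB NOT NULL,
            content_hash TEXT NOT NULL
        )
    """)
    # Enforce append-only at the storage layer, not just by team discipline.
    db.execute("""
        CREATE TRIGGER IF NOT EXISTS no_update BEFORE UPDATE ON capture
        BEGIN SELECT RAISE(ABORT, 'capture store is append-only'); END
    """)
    db.execute("""
        CREATE TRIGGER IF NOT EXISTS no_delete BEFORE DELETE ON capture
        BEGIN SELECT RAISE(ABORT, 'deletes require a retention policy'); END
    """)
    return db

def append_capture(db: sqlite3.Connection, source_id: str, payload: bytes) -> int:
    cur = db.execute(
        "INSERT INTO capture (source_id, observed_at, payload, content_hash) "
        "VALUES (?, ?, ?, ?)",
        (source_id, datetime.now(timezone.utc).isoformat(), payload,
         hashlib.sha256(payload).hexdigest()),
    )
    return cur.lastrowid
```

Any UPDATE or DELETE against the capture table aborts with an error, which turns "we never modify captures" from a code-review rule into a guarantee.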
None of this is unusual in regulated industries — banking, healthcare, aviation all do something similar. What's unusual is seeing it in a civic-tech startup that's focused on shipping. The absence is almost always forgivable until it isn't.
The outcome
The difference, when you live with this pattern for a while, is that the quiet question from the lawyer no longer causes panic. When someone asks where did this come from, the answer is a URL, a timestamp, and a diff against the schema version in effect at the time. When someone asks can you prove it, the answer is run this query against this range, here is the hash of the output. The question is no longer a threat. It's a lookup.
The other outcome — the one that matters more for the product — is that the team stops arguing about the data. When a user says "this number is wrong," the conversation is not a debate; it is a walk back through the pedigree until you find the disagreement. Most of the time the number is not wrong. The fraction of the time that it is, you find out why in minutes instead of days.
Aftermath
What I'd insist on, next time, from day one:
A retention policy with legal authorship. I used to leave retention rules to the ops team. I will not again. Retention decisions — what gets kept forever, what ages out, what gets anonymized, what gets destroyed — are legal decisions with operational consequences. They should be written down before the first capture, signed off by whoever is going to carry the legal risk, and revisited on a schedule.
Schema versioning before schema design. Every table gets a pipeline version column before it gets its first real column. The version is added during the first migration, not the fifth, and every row written through the pipeline is tagged with the version of the pipeline code that produced it. Retrofitting this later is a month of work. Doing it first is an afternoon.
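What "before it gets its first real column" might look like in migration 001, as a sketch (table and column names are illustrative):

```python
import sqlite3

def migration_001(db: sqlite3.Connection) -> None:
    """First migration: the version columns land before any domain columns."""
    db.execute("""
        CREATE TABLE IF NOT EXISTS records (
            id               INTEGER PRIMARY KEY,
            pipeline_version TEXT NOT NULL,           -- tagged on every row
            schema_version   INTEGER NOT NULL DEFAULT 1
        )
    """)
```

With NOT NULL in place from the start, no row can ever be written without declaring which pipeline produced it; that is the property that is a month of work to retrofit later.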
A regular reproducibility drill. Once a quarter, take a real export from three months ago and try to reproduce it from the capture store and the schema registry. If you can't, you have a problem that will only get worse. Find it in the drill, not the deposition.
The largest lesson, which I will put plainly and not repeat: in any domain where software touches the legal system, the data pipeline is the product. The user interface is a lens on it. The features are convenience layers above it. If the pipeline cannot answer the lawyer's question, none of the work above it matters — which means the pipeline is where the discipline has to live. Build for the question you hope no one asks, and the features you ship on top will have something real to stand on.