Why we optimize for removal, not detection
A detection rate is a vanity metric when the real goal is a comment that's no longer there.
- Filed: 2026-04-05
- Read: 3 min
- Tags: product metrics · civic tech · public interest
In any system that flags bad content, the first metric you reach for is detection rate. It is easy to measure, easy to chart, and easy to improve. Give your model a bigger training set, tune the threshold, and the number goes up. The demo slide writes itself.
Detection is also, in almost every product I have worked on, the wrong thing to optimize.
The goal of a hate-speech detection system on a political account's comment threads is not to find illegal comments. It is for those comments to no longer be on the internet. Those are different things. A detection that never results in a removal is a curiosity. A removal that happens because of a detection is the product working.
The subtle point is that these two numbers pull in opposite directions once you pay attention. If you optimize detection, you will flag more comments, which will grow the review queue, which will exhaust the humans who have to act on it, which will lower the fraction of flags that get acted on. Your detection chart goes up and to the right while your removal rate quietly collapses.
What the good metric looks like
The metric I argue for, in every version of this work I've done: removal rate per flag, weighted by severity, measured over a window long enough to catch a reviewer's actual working rhythm.
Removal rate per flag reframes the entire product. It is a metric that punishes false positives (they dilute the pool), rewards ranking (putting the most-removable flags at the top of the queue), and aligns the engineering team with the humans in the loop rather than against them.
It is also a metric that cannot be gamed by a better model alone. You cannot improve removal rate per flag by training a bigger classifier. You improve it by thinking harder about which flags a reviewer can act on quickly, how the queue is sorted, how the UI surfaces reasoning, how the removal action is wired into the platform. Every part of the product gets to participate.
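As a concrete sketch, the metric above can be computed directly from the review log. This is a minimal, hypothetical implementation; the field names (`severity`, `flagged_at`, `removed`) and the severity scale are illustrative assumptions, not a real schema:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Flag:
    severity: float       # illustrative scale, e.g. 1.0 = mild, 3.0 = severe
    flagged_at: datetime  # when the classifier raised the flag
    removed: bool         # did a reviewer's action end in removal?

def weighted_removal_rate(flags: list[Flag], window: timedelta,
                          now: datetime) -> float:
    """Severity-weighted fraction of in-window flags that led to removal."""
    recent = [f for f in flags if now - f.flagged_at <= window]
    total_weight = sum(f.severity for f in recent)
    if total_weight == 0:
        return 0.0
    removed_weight = sum(f.severity for f in recent if f.removed)
    return removed_weight / total_weight
```

The window should be chosen to match how reviewers actually work (a week, not an hour), so a slow-but-real removal still counts toward the metric.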
The cost of getting this wrong
A team I'm familiar with spent a sprint celebrating a recall improvement. The detection number went up by eight points. The review team, at the same time, quietly stopped touching the queue. The new recall had pulled in a class of borderline comments that were technically within scope but not obviously removable. Every one of them took the reviewer a minute and a half of context-reading to decide, and the decision was usually no. After a week the reviewers began skipping the queue. After two weeks the queue was a graveyard.
The detection chart was beautiful. The removal chart was gone.
The fix was not a different model. The fix was to re-tune the coarse classifier to a higher-precision regime and to surface, in the queue UI, a signal for "this one looks borderline — skip if tired." Detection dropped. Removal recovered. Nobody on the product side argued for bringing back the recall improvement.
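"Re-tune to a higher-precision regime" in practice means raising the flagging threshold until precision on a validation set clears a bar. A hedged sketch of that selection, with illustrative names and no claim about the original system's implementation:

```python
def threshold_for_precision(scores: list[float], labels: list[bool],
                            target: float) -> float:
    """Return the lowest score threshold whose precision >= target.

    scores: classifier scores on a validation set.
    labels: True if the comment was actually removable.
    Falls back to 1.0 (flag nothing) if no threshold meets the target.
    """
    pairs = sorted(zip(scores, labels), reverse=True)
    best = 1.0
    tp = fp = 0
    # Sweep the threshold downward; each step admits one more comment.
    for score, label in pairs:
        tp += label
        fp += not label
        if tp / (tp + fp) >= target:
            best = score  # this lower threshold still meets the precision bar
    return best
```

The trade is explicit: a higher target flags fewer comments (detection drops), but each flag is more likely to be one a reviewer can act on (removal recovers).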
The rule
When your product sits upstream of a human action, your primary metric is the rate at which that human acts. Everything else is diagnostic.
For detection systems, that means removals per flag. For intake systems, it means completed filings per started filing. For legal research tools, it means citations pasted into a brief per session. The shape is the same every time: walk downstream until you find the action the user came to take, measure that, and let the upstream numbers serve the outcome instead of substituting for it.
The temptation to measure the thing you can measure instead of the thing you want is almost irresistible. It feels rigorous. It produces clean charts. It also, very reliably, builds products that do the wrong job well.