Matter I

The Illegal Comment Problem

Detecting criminally illegal speech in comment threads without drowning the humans who have to sign off on removal.

Filed: 2026-03
Jurisdiction: DE · StGB
Practice: civic tech
Stack: LLM pipelines · NestJS · Next.js · Firebase
Read: 6 min

The problem

German criminal law takes online speech more seriously than most jurisdictions. §130 (Volksverhetzung — incitement to hatred), §185 (Beleidigung — insult), §186 (üble Nachrede — defamation), and §187 (Verleumdung — calumny) all apply to a comment on a politician's TikTok video the same way they apply to a pamphlet handed out in a market square. There are real consequences, real prosecutions, and real removal obligations under NetzDG and its successors.

The operational problem is not, as the press sometimes frames it, that hate speech is hard to spot. Any fluent German speaker can read a thread and tell you what crossed the line. The problem is scale, context, and the cost of being wrong in either direction.

A mid-sized political account can attract thousands of comments on a single post. A small team of social-media managers has neither the time nor the legal training to read each one carefully, let alone build a case for removal. Meanwhile every false positive — flagging legal speech as illegal — erodes the team's ability to act on the real violations. Over-flagging is worse than under-flagging, because it poisons the review queue and trains the humans to stop trusting it.

So the problem reframes: how do you detect criminally illegal speech in high-volume comment threads in a way that produces a review queue a human will actually work through — and act on — on a Tuesday afternoon?

The constraint

Four constraints shape everything:

  1. The statutes are fact-specific. §185 depends on intent and context. §187 requires a false assertion of fact. §130 has a public-order threshold. A model that classifies by keyword is not just wrong, it's wrong in a way that gets the user sued for over-removal.
  2. The reviewer is not a lawyer. The output has to make sense to a press officer or a community manager, not a paralegal. If the interface requires reading five pages of reasoning to act on a flag, it will not be acted on.
  3. The goal is removal, not detection. Detection is a vanity metric. What the platform owner actually cares about is: did the illegal comment come down? Everything upstream of that question is overhead.
  4. Cost is not unlimited. LLM inference is a line item. A civic-tech product that costs more per month than a junior reviewer is a product nobody adopts.

The approach

Two moves, one principle.

Move 1: Two-stage assessment. Every comment runs through a cheap, fast coarse classifier first — something small enough that you can pass a full thread through it for the price of a coffee. Its only job is to decide probably clean / probably worth a second look. Maybe 90–95% of comments stop there.

The smaller set that survives goes to a fine classifier: a larger model with a structured prompt that applies each statute's test as a decision procedure, not a vibes check. §186 asks about factual assertions; so the fine stage extracts the factual claim and checks whether it's presented as fact or opinion. §185 asks about the target of the insult; so the fine stage identifies who's being addressed and whether the expression is directed at them personally. The statute's structure becomes the prompt's structure.
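As a sketch, a fine-stage prompt for §186 might encode the statute's test as ordered steps. The wording below is illustrative, not the production prompt:

```typescript
// The statute's structure becomes the prompt's structure: each element of
// §186 (üble Nachrede) is an explicit step, answered before any verdict.
// This wording is an illustrative sketch, not the production prompt.
const PROMPT_186 = (comment: string): string => `
You are assessing a German-language comment under §186 StGB.
Comment: """${comment}"""

Answer each step before giving any judgment:
1. Extract any assertion about another person. Quote it verbatim.
2. Is it presented as a statement of fact or as an opinion?
3. Is it capable of defaming the person or lowering public esteem of them?
4. Is the assertion demonstrably true? If unproven, say "unproven".

Then output JSON: {"elements": {...}, "confidence": 0..1}.
`;
```

Forcing the extraction step before the fact/opinion step mirrors how the statute itself is tested: there is no §186 question at all until a factual assertion has been isolated.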

This is not clever. It's the same pattern a law student uses in an exam. The model is just forced to do the steps in the right order before emitting a judgment.
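The routing between the two stages is mechanical. A minimal sketch, with stub classifier functions and a threshold that are assumptions standing in for the real models:

```typescript
// Two-stage assessment: a cheap coarse pass decides "probably clean" vs
// "worth a second look"; only survivors are sent to the expensive fine pass.
// The classifier interfaces and the threshold are illustrative placeholders.

type FineAssessment = { statute: string; severity: number };

interface Classifiers {
  coarse: (text: string) => Promise<number>;          // 0..1 suspicion score
  fine: (text: string) => Promise<FineAssessment[]>;  // per-statute records
}

async function assessThread(
  comments: string[],
  { coarse, fine }: Classifiers,
  threshold = 0.2, // tuned so ~90-95% of comments stop at the coarse stage
): Promise<Map<string, FineAssessment[]>> {
  const flagged = new Map<string, FineAssessment[]>();
  for (const text of comments) {
    const suspicion = await coarse(text);
    if (suspicion < threshold) continue; // probably clean: stop here
    flagged.set(text, await fine(text)); // second look by the larger model
  }
  return flagged;
}
```

The point of the shape is cost: the fine classifier's price only applies to the small surviving set, so the per-thread cost stays dominated by the cheap pass.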

Move 2: Score, don't label. The output is not "legal" or "illegal." It's a structured record per statute: which elements of the offense are present, which are absent, which are unclear, and how confident the model is. A single severity score falls out of that for sorting the queue, but the reviewer can always see the reasoning behind it. A reviewer who disagrees with the model can disagree with a specific element, not a black box.
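One way the per-statute record might look in TypeScript. The field names and the severity aggregation here are assumptions for illustration, not the production schema:

```typescript
// A per-statute assessment records each element of the offense separately,
// so a reviewer can disagree with a specific element rather than a black box.
// Field names and the severity formula are illustrative.

type ElementFinding = "present" | "absent" | "unclear";

interface StatuteAssessment {
  statute: "§130" | "§185" | "§186" | "§187";
  elements: Record<string, ElementFinding>; // e.g. { factualAssertion: "present" }
  confidence: number;                        // model's self-reported 0..1
  promptHash: string;                        // versions the prompt that produced this
}

// A single severity score falls out for queue sorting: the fraction of
// elements found present, weighted by the model's confidence.
function severity(a: StatuteAssessment): number {
  const findings = Object.values(a.elements);
  if (findings.length === 0) return 0;
  const present = findings.filter((f) => f === "present").length;
  return (present / findings.length) * a.confidence;
}
```

Note the `promptHash` field: because the record is versioned with the prompt that produced it, a later prompt change never silently rewrites the meaning of old assessments.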

The principle: optimize for removal outcomes. A review queue that produces 100 flags, of which 60 get removed, is better than one that produces 500 flags, of which 80 get removed. The first queue is workable. The second is a swamp. Every decision in the system — the coarse threshold, the severity sort, the UI's default ordering — is tuned to maximize the number of acted-on flags per unit of reviewer attention, not the number of flags in total.

This reframing changes everything. Recall is no longer the headline metric. Removal rate per flag is.
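The queue comparison above, worked through as the metric itself:

```typescript
// Removal rate per flag: acted-on flags per unit of reviewer attention.
// A queue of 100 flags with 60 removals scores 0.6; a queue of 500 flags
// with 80 removals scores 0.16 — worse, despite more removals in total.
function removalRatePerFlag(flagged: number, removed: number): number {
  return flagged === 0 ? 0 : removed / flagged;
}
```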

The build

The working architecture, abstracted:

  • A comment ingestion service that pulls threads from a platform and stores raw records, always with source and timestamp.
  • A queue that dispatches each comment through the coarse classifier.
  • A second queue for comments that survive the coarse pass, routed to the fine classifier.
  • A normalized assessment record per comment per statute, versioned with the prompt hash.
  • A reviewer UI that surfaces the structured assessment next to the comment, lets a human confirm or override, and emits a removal request to the platform when confirmed.
  • A report generator that packages confirmed removals into the format a prosecutor or a platform's legal desk can actually receive.

Nothing here is novel. The two-stage pattern is widely discussed. What's unusual is the discipline about what the pipeline is for. Every component can be evaluated against a single question: does it move a flag closer to being acted on?

The outcome

The version of this system I work on is not perfect — I doubt any version ever will be — but it has the property that a non-lawyer can use it to produce a legally coherent removal request in well under a minute per comment. That number used to be closer to five, and five minutes is the difference between a reviewer working through a queue and a reviewer closing the tab.

The quieter outcome is that the product team finally has a metric that means something. When a sprint's goal is "improve removal rate per flag by five points," everyone knows what that means and nobody needs to argue about it.

Aftermath

Three things I'd do differently, or sooner.

Ground the statutes in a living glossary. The first version of the fine classifier's prompts hardcoded the statute definitions. The second version pulled them from a version-controlled glossary that legal advisors could edit without touching code. That should have been there from day one.
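A sketch of what pulling from such a glossary might look like, so that prompts reference a versioned definition instead of hardcoded text. The schema, entry contents, and function names here are assumptions:

```typescript
// Statute definitions live in a version-controlled glossary that legal
// advisors can edit without touching code; prompt builders pull from it.
// Schema and contents are illustrative, not the production glossary.

interface GlossaryEntry {
  statute: string;
  name: string;
  elements: string[]; // elements of the offense, in the order they are tested
  version: string;    // bumped on every legal review
}

const GLOSSARY: Record<string, GlossaryEntry> = {
  "§185": {
    statute: "§185",
    name: "Beleidigung (insult)",
    elements: ["expression of disrespect", "identifiable target", "intent"],
    version: "2026-02-rev3",
  },
};

// Render a statute's test as a numbered prompt fragment, tagged with the
// glossary version so every assessment is traceable to a definition.
function promptFragmentFor(statute: string): string {
  const e = GLOSSARY[statute];
  if (!e) throw new Error(`no glossary entry for ${statute}`);
  return `${e.statute} ${e.name} (glossary ${e.version}):\n` +
    e.elements.map((el, i) => `${i + 1}. ${el}`).join("\n");
}
```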

Make the assessment record queryable. The first schema treated each assessment as a blob. The second version flattened the per-element decisions into columns. Being able to ask "show me all flags where §185's target element was ambiguous" turned out to be the most useful operational query we had. Design for that from the start.
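With per-element decisions flattened into fields, that operational query becomes a plain filter rather than blob parsing. A minimal in-memory sketch, with assumed field names:

```typescript
// Flattened assessment rows: one row per (comment, statute, element)
// decision instead of a JSON blob, so "show me all flags where §185's
// target element was ambiguous" is a simple filter. Names are illustrative.

interface FlatAssessmentRow {
  commentId: string;
  statute: string;
  element: string;                           // e.g. "target"
  finding: "present" | "absent" | "unclear";
}

function ambiguousTargets(rows: FlatAssessmentRow[]): string[] {
  return rows
    .filter((r) => r.statute === "§185" && r.element === "target" && r.finding === "unclear")
    .map((r) => r.commentId);
}
```

The same flattening is what makes the rows indexable in a document store or SQL table, so the query runs on the database instead of in application code.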

Stop treating the reviewer as a bottleneck. For a while, the mental model was that the LLM was doing the work and the human was a necessary checkpoint. That framing made the reviewer's experience worse every sprint. The better framing: the reviewer is the product. The LLM is an assistant that exists to make their judgment faster. Once we flipped that, every UI decision got easier.

The larger lesson, for anyone working in this corner: detection is not the problem. Workflow-grade review is the problem. If you build the model first and the workflow second, you will spend the next year rebuilding the model.