content moderation

Your moderation model is wrong 10% of the time. You just don't know which 10%.

WhiteBox runs every moderation decision through multiple AI models. When they agree, auto-moderate. When they disagree, send to a human. No more silent failures.
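The routing rule above -- agreement means auto-moderate, disagreement means a human sees it -- can be sketched in a few lines of Python. This is a minimal illustration of the idea, not the WhiteBox SDK; the `route` helper and its signature are hypothetical.

```python
from collections import Counter

def route(votes: list[str]) -> tuple[str, bool]:
    """Route one moderation decision.

    votes: one label per model, e.g. ["allow", "allow", "flag"].
    Returns (label, needs_human): unanimous votes auto-apply,
    any disagreement escalates to human review.
    """
    tally = Counter(votes)
    if len(tally) == 1:                       # all models agree
        return votes[0], False                # auto-moderate
    return tally.most_common(1)[0][0], True   # majority label, escalated

# The sarcastic post: one model reads self-harm, two read sarcasm.
label, needs_human = route(["allow", "allow", "self-harm"])
# The disagreement is surfaced instead of silently resolved.
```

The point is the second return value: a single model has no way to say "I might be wrong here"; a disagreeing panel does.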

the problem

The problem with single-model moderation

01
Sarcasm and context

User posts: "Oh great, another update that breaks everything. Just kill me now."

One model flags this as self-harm. Another recognizes sarcasm. WhiteBox catches the disagreement and routes it correctly.

02
Cultural and linguistic nuance

A product review in Spanglish, mixing English and Spanish slang.

One model misclassifies it. Two others get it right. WhiteBox goes with consensus instead of trusting a single confused model.

03
Borderline content

User posts a heated but legitimate political opinion.

Two models say "allow," one says "hate speech," one says "flag." WhiteBox escalates to a human with the full breakdown.

how it works

Multi-model consensus in action

whitebox moderation
auto-moderated
whitebox classify "this game is so trash, the devs should be fired into the sun"
options: ["allow", "flag", "remove"]
01  gpt-4o-mini   allow   logp -0.31
02  claude-3.5    allow   logp -0.28
03  llama-3.3     flag    logp -0.92
04  deepseek-v3   allow   logp -0.44
verdict
allow · confidence 82%
SHIP
note: 1 model flagged -- logged for review
whitebox moderation
auto-removed
whitebox classify "people like you don't belong here, go back where you came from"
options: ["allow", "flag", "remove"]
01  gpt-4o-mini   remove  logp -0.12
02  claude-3.5    remove  logp -0.08
03  llama-3.3     flag    logp -0.67
04  deepseek-v3   remove  logp -0.15
verdict
remove · confidence 91%
SHIP
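One plausible way to turn per-model votes and log-probs like the runs above into a verdict and a confidence score: weight each vote by exp(logprob), the model's probability for its chosen label, then take the winning label's share of total weight. This weighting is an illustrative guess, not WhiteBox's published formula.

```python
import math

def consensus(runs: list[tuple[str, float]]) -> tuple[str, float]:
    """Fold per-model (label, logprob) pairs into a verdict plus a
    confidence score. exp(logprob) converts each log-prob back into
    a probability, so more certain models carry more weight."""
    weights: dict[str, float] = {}
    for label, logp in runs:
        weights[label] = weights.get(label, 0.0) + math.exp(logp)
    verdict = max(weights, key=weights.get)   # highest total weight wins
    confidence = weights[verdict] / sum(weights.values())
    return verdict, confidence

# The four runs from the first demo above:
runs = [("allow", -0.31), ("allow", -0.28), ("flag", -0.92), ("allow", -0.44)]
verdict, confidence = consensus(runs)
# verdict is "allow"; confidence lands in the low-to-mid 80% range,
# close to (though not necessarily matching) the demo's 82%.
```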

Every run, every log-prob, every disagreement -- recorded. Replay any decision from its ID.
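A replayable decision record could look something like the structure below. Field names and the ID are placeholders for illustration; the actual stored schema isn't documented here.

```python
import json

# Hypothetical audit record for the first demo decision.
record = {
    "decision_id": "example-id",  # placeholder, not a real ID
    "input": "this game is so trash, the devs should be fired into the sun",
    "options": ["allow", "flag", "remove"],
    "runs": [
        {"model": "gpt-4o-mini", "label": "allow", "logprob": -0.31},
        {"model": "claude-3.5",  "label": "allow", "logprob": -0.28},
        {"model": "llama-3.3",   "label": "flag",  "logprob": -0.92},
        {"model": "deepseek-v3", "label": "allow", "logprob": -0.44},
    ],
    "verdict": "allow",
    "confidence": 0.82,
    "note": "1 model flagged -- logged for review",
}
print(json.dumps(record, indent=2))
```

Because every run and every log-prob is captured, re-running the aggregation over `runs` reproduces the verdict deterministically.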

use cases

Anywhere users generate content, you need moderation

01
Comment sections

Auto-moderate user comments on your blog, news site, or forum. Catch toxic comments without over-censoring legitimate criticism.

02
Chat messages

Real-time moderation for messaging apps, dating platforms, and gaming chat. Sub-second decisions at scale with human escalation for edge cases.

03
Product reviews

Catch fake reviews, hate speech, and spam across your marketplace. Preserve authentic negative reviews while removing genuinely abusive ones.

04
Support messages

Detect abusive language in customer support tickets. Route hostile messages to senior agents while keeping the queue moving.

05
Social posts

Moderate user-generated posts on your community platform. Handle context-dependent content that single models routinely get wrong.

06
Forum threads

Catch toxic behavior in discussion threads. Distinguish between heated debate and genuine harassment with multi-model consensus.

comparison

WhiteBox vs single-model moderation

Feature        Single-model solutions       WhiteBox
Models         1 proprietary model          4+ models voting
Confidence     Self-reported (unreliable)   Consensus-based (measured)
Edge cases     Silent failures              Flagged for human review
Audit trail    No                           Every decision logged
Categories     Fixed (their taxonomy)       Your categories, your rules
Human review   No built-in                  Built-in queue with SLA
Pricing        $0.002-0.01/call             $0.01/decision
playground

Try it. Paste a message, see the verdict.

whitebox sandbox · simulated client-side
models
4
median latency
0.8s
cost / decision
$0.01
audit retention
forever
pricing

$0.01 per moderation decision

20 free to start. No credit card.

That's 1,000 moderation decisions for $10.

free tier
20 decisions
per decision
$0.01
subscriptions
none
get a key
get started

Stop shipping silent moderation failures.

20 free decisions. Then $0.01 each. The audit trail starts the moment you install.

get a key · API docs