content moderation

Your moderation model is wrong 10% of the time. You just don't know which 10%.

WhiteBox runs every moderation decision through multiple AI models. When they agree, auto-moderate. When they disagree, send to a human. No more silent failures.
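The routing rule above -- agreement means auto-moderate, disagreement means a human sees it -- can be sketched in a few lines of Python. This is a minimal illustration of the idea, not the WhiteBox SDK; the `route` helper and its signature are hypothetical.

```python
from collections import Counter

def route(votes: list[str]) -> tuple[str, bool]:
    """Route one moderation decision.

    votes: one label per model, e.g. ["allow", "allow", "flag"].
    Returns (label, needs_human): unanimous votes auto-apply,
    any disagreement escalates to human review.
    """
    tally = Counter(votes)
    if len(tally) == 1:                       # all models agree
        return votes[0], False                # auto-moderate
    return tally.most_common(1)[0][0], True   # majority label, escalated

# The sarcastic post: one model reads self-harm, two read sarcasm.
label, needs_human = route(["allow", "allow", "self-harm"])
# The disagreement is surfaced instead of silently resolved.
```

The point is the second return value: a single model has no way to say "I might be wrong here"; a disagreeing panel does.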

the problem

The problem with single-model moderation

01
Sarcasm and context

User posts: "Oh great, another update that breaks everything. Just kill me now."

One model flags this as self-harm. Another recognizes sarcasm. WhiteBox catches the disagreement and routes it correctly.

02
Cultural and linguistic nuance

A product review in Spanglish, mixing English and Spanish slang.

One model misclassifies it. Two others get it right. WhiteBox goes with consensus instead of trusting a single confused model.

03
Borderline content

User posts a heated but legitimate political opinion.

Two models say "allow," one says "hate speech," one says "flag." WhiteBox escalates to a human with the full breakdown.

how it works

Multi-model consensus in action

whitebox moderation
auto-moderated
whitebox classify "this game is so trash, the devs should be fired into the sun"
options: ["allow", "flag", "remove"]
01  gpt-4o-mini   allow   logp -0.31
02  claude-3.5    allow   logp -0.28
03  llama-3.3     flag    logp -0.92
04  deepseek-v3   allow   logp -0.44
verdict
allow · confidence 82%
SHIP
note: 1 model flagged -- logged for review
whitebox moderation
auto-removed
whitebox classify "people like you don't belong here, go back where you came from"
options: ["allow", "flag", "remove"]
01  gpt-4o-mini   remove  logp -0.12
02  claude-3.5    remove  logp -0.08
03  llama-3.3     flag    logp -0.67
04  deepseek-v3   remove  logp -0.15
verdict
remove · confidence 91%
SHIP
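One plausible way to turn per-model votes and log-probs like the runs above into a verdict and a confidence score: weight each vote by exp(logprob), the model's probability for its chosen label, then take the winning label's share of total weight. This weighting is an illustrative guess, not WhiteBox's published formula.

```python
import math

def consensus(runs: list[tuple[str, float]]) -> tuple[str, float]:
    """Fold per-model (label, logprob) pairs into a verdict plus a
    confidence score. exp(logprob) converts each log-prob back into
    a probability, so more certain models carry more weight."""
    weights: dict[str, float] = {}
    for label, logp in runs:
        weights[label] = weights.get(label, 0.0) + math.exp(logp)
    verdict = max(weights, key=weights.get)   # highest total weight wins
    confidence = weights[verdict] / sum(weights.values())
    return verdict, confidence

# The four runs from the first demo above:
runs = [("allow", -0.31), ("allow", -0.28), ("flag", -0.92), ("allow", -0.44)]
verdict, confidence = consensus(runs)
# verdict is "allow"; confidence lands in the low-to-mid 80% range,
# close to (though not necessarily matching) the demo's 82%.
```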

Every run, every log-prob, every disagreement -- recorded. Replay any decision from its ID.
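A replayable decision record could look something like the structure below. Field names and the ID are placeholders for illustration; the actual stored schema isn't documented here.

```python
import json

# Hypothetical audit record for the first demo decision.
record = {
    "decision_id": "example-id",  # placeholder, not a real ID
    "input": "this game is so trash, the devs should be fired into the sun",
    "options": ["allow", "flag", "remove"],
    "runs": [
        {"model": "gpt-4o-mini", "label": "allow", "logprob": -0.31},
        {"model": "claude-3.5",  "label": "allow", "logprob": -0.28},
        {"model": "llama-3.3",   "label": "flag",  "logprob": -0.92},
        {"model": "deepseek-v3", "label": "allow", "logprob": -0.44},
    ],
    "verdict": "allow",
    "confidence": 0.82,
    "note": "1 model flagged -- logged for review",
}
print(json.dumps(record, indent=2))
```

Because every run and every log-prob is captured, re-running the aggregation over `runs` reproduces the verdict deterministically.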

use cases

Anywhere users generate content, you need moderation

01
Comment sections

Auto-moderate user comments on your blog, news site, or forum. Catch toxic comments without over-censoring legitimate criticism.

02
Chat messages

Real-time moderation for messaging apps, dating platforms, and gaming chat. Sub-second decisions at scale with human escalation for edge cases.

03
Product reviews

Catch fake reviews, hate speech, and spam across your marketplace. Preserve authentic negative reviews while removing genuinely abusive ones.

04
Support messages

Detect abusive language in customer support tickets. Route hostile messages to senior agents while keeping the queue moving.

05
Social posts

Moderate user-generated posts on your community platform. Handle context-dependent content that single models routinely get wrong.

06
Forum threads

Catch toxic behavior in discussion threads. Distinguish between heated debate and genuine harassment with multi-model consensus.

comparison

WhiteBox vs single-model moderation

Feature        Single-model solutions       WhiteBox
Models         1 proprietary model          4+ models voting
Confidence     Self-reported (unreliable)   Consensus-based (measured)
Edge cases     Silent failures              Flagged for human review
Audit trail    No                           Every decision logged
Categories     Fixed (their taxonomy)       Your categories, your rules
Human review   No built-in                  Built-in queue with SLA
Pricing        $0.002-0.01/call             $0.01/decision
playground

Try it. Paste a message, see the verdict.

whitebox sandbox · simulated client-side
models
4
median latency
0.8s
cost / decision
$0.01
audit retention
forever
pricing

$0.01 per moderation decision

20 free to start. No credit card.

That's 1,000 moderation decisions for $10.

free tier
20 decisions
per decision
$0.01
subscriptions
none
get a key
get started

Stop shipping silent moderation failures.

20 free decisions. Then $0.01 each. The audit trail starts the moment you install.

get a key · API docs