The $974 Question: Human Moderators Outperform AI in Brand Safety, But at 40x the Cost
Brands face an escalating and expensive battle: keeping their advertisements away from toxic online content that could irreparably damage their reputation. New research exposes the stark reality of this fight – human moderators remain significantly more accurate than cutting-edge AI, but their precision comes with a jaw-dropping cost multiplier.
A preprint study from researchers affiliated with brand safety company Zefr, accepted at the CVAM workshop (ICCV 2025), delivers a meticulous cost-benefit analysis. The team evaluated six leading multimodal large language models (MLLMs) – including variants of OpenAI's GPT-4o, Google's Gemini family (1.5-Flash, 2.0-Flash, 2.0-Flash-Lite), and Meta's Llama-3.2-11B-Vision – against human reviewers. They used a dataset of 1,500 videos categorized into high-risk areas: Drugs, Alcohol & Tobacco (DAT); Death, Injury & Military Conflict (DIMC); and Kid’s Content.
Performance: Humans Still Reign Supreme
The results, measured by precision (the share of flagged content that was actually violating), recall (the share of violating content that was caught), and the combined F1 score, were unambiguous:
| Model | Precision | Recall | F1 |
|---|---|---|---|
| GPT-4o | 0.94 | 0.83 | 0.87 |
| GPT-4o-mini | 0.92 | 0.85 | 0.88 |
| Gemini-1.5-Flash | 0.86 | 0.96 | 0.90 |
| Gemini-2.0-Flash | 0.84 | 0.98 | 0.91 |
| Gemini-2.0-Flash-Lite | 0.87 | 0.95 | 0.91 |
| Llama-3.2-11B-Vision | 0.87 | 0.86 | 0.86 |
| Human Reviewers | 0.98 | 0.97 | 0.98 |
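For readers who want to check the table against itself: F1 is the harmonic mean of precision and recall, so the last column follows directly from the first two. Below is a quick verification of the Gemini-2.0-Flash-Lite row – a derived sanity check, not an additional result from the paper:

```python
# F1 is the harmonic mean of precision and recall.
# Sanity-checking the Gemini-2.0-Flash-Lite row from the table above.
precision, recall = 0.87, 0.95
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 2))  # 0.91 -> matches the reported F1
```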
"These results underscore the effectiveness of MLLMs in automating content moderation but also highlight the continued superiority of human reviewers in accuracy, particularly in more complex or nuanced classifications where context and deep understanding are required," the researchers state. Google's Gemini models (especially the Flash variants) emerged as the top-performing AI, achieving F1 scores around 0.91 – impressive, but still a notable 7% below humans. Crucially, the study found that smaller, cheaper models often performed nearly as well as their larger, more expensive counterparts.
The Staggering Cost Differential
For budget-conscious brands, the cost analysis is where the picture becomes critical. The researchers calculated the expense of moderating the same 1,500-video dataset:
| Model | F1 | Cost (USD) |
|---|---|---|
| GPT-4o | 0.87 | $419 |
| GPT-4o-mini | 0.88 | $25 |
| Gemini-1.5-Flash | 0.90 | $28 |
| Gemini-2.0-Flash | 0.91 | $56 |
| Gemini-2.0-Flash-Lite | 0.91 | $28 |
| Llama-3.2-11B-Vision | 0.86 | $459 |
| Human Reviewers | 0.98 | $974 |
Human moderation, while achieving near-perfect accuracy (0.98 F1), costs a staggering $974 for the dataset – roughly 40 times more than the cheapest AI run (GPT-4o-mini at $25) and about 35 times more than the best-performing budget options (Gemini-1.5-Flash and Gemini-2.0-Flash-Lite at $28 each). This presents a brutal equation for advertisers: pay a massive premium for human-level nuance, or accept a small but meaningful drop in accuracy for substantial savings.
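Breaking those totals down per video makes the gap concrete; the figures below are derived from the table above, not quoted from the paper:

```python
# Per-video economics implied by the study's totals for the 1,500-video dataset.
videos = 1_500
human_total = 974       # total human-review cost from the table above
cheapest_ai_total = 25  # GPT-4o-mini, the cheapest model run

print(round(human_total / videos, 3))          # 0.649 -> about 65 cents per video
print(round(cheapest_ai_total / videos, 3))    # 0.017 -> under two cents per video
print(round(human_total / cheapest_ai_total))  # 39    -> the headline "40x" multiplier
```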
AI's Achilles' Heel: Context and Language
The study also pinpointed key weaknesses in the AI models:
1. Incorrect Associations: Models frequently made flawed connections, like flagging a Japanese-language video about caffeine addiction under the "Drugs" category because of the word "addiction," even though the context made clear no drug content was involved.
2. Lack of Contextual Nuance: Understanding subtle or complex scenarios where visual, audio, and text elements interact in non-obvious ways proved challenging.
3. Language Gaps: Performance notably degraded for non-English content, highlighting limitations in the models' multilingual training and comprehension.
"We showed that the compact MLLMs offer a significantly cheaper alternative compared to their larger counterparts without sacrificing accuracy," the authors conclude. "However, human reviewers remain superior in accuracy, particularly in complex or nuanced classifications."
The Hybrid Path Forward
This research doesn't advocate for abandoning AI; it underscores the need for strategic deployment. Jon Morra, Zefr's Chief AI Officer, emphasized the practical implication: "While multimodal large language models like Gemini and GPT can handle brand safety video moderation across text, audio and visuals with surprising accuracy and far lower costs than human reviewers alone, they still fall short on nuanced, context-heavy cases – making a hybrid human and AI approach the most effective and economical path forward."
For developers and tech leaders building or implementing brand safety solutions, the message is clear: leverage cost-effective AI models for the bulk of straightforward content screening, but reserve budget for human review and design systems that seamlessly escalate complex, ambiguous, or high-stakes cases to human experts (a minimal sketch of that routing follows below). The future of scalable, effective brand protection lies not in choosing between humans and machines, but in intelligently combining their strengths while managing their respective costs and weaknesses. The dataset and prompts from the study are available on GitHub.
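In practice, that hybrid pattern often reduces to a confidence-threshold router: let a cheap model handle everything, and hand off anything it is unsure about to a human. The sketch below is illustrative only – the function names and the 0.85 threshold are hypothetical placeholders, not part of the study:

```python
import random

CONFIDENCE_THRESHOLD = 0.85  # assumed operating point, not a value from the study

def classify_with_mllm(video):
    """Hypothetical stand-in for a call to a cheap MLLM (e.g. a Flash-class model)."""
    # A real implementation would send frames, audio transcript, and metadata to the model.
    return "DIMC", random.random()

def send_to_human_review(video, provisional_label):
    """Hypothetical stand-in for queueing the video for a human moderator."""
    return f"escalated:{provisional_label}"

def moderate(video):
    label, confidence = classify_with_mllm(video)  # cheap AI first pass
    if confidence >= CONFIDENCE_THRESHOLD:
        return label                               # accept the AI verdict
    return send_to_human_review(video, label)      # escalate ambiguous, low-confidence cases
```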
Source: The Register: Humans make better content cops than AI, but cost 40x more