Sovereign Moderation
Content Safety Without Data Leaving Your Infrastructure
Most AI platforms send your content to OpenAI for moderation. OpenAI provides this for free, deliberately — they want it adopted as the industry standard. The trade-off: your data leaves your infrastructure. Sovereign moderation eliminates that trade-off.
Why it exists
LLM providers like Anthropic and OpenAI already include built-in safety filters — RLHF training, real-time classifiers, output moderation. Your AI agents are not running unfiltered models.
On top of model-level safety, OpenAI offers a free Moderation API — a separate classification endpoint with no per-call charge for API users. Content sent to the API is not used for training (since March 2023), but is retained in abuse monitoring logs for up to 30 days by default.
For many organizations, combining built-in LLM filters with OpenAI's Moderation API is sufficient. But for EU enterprises under GDPR, regulated industries, or any organization with policies that prohibit sending content to external services — even for classification — this creates a compliance gap. Sovereign moderation closes it.
MeetLoyd's sovereign mode runs classification entirely on your infrastructure using a BERT-based model (Detoxify) — no external network calls, no GPU required, near-zero marginal cost. It also adds what vendor-built filters don't provide: configurable thresholds per governance pack, a full audit trail with per-category scores, consistent policy enforcement across all LLM providers, and complete transparency for compliance teams.
How it works
MeetLoyd offers two moderation modes:
- Standard — OpenAI Moderation API. Free, proven, content leaves your infrastructure.
- Sovereign — Detoxify BERT classifier. Self-hosted, nothing leaves, category-aware thresholds tuned for business content.
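The two modes can be pictured as a simple dispatcher. This is an illustrative sketch, not MeetLoyd's actual API: the function names and the `ModerationResult` shape are assumptions, and both backends are stubbed where real code would call OpenAI's Moderation API (standard) or run Detoxify locally (sovereign).

```python
# Minimal two-mode dispatcher sketch (illustrative, backends stubbed).
from dataclasses import dataclass, field

@dataclass
class ModerationResult:
    flagged: bool
    scores: dict = field(default_factory=dict)  # per-category scores
    mode: str = "sovereign"                     # which backend decided

def check_standard(text: str) -> ModerationResult:
    # Stub: a real implementation would POST to OpenAI's Moderation API,
    # meaning content leaves your infrastructure.
    return ModerationResult(False, {}, "standard")

def check_sovereign(text: str) -> ModerationResult:
    # Stub: a real implementation would run Detoxify BERT inference on
    # local CPU — no network call, nothing leaves.
    return ModerationResult(False, {}, "sovereign")

def moderate(text: str, mode: str = "sovereign") -> ModerationResult:
    backend = check_standard if mode == "standard" else check_sovereign
    return backend(text)

result = moderate("quarterly forecast looks weak")
# → result.mode == "sovereign"
```

The key property the dispatcher makes visible: the calling code is identical in both modes, so switching a tenant to sovereign mode is a configuration change, not a code change.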
Sovereign mode evaluates six content categories independently: toxicity, severe toxicity, obscene language, threats, insults, and identity attacks. Thresholds are set higher for categories that commonly trigger on legitimate business language — sales negotiations, legal terminology, medical discussions — to minimize false positives without compromising safety.
Every moderation decision is logged with per-category scores, creating an audit trail for compliance teams to review post-hoc — no real-time human review required.
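The per-category thresholds and audit logging described above can be sketched as follows. The threshold values and log format here are illustrative assumptions — real deployments tune them per governance pack — and the `scores` dict stands in for classifier output.

```python
# Sketch: category-aware thresholds plus per-decision audit logging.
# Threshold values are illustrative, not MeetLoyd's shipped defaults.
import json
import logging

# Categories that commonly fire on legitimate business language
# (negotiations, legal/medical terminology) get higher thresholds.
BLOCK_THRESHOLDS = {
    "toxicity": 0.80,
    "severe_toxicity": 0.50,
    "obscene": 0.85,
    "threat": 0.60,
    "insult": 0.80,
    "identity_attack": 0.60,
}

audit_log = logging.getLogger("moderation.audit")

def evaluate(scores: dict) -> tuple:
    """Return (blocked, offending_categories); log all scores for post-hoc review."""
    offending = [c for c, s in scores.items() if s >= BLOCK_THRESHOLDS[c]]
    blocked = bool(offending)
    # Every decision carries per-category scores into the audit trail,
    # so compliance teams can review without real-time human involvement.
    audit_log.info(json.dumps(
        {"blocked": blocked, "categories": offending, "scores": scores}))
    return blocked, offending

# Hypothetical classifier output for one message:
scores = {"toxicity": 0.42, "severe_toxicity": 0.01, "obscene": 0.03,
          "threat": 0.70, "insult": 0.12, "identity_attack": 0.02}
blocked, cats = evaluate(scores)
# threat 0.70 >= 0.60 → blocked is True, cats == ["threat"]
```

Because each category is checked independently against its own threshold, a moderately elevated toxicity score from a tense sales negotiation does not block the message, while a clear threat score does.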
LLM escalation (optional)
For borderline content — scored by Detoxify above a review threshold but below the block threshold — enterprises can enable LLM escalation. Llama Guard 3 re-classifies the content on self-hosted vLLM infrastructure, providing contextual understanding and natural language reasoning. If all LLM endpoints fail, the system falls back gracefully to the Detoxify-only verdict, so an LLM outage never interrupts business.
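The graceful-degradation logic amounts to trying each configured endpoint in order and keeping the Detoxify verdict if none responds. A minimal sketch, in which the endpoint names and the `classify_with_llm` callable are hypothetical stand-ins for the tenant's configured chain:

```python
# Sketch of the escalation fallback chain: try each self-hosted LLM
# endpoint in order; if every one fails, keep the Detoxify-only verdict
# so moderation never stalls on LLM availability.

def escalate(text, detoxify_decision, llm_endpoints, classify_with_llm):
    """Return (decision, source) for borderline content."""
    for endpoint in llm_endpoints:  # e.g. primary vLLM, then fallbacks
        try:
            return classify_with_llm(endpoint, text), endpoint
        except Exception:
            continue  # endpoint unreachable; try the next one
    # Graceful degradation: Detoxify's verdict stands, and the source
    # is recorded so the audit trail shows no LLM was consulted.
    return detoxify_decision, "detoxify-only"

# Usage with stub endpoints that are all "down", to show the fallback:
def always_down(endpoint, text):
    raise ConnectionError(endpoint)

decision, source = escalate("some borderline text", "allow",
                            ["primary-vllm", "fallback-vllm"], always_down)
# → decision == "allow", source == "detoxify-only"
```

Recording the source of each verdict alongside the decision is what lets compliance teams distinguish "cleared by Llama Guard with reasoning" from "cleared by Detoxify because the LLM tier was unavailable".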
How MeetLoyd implements sovereign moderation
- Standard mode — OpenAI Moderation API, free for all tiers. Content leaves infrastructure.
- Sovereign mode — Detoxify BERT on CPU. Zero external calls. ~10ms latency. 6 category-aware thresholds.
- LLM escalation — Llama Guard 3 on vLLM. Borderline re-classification with reasoning. Token-metered from prepaid account.
- Fallback chain — Tenant-configurable: primary vLLM → fallback vLLM → fallback LLM → Detoxify-only + audit.
- EU AI Act — Both modes satisfy Article 14. Sovereign + LLM adds explainable safety decisions.
- Near-zero carbon — CPU inference at ~0.1W. No GPU required for base sovereign mode.