PVP
Probabilistic Verification Protocol
Mathematical compliance verification for AI agents. Multi-LLM cross-checking with Wilson Score confidence intervals. Four tiers of increasing rigor -- from self-critique to full adversarial consensus. Patent pending.
What is PVP?
AI agents make mistakes. In regulated industries -- finance, healthcare, legal -- a mistake isn't just an inconvenience; it can be a compliance violation. PVP (Probabilistic Verification Protocol) addresses this by applying mathematical verification to AI outputs before they reach the real world.
Instead of trusting a single LLM's output, PVP uses multiple LLMs as independent judges, then applies statistical methods (Wilson Score confidence intervals) to determine whether the output meets compliance requirements. The result isn't a binary "pass/fail" but a confidence score with a lower bound -- a conservative estimate of the probability that the output is compliant, which accounts for the number of judges and how strongly they agree.
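For reference, the standard Wilson score interval for a binomial proportion is written out below. Here \hat{p} is the fraction of judge verdicts that rate the output compliant, n is the number of verdicts, and z is the normal quantile (about 1.96 for a 95% interval). Treating each judge as one binary vote is an assumption made for illustration; this page does not specify how PVP aggregates verdicts.

```latex
\[
\frac{\hat{p} + \dfrac{z^{2}}{2n} \;\pm\; z\sqrt{\dfrac{\hat{p}\,(1-\hat{p})}{n} + \dfrac{z^{2}}{4n^{2}}}}
     {1 + \dfrac{z^{2}}{n}}
\]
```

The lower bound is the value obtained with the minus sign.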
The four tiers
PVP scales verification rigor based on the risk level of the operation. A routine email draft gets Tier 1. A financial trade recommendation gets Tier 4.
Wilson Score confidence
PVP doesn't use simple majority voting. It uses Wilson Score confidence intervals -- the same statistical method used in A/B testing and medical trials. This matters because:
- Small sample correction -- With only 2-3 judges (LLMs), naive percentages are misleading. Wilson Score accounts for small sample sizes.
- Lower bound guarantee -- PVP reports the lower bound of the 95% confidence interval. This is a conservative estimate of the probability of compliance, not the observed average.
- Quantified uncertainty -- Instead of "2 out of 3 judges agree (67%)", PVP reports "Wilson lower bound: 0.21" -- meaning we're 95% confident compliance probability is at least 21%. This honest uncertainty drives the escalation decision.
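The points above can be illustrated with a minimal sketch of the Wilson lower bound. It assumes each judge casts a single binary compliant/non-compliant vote. The function name and signature are hypothetical and are not taken from PVP itself.

```python
import math


def wilson_lower_bound(compliant_votes: int, total_votes: int, z: float = 1.96) -> float:
    """Lower bound of the Wilson score interval for a binomial proportion.

    Assumption: one binary vote per judge. z = 1.96 corresponds to a
    two-sided 95% interval.
    """
    if total_votes <= 0:
        raise ValueError("total_votes must be positive")
    if not 0 <= compliant_votes <= total_votes:
        raise ValueError("compliant_votes must be between 0 and total_votes")

    p_hat = compliant_votes / total_votes
    z2 = z * z
    centre = p_hat + z2 / (2 * total_votes)
    margin = z * math.sqrt(
        p_hat * (1 - p_hat) / total_votes + z2 / (4 * total_votes ** 2)
    )
    return (centre - margin) / (1 + z2 / total_votes)


# Two of three judges agree: the naive rate is 67%, the lower bound is about 0.21.
print(round(wilson_lower_bound(2, 3), 2))
```

With so few votes the bound sits far below the naive percentage, which is the small-sample correction described above.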
Why multi-LLM?
A single LLM has systematic biases. Claude might be overly cautious about legal risk. GPT might miss European regulatory nuances. By using models from different providers (Anthropic, OpenAI, Google) as independent judges, PVP diversifies the failure modes. If two models from different providers with different training data both agree an output is compliant, a false positive is much less likely than with either model alone, because both would have to make the same error.
Provider independence
PVP requires judges from different providers, not just different models. Claude Sonnet judging Claude Opus isn't independent verification -- they share training methodology and biases. Claude judging GPT-4o is independent. This is analogous to the "Big 4" audit principle: you don't audit yourself.
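A minimal sketch of how a judge panel could be checked against this rule. The `Model` type and `independent_judges` function are hypothetical names introduced for illustration. The sketch also assumes the stricter reading, in which judges must differ from each other as well as from the primary model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    provider: str  # e.g. "anthropic", "openai", "google"
    name: str      # illustrative label only, e.g. "claude-opus"


def independent_judges(primary: Model, judges: list[Model]) -> bool:
    """Return True if the panel is non-empty and every model comes from a distinct provider.

    Assumption: independence is decided by provider name alone.
    """
    providers = [primary.provider.lower()] + [j.provider.lower() for j in judges]
    return len(judges) > 0 and len(set(providers)) == len(providers)


primary = Model("anthropic", "claude-opus")
print(independent_judges(primary, [Model("anthropic", "claude-sonnet")]))  # False
print(independent_judges(primary, [Model("openai", "gpt-4o")]))            # True
```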
How MeetLoyd implements PVP
PVP is production-deployed on MeetLoyd and deeply integrated with the governance system:
- Automatic tier selection -- Governance Packs set the PVP tier per action type. HIPAA pack sets Tier 3 for patient data decisions. SOX pack sets Tier 4 for financial approvals.
- Wilson Score thresholds -- Configurable per pack. HIPAA default: lower bound must exceed 0.85. SOX default: must exceed 0.90. Below threshold triggers automatic human escalation.
- BYOK-compatible -- PVP works with your LLM keys. Your Anthropic key for the primary agent, your OpenAI key for the judge. No data leaves your key perimeter.
- Audit integration -- Every PVP verification is logged with the full judge panel, individual scores, Wilson calculation, and final decision. SOX-grade audit trail.
- Cost-aware -- Tier 1 adds ~2% cost (same model re-evaluation). Tier 4 adds ~150% cost (3 independent models). Costs are tracked per verification and visible in the Compliance Cockpit.
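The tier and threshold settings above can be pictured as a small policy table plus an escalation check. This is a sketch only: the tier numbers and thresholds are the defaults quoted above, while the pack keys, action names, `PackPolicy`, and `decide` are hypothetical and do not represent MeetLoyd's actual configuration format or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PackPolicy:
    tier: int               # PVP tier, 1 (lightest) to 4 (most rigorous)
    min_lower_bound: float  # Wilson lower bound the verification must exceed


# Hypothetical lookup table keyed by (governance pack, action type).
POLICIES = {
    ("hipaa", "patient_data_decision"): PackPolicy(tier=3, min_lower_bound=0.85),
    ("sox", "financial_approval"): PackPolicy(tier=4, min_lower_bound=0.90),
}


def decide(pack: str, action: str, wilson_lower: float) -> str:
    """Approve only if the lower bound strictly exceeds the pack threshold."""
    policy = POLICIES[(pack, action)]
    if wilson_lower > policy.min_lower_bound:
        return "approve"
    return "escalate_to_human"


print(decide("hipaa", "patient_data_decision", 0.86))  # approve
print(decide("sox", "financial_approval", 0.86))       # escalate_to_human
```

The same lower bound passes one pack and escalates under the other, which is the effect of keeping thresholds configurable per pack.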
PVP vs simple guardrails
Most AI platforms offer keyword-based guardrails: "don't say X", "block topic Y." These are brittle, easily bypassed, and prone to false positives. PVP is fundamentally different:
- Guardrails -- Pattern matching. Blocks "credit card number" but misses "the sixteen digits on the front of the card."
- PVP -- Semantic verification. An independent LLM evaluates whether the output violates compliance requirements in meaning, not just keyword pattern.
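The guardrail weakness described above is easy to reproduce. The sketch below uses a hypothetical one-phrase blocklist, which is not any specific platform's guardrail. The semantic check is not shown, since it depends on an LLM judge.

```python
import re

# Hypothetical keyword guardrail: blocks text containing the literal phrase.
BLOCKLIST = re.compile(r"credit card number", re.IGNORECASE)


def keyword_guardrail_blocks(text: str) -> bool:
    """Return True if the text matches the blocklist pattern."""
    return BLOCKLIST.search(text) is not None


direct = "Please send me your credit card number."
paraphrase = "Please send me the sixteen digits on the front of the card."

print(keyword_guardrail_blocks(direct))      # True: the literal phrase is caught
print(keyword_guardrail_blocks(paraphrase))  # False: same request, no keyword match
```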