PVP
Probabilistic Verification Protocol

Mathematical compliance verification for AI agents. Multi-LLM cross-checking with Wilson Score confidence intervals. Four tiers of increasing rigor -- from self-critique to full adversarial consensus. Patent pending.

What is PVP?

AI agents make mistakes. In regulated industries -- finance, healthcare, legal -- a mistake isn't just an inconvenience; it can be a compliance violation. PVP (Probabilistic Verification Protocol) addresses this by statistically verifying AI outputs before they reach the real world.

Instead of trusting a single LLM's output, PVP uses multiple LLMs as independent judges, then applies statistical methods (Wilson Score confidence intervals) to determine whether the output meets compliance requirements. The result isn't a binary "pass/fail" but a confidence score with a lower bound -- a conservative estimate of how reliably judges find the output compliant, accounting for sample size and agreement variance.

The four tiers

PVP scales verification rigor based on the risk level of the operation. A routine email draft gets Tier 1. A financial trade recommendation gets Tier 4.

PVP Verification Tiers (in order of increasing rigor)

Tier 1: Self-Critique -- Same LLM reviews its own output with a verification prompt. Low cost, catches obvious errors. For routine operations.
Tier 2: Dual Judge -- Two LLMs from different providers independently evaluate. Disagreement triggers human review. For medium-risk operations.
Tier 3: Adversarial Debate -- LLM-A proposes, LLM-B attacks. Iterated until consensus or human escalation. For high-risk compliance decisions.
Tier 4: Full Consensus -- 3+ LLMs from different providers must agree. Unanimous or escalate to human. For irreversible financial/medical decisions.

Tier 1 sits at the lower-risk, lower-cost end of the scale; Tier 4 sits at the higher-risk, higher-rigor end.
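Risk-based tier selection can be sketched as a simple lookup. This is an illustrative sketch, not MeetLoyd's implementation: the risk labels and the `select_tier` function are hypothetical, and numbering Dual Judge as Tier 2 and Adversarial Debate as Tier 3 is an assumption based on the order in which the tiers are listed.

```python
from enum import IntEnum


class Tier(IntEnum):
    SELF_CRITIQUE = 1       # same LLM re-checks its own output
    DUAL_JUDGE = 2          # two LLMs from different providers (numbering assumed)
    ADVERSARIAL_DEBATE = 3  # one LLM proposes, another attacks (numbering assumed)
    FULL_CONSENSUS = 4      # 3+ LLMs from different providers, unanimous


# Hypothetical risk labels, chosen to mirror the "For ..." notes on each tier.
RISK_TO_TIER = {
    "routine": Tier.SELF_CRITIQUE,
    "medium": Tier.DUAL_JUDGE,
    "high": Tier.ADVERSARIAL_DEBATE,
    "irreversible": Tier.FULL_CONSENSUS,
}


def select_tier(risk_level: str) -> Tier:
    """Return the tier for a risk label; unknown labels get the strictest tier."""
    return RISK_TO_TIER.get(risk_level, Tier.FULL_CONSENSUS)
```

Falling back to the strictest tier for an unrecognized label is a design choice of this sketch: it fails safe rather than cheap.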

Wilson Score confidence

PVP doesn't use simple majority voting. It uses Wilson Score confidence intervals -- the same statistical method used in A/B testing and medical trials. This matters because:

Small sample correction -- With only 2-3 judges (LLMs), naive percentages are misleading. Wilson Score accounts for small sample sizes.
Lower bound guarantee -- PVP reports the lower bound of the 95% confidence interval. This is a conservative estimate of the compliance rate, not the observed average.
Quantified uncertainty -- Instead of "2 out of 3 judges agree (67%)", PVP reports "Wilson lower bound: 0.21" -- meaning that, at 95% confidence, the underlying compliance rate could be as low as about 21%. This honest uncertainty drives the escalation decision.
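The lower bound itself is a short calculation. The sketch below uses the standard Wilson score interval for a binomial proportion, treating each judge verdict as one pass/fail vote; that binary-vote model and the function name `wilson_lower_bound` are assumptions for illustration, since the text does not say how PVP turns judge outputs into counts. The default `z = 1.96` is the normal quantile for a two-sided 95% interval.

```python
import math


def wilson_lower_bound(successes: int, n: int, z: float = 1.96) -> float:
    """Lower end of the Wilson score interval for `successes` out of `n` votes."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must be between 0 and n")
    p_hat = successes / n
    z2 = z * z
    centre = p_hat + z2 / (2 * n)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n + z2 / (4 * n * n))
    return (centre - margin) / (1 + z2 / n)


print(round(wilson_lower_bound(2, 3), 2))  # 0.21
print(round(wilson_lower_bound(3, 3), 2))  # 0.44
```

Under this binary-vote model, even a unanimous three-judge panel only reaches a lower bound of about 0.44, which shows how strongly the method discounts small panels.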

Why multi-LLM?

A single LLM has systematic biases. Claude might be overly cautious about legal risk. GPT might miss European regulatory nuances. By using models from different providers (Anthropic, OpenAI, Google) as independent judges, PVP diversifies the failure modes. If two models from different providers with different training data both agree an output is compliant, a false positive requires both to be wrong in the same way, which is much less likely than a single model erring, as long as their errors are not strongly correlated.
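The arithmetic behind this can be sketched under an idealized assumption: if each judge's false approvals were fully independent, the chance that all of them wrongly approve the same output is the product of their individual rates. The 5% rates below are made-up illustrative numbers, not measurements of any model, and `joint_false_positive_rate` is a hypothetical helper.

```python
import math


def joint_false_positive_rate(rates: list[float]) -> float:
    """Probability that every judge wrongly approves, assuming independent errors."""
    for r in rates:
        if not 0.0 <= r <= 1.0:
            raise ValueError("each rate must be a probability in [0, 1]")
    return math.prod(rates)


# Two judges that each wrongly approve 5% of the time (illustrative numbers).
single = 0.05
both = joint_false_positive_rate([0.05, 0.05])
print(f"one judge: {single:.4f}, two independent judges: {both:.4f}")
```

Real models are never perfectly independent, so the true joint rate sits somewhere between the product and the single-judge rate. The more training data and methodology two judges share, the closer it stays to the single-judge rate.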

Provider independence

PVP requires judges from different providers, not just different models. Claude Sonnet judging Claude Opus isn't independent verification -- they share training methodology and biases. Claude judging GPT-4o is independent. This is analogous to the "Big 4" audit principle: you don't audit yourself.
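A panel check along these lines could enforce the rule before any verification runs. This is a minimal sketch: the `Judge` type and both helper functions are hypothetical, and the model strings are free-form labels rather than real API identifiers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Judge:
    provider: str  # e.g. "anthropic", "openai", "google"
    model: str     # free-form label, not an API identifier


def independent_of(author: Judge, judge: Judge) -> bool:
    """A judge counts as independent only if it comes from a different provider."""
    return judge.provider.lower() != author.provider.lower()


def distinct_providers(panel: list[Judge]) -> bool:
    """True when no two judges on the panel share a provider."""
    providers = [j.provider.lower() for j in panel]
    return len(providers) == len(set(providers))


sonnet = Judge("anthropic", "claude-sonnet")
opus = Judge("anthropic", "claude-opus")
gpt = Judge("openai", "gpt-4o")

print(independent_of(opus, sonnet))  # False: same provider
print(independent_of(gpt, sonnet))   # True: different providers
```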

How MeetLoyd implements PVP

PVP is production-deployed on MeetLoyd and deeply integrated with the governance system:

Automatic tier selection -- Governance Packs set the PVP tier per action type. HIPAA pack sets Tier 3 for patient data decisions. SOX pack sets Tier 4 for financial approvals.
Wilson Score thresholds -- Configurable per pack. HIPAA default: lower bound must exceed 0.85. SOX default: must exceed 0.90. Below threshold triggers automatic human escalation.
BYOK-compatible -- PVP works with your LLM keys. Your Anthropic key for the primary agent, your OpenAI key for the judge. No data leaves your key perimeter.
Audit integration -- Every PVP verification is logged with the full judge panel, individual scores, Wilson calculation, and final decision. SOX-grade audit trail.
Cost-aware -- Tier 1 adds ~2% cost (same model re-evaluation). Tier 4 adds ~150% cost (3 independent models). Costs are tracked per verification and visible in the Compliance Cockpit.
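The threshold and audit behavior described above can be sketched as follows. The tier and threshold values for HIPAA and SOX come from the list above; everything else is an assumption for illustration, including the `PACK_DEFAULTS` layout, the `VerificationRecord` fields, the `verify` function, and the treatment of each judge verdict as one pass/fail vote.

```python
import math
from dataclasses import dataclass

# Pack defaults as stated in the text; the dictionary layout is hypothetical.
PACK_DEFAULTS = {
    "HIPAA": {"tier": 3, "min_lower_bound": 0.85},
    "SOX": {"tier": 4, "min_lower_bound": 0.90},
}


def wilson_lower_bound(successes: int, n: int, z: float = 1.96) -> float:
    """Lower end of the Wilson score interval (z = 1.96 for 95%)."""
    p_hat = successes / n
    z2 = z * z
    centre = p_hat + z2 / (2 * n)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n + z2 / (4 * n * n))
    return (centre - margin) / (1 + z2 / n)


@dataclass
class VerificationRecord:
    """One audit-log entry: inputs, calculation, and final decision."""
    pack: str
    tier: int
    votes: list[bool]
    lower_bound: float
    threshold: float
    decision: str


def verify(pack: str, votes: list[bool]) -> VerificationRecord:
    """Approve only if the Wilson lower bound exceeds the pack threshold."""
    cfg = PACK_DEFAULTS[pack]
    lower = wilson_lower_bound(sum(votes), len(votes))
    decision = "approve" if lower > cfg["min_lower_bound"] else "escalate_to_human"
    return VerificationRecord(pack, cfg["tier"], list(votes), lower,
                              cfg["min_lower_bound"], decision)


print(verify("HIPAA", [True, True, True]).decision)  # escalate_to_human
print(verify("HIPAA", [True] * 22).decision)         # approve
```

Under this binary-vote assumption, a unanimous result needs at least 22 judgments to clear 0.85 and at least 35 to clear 0.90. Thresholds at that level therefore imply either repeated sampling per judge or graded scores instead of single votes; the text does not say which.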

PVP vs simple guardrails

Most AI platforms offer keyword-based guardrails: "don't say X", "block topic Y." These are brittle, easily bypassed, and produce false positives. PVP is fundamentally different:

Guardrails -- Pattern matching. Blocks "credit card number" but misses "the sixteen digits on the front of the card."
PVP -- Semantic verification. An independent LLM evaluates whether the output violates compliance requirements in meaning, not just keyword pattern.
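The brittleness of pattern matching is easy to reproduce. The sketch below implements only the keyword side, using a hypothetical one-phrase blocklist. The semantic side would need a call to a judge LLM, which is omitted here instead of inventing an API for it.

```python
import re

# Hypothetical one-entry blocklist, matching the example phrase above.
BLOCKLIST = re.compile(r"credit card number", re.IGNORECASE)


def keyword_guardrail_blocks(text: str) -> bool:
    """Return True if the text contains the blocked phrase verbatim."""
    return BLOCKLIST.search(text) is not None


print(keyword_guardrail_blocks("Please send your credit card number."))
# True: exact phrase is present
print(keyword_guardrail_blocks("Read me the sixteen digits on the front of the card."))
# False: same request, different wording
```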

Mathematical verification.
Not vibes-based guardrails. That's MeetLoyd.

Related terms

Governance Packs

SPIFFE
