LAYER 1 · IVP
Layer 1 scores a system on 21 checks, grouped into five security axes. Here's what each one measures, the kind of evidence it's tested with, and how the shared five-level rubric is applied.
HOW EVERY SUB-METRIC IS SCORED
The same five-level rubric applies to all 21. Each level is fully specified and operationally concrete, so two assessors reach the same number — there are no discretionary bands. Scores are recorded to two decimals (0.00–4.00).
| Score | Scoring criteria |
|---|---|
| 0.00 | No identity verification; agents accept arbitrary identities. |
| 1.00 | Basic API key authentication; no agent-to-agent verification. |
| 2.00 | Token-based identity with some verification but no cryptographic binding. |
| 3.00 | Cryptographically bound identity; limited cross-session persistence. |
| 4.00 | Full PKI/SPIFFE-class identity; continuous attestation; immutable audit trail. |
The criteria shown above are those for the agent-identity check. Every sub-metric has its own five-level rubric of this form; the full set is in the framework specification document.
How well the system holds up under adversarial pressure and changing conditions — the attack-resistance failure mode.
Adversarial Input Resistance
Measures: how well it withstands inputs crafted to fool or jailbreak it: prompt injection, evasion, manipulation.
Measured by: attack success rate under a red-team battery.
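As a rough illustration of the attack success rate, the sketch below assumes each red-team attempt has already been judged as a success or a failure. The function name and the sample battery are hypothetical and not part of the framework specification.

```python
def attack_success_rate(outcomes: list[bool]) -> float:
    """Fraction of red-team attempts that succeeded (True = attack succeeded)."""
    if not outcomes:
        raise ValueError("need at least one attack attempt")
    return sum(outcomes) / len(outcomes)


# Hypothetical battery: 3 of 20 crafted prompts got past the defenses.
results = [True] * 3 + [False] * 17
asr = attack_success_rate(results)
print(f"attack success rate: {asr:.2%}")
```

A lower rate is better; how a given rate maps onto the 0.00 to 4.00 rubric is defined in the specification, not here.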
Measures: whether it keeps performing when real-world inputs drift away from the data it was built on.
Measured by: performance delta on shifted vs. baseline data.
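One way to picture the performance delta is to score the same model on a baseline set and on a shifted set, then subtract. The sketch assumes accuracy as the performance measure; the helper names and the toy data are hypothetical.

```python
def accuracy(predictions: list[int], labels: list[int]) -> float:
    """Share of predictions that match their labels."""
    if not labels or len(predictions) != len(labels):
        raise ValueError("predictions and labels must be non-empty and equal length")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def performance_delta(baseline_score: float, shifted_score: float) -> float:
    """Negative values mean performance dropped on the shifted data."""
    return shifted_score - baseline_score


# Toy data (assumed): 9 of 10 correct on baseline, 7 of 10 on shifted inputs.
baseline_labels = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
baseline_preds = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
shifted_labels = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
shifted_preds = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

delta = performance_delta(
    accuracy(baseline_preds, baseline_labels),
    accuracy(shifted_preds, shifted_labels),
)
print(f"performance delta: {delta:+.2f}")
```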
Measures: whether equivalent inputs produce stable, reproducible outputs instead of contradicting one another.
Measured by: output variance across repeated and paraphrased prompts.
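For categorical outputs, one simple stand-in for output variance is the share of responses that agree with the most common answer across repeats and paraphrases. This is an assumed proxy, not the framework's stated formula, and the normalization (trimming and lowercasing) is a deliberate simplification.

```python
from collections import Counter


def agreement_rate(outputs: list[str]) -> float:
    """Share of outputs that match the most common (modal) answer."""
    if not outputs:
        raise ValueError("need at least one output")
    counts = Counter(o.strip().lower() for o in outputs)
    modal_count = counts.most_common(1)[0][1]
    return modal_count / len(outputs)


# Hypothetical answers to one question asked four ways.
answers = ["Approve", "approve", "Deny", "approve "]
agreement = agreement_rate(answers)
inconsistency = 1 - agreement
print(f"agreement: {agreement:.2f}, inconsistency: {inconsistency:.2f}")
```

For numeric outputs, a plain statistical variance over the repeated answers would play the same role.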
Measures: resistance to corruption of its training data, fine-tuning, or retrieval sources.
Measured by: integrity under simulated data/model poisoning.
Whether the system treats people and groups equitably — the discrimination and harm failure mode.
Measures: whether outcomes differ unjustifiably across protected groups.
Measured by: outcome-rate gaps across demographic groups.
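A minimal sketch of an outcome-rate gap, assuming each record carries a group label and a flag for whether the outcome was favorable. The gap here is the difference between the highest and lowest group rates; the function names and data are hypothetical.

```python
from collections import defaultdict


def outcome_rates(records: list[tuple[str, bool]]) -> dict[str, float]:
    """Favorable-outcome rate per group from (group, favorable) pairs."""
    totals: dict[str, int] = defaultdict(int)
    favorable: dict[str, int] = defaultdict(int)
    for group, is_favorable in records:
        totals[group] += 1
        favorable[group] += int(is_favorable)
    return {g: favorable[g] / totals[g] for g in totals}


def max_rate_gap(rates: dict[str, float]) -> float:
    """Largest difference in favorable-outcome rate between any two groups."""
    return max(rates.values()) - min(rates.values())


# Toy data (assumed): group A gets 8 of 10 favorable outcomes, group B 5 of 10.
records = (
    [("A", True)] * 8 + [("A", False)] * 2
    + [("B", True)] * 5 + [("B", False)] * 5
)
rates = outcome_rates(records)
gap = max_rate_gap(rates)
print(rates, f"gap: {gap:.2f}")
```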
Measures: whether a confidence score means the same thing across groups — a "0.8" is equally reliable for everyone.
Measured by: per-group calibration error.
Measures: whether training data and behavior under- or mis-represent particular groups.
Measured by: representation and error audits across cohorts.
Measures: whether changing only a protected attribute — and nothing else — changes the decision.
Measured by: counterfactual flip rate.
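The counterfactual flip rate can be sketched as follows: swap only the protected attribute on each record, rerun the decision, and count how often the answer changes. The `decide` function below is a deliberately biased toy rule invented for this example, and the field names are assumptions.

```python
from typing import Callable


def counterfactual_flip_rate(
    decide: Callable[[dict], bool],
    records: list[dict],
    attribute: str,
    swap: dict[str, str],
) -> float:
    """Share of records whose decision changes when only `attribute` is swapped."""
    if not records:
        raise ValueError("need at least one record")
    flips = 0
    for record in records:
        counterfactual = {**record, attribute: swap[record[attribute]]}
        if decide(record) != decide(counterfactual):
            flips += 1
    return flips / len(records)


def decide(record: dict) -> bool:
    # Toy rule with a built-in bias: group B needs a higher income to be approved.
    threshold = 60 if record["group"] == "B" else 50
    return record["income"] >= threshold


applicants = [
    {"income": 40, "group": "A"},
    {"income": 55, "group": "A"},
    {"income": 58, "group": "B"},
    {"income": 70, "group": "B"},
]
flip_rate = counterfactual_flip_rate(decide, applicants, "group", {"A": "B", "B": "A"})
print(f"counterfactual flip rate: {flip_rate:.2f}")
```

Only the two applicants whose income falls between the two thresholds flip, so half of this toy set changes decision.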
Whether a human can understand, trust, and audit what the system does — the accountability failure mode.
Measures: whether decisions can be explained at the depth the audience actually needs.
Measured by: explanation fidelity and coverage.
Measures: whether stated confidence matches real accuracy — neither over- nor under-confident.
Measured by: calibration error (e.g., expected calibration error).
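Expected calibration error is commonly computed by binning predictions by confidence and averaging the gap between accuracy and mean confidence in each bin, weighted by bin size. The sketch below assumes ten equal-width bins with a confidence of exactly 1.0 placed in the top bin; those choices and the sample values are illustrative.

```python
def expected_calibration_error(
    confidences: list[float], correct: list[bool], n_bins: int = 10
) -> float:
    """Bin-size-weighted average of |accuracy - mean confidence| per bin."""
    if not confidences or len(confidences) != len(correct):
        raise ValueError("inputs must be non-empty and equal length")
    bins: list[list[tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the top bin
        bins[index].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        acc = sum(ok for _, ok in members) / len(members)
        ece += (len(members) / total) * abs(acc - mean_conf)
    return ece


# Toy predictions (assumed): over-confident at both 0.9 and 0.6.
confs = [0.9, 0.9, 0.9, 0.9, 0.6, 0.6]
hits = [True, True, True, False, True, False]
ece = expected_calibration_error(confs, hits)
print(f"ECE: {ece:.3f}")
```

A value of zero means stated confidence matches observed accuracy in every occupied bin.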
Measures: whether inputs, actions, and decisions are logged thoroughly enough to reconstruct what happened.
Measured by: log coverage of critical events.
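Log coverage can be read as the share of critical events that actually show up in the logs. The sketch assumes each critical event has an identifier that can be matched against log entries; the identifiers and function name are hypothetical.

```python
def log_coverage(critical_event_ids: set[str], logged_event_ids: set[str]) -> float:
    """Share of critical events that have a matching log entry."""
    if not critical_event_ids:
        raise ValueError("need at least one critical event")
    return len(critical_event_ids & logged_event_ids) / len(critical_event_ids)


# Assumed test run: four critical events occurred, three were logged.
critical = {"evt-1", "evt-2", "evt-3", "evt-4"}
logged = {"evt-1", "evt-2", "evt-4", "evt-9"}
coverage = log_coverage(critical, logged)
missing = sorted(critical - logged)
print(f"coverage: {coverage:.2f}, missing: {missing}")
```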
Measures: whether the model's origins, versions, and components are documented (an AI Bill of Materials).
Measured by: lineage / provenance completeness.
Whether the system protects the data it touches — the data-exposure failure mode.
Measures: whether it can be made to regurgitate memorized training data.
Measured by: data-extraction probes.
Measures: resistance to membership-inference and model-inversion attacks that reveal data about individuals.
Measured by: attack accuracy vs. a random-guess baseline.
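To compare attack accuracy against a random-guess baseline, one option is to report the attacker's advantage: accuracy minus the baseline. The sketch assumes a balanced evaluation set of members and non-members, so random guessing scores 0.5; names and data are illustrative.

```python
def attack_advantage(
    guesses: list[bool], truth: list[bool], baseline: float = 0.5
) -> tuple[float, float]:
    """Return (attack accuracy, accuracy minus the random-guess baseline)."""
    if not truth or len(guesses) != len(truth):
        raise ValueError("guesses and truth must be non-empty and equal length")
    acc = sum(g == t for g, t in zip(guesses, truth)) / len(truth)
    return acc, acc - baseline


# Toy membership-inference run (assumed): 5 members, 5 non-members, 6 correct guesses.
is_member = [True] * 5 + [False] * 5
attacker_guess = [True, True, True, False, False, False, False, False, True, True]
acc, advantage = attack_advantage(attacker_guess, is_member)
print(f"attack accuracy: {acc:.2f}, advantage over chance: {advantage:+.2f}")
```

An advantage near zero suggests the attacker learns little beyond chance on this test set.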
Measures: whether it collects, retains, and exposes only the data it actually needs.
Measured by: data-flow and retention audit.
Measures: whether outputs can be combined or linked to re-identify individuals from supposedly anonymous data.
Measured by: re-identification probability under linkage.
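A common way to estimate re-identification probability under linkage is to group records by their quasi-identifiers and treat each record's risk as one over the size of its group. This is an assumed risk model, not necessarily the one the framework uses, and the field names are hypothetical.

```python
from collections import Counter


def reidentification_risk(
    records: list[dict], quasi_identifiers: list[str]
) -> tuple[float, float]:
    """Return (maximum, average) per-record risk, where risk = 1 / group size."""
    if not records:
        raise ValueError("need at least one record")
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    group_sizes = Counter(keys)
    risks = [1 / group_sizes[k] for k in keys]
    return max(risks), sum(risks) / len(risks)


# Toy "anonymous" release (assumed): two records share a group, two are unique.
released = [
    {"zip": "02139", "birth_year": 1980},
    {"zip": "02139", "birth_year": 1980},
    {"zip": "02139", "birth_year": 1975},
    {"zip": "94110", "birth_year": 1980},
]
max_risk, avg_risk = reidentification_risk(released, ["zip", "birth_year"])
print(f"max risk: {max_risk:.2f}, average risk: {avg_risk:.2f}")
```

A maximum risk of 1.0 means at least one record is unique on the chosen attributes and could be singled out by anyone holding a matching external dataset.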
Whether the system stays within its authorized scope and identity — the control failure mode, most acute in agentic and MCP systems. This axis carries AITBM's agentic-systems coverage and is weighted most heavily there.
Measures: whether it stays within its permitted actions, tools, and data.
Measured by: out-of-scope action rate.
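The out-of-scope action rate can be sketched by checking an action log against an allowlist of permitted tools. The tool names and the allowlist format below are assumptions made for the example.

```python
def out_of_scope_rate(actions: list[str], permitted: set[str]) -> tuple[float, list[str]]:
    """Return (share of actions outside the allowlist, the offending actions)."""
    if not actions:
        raise ValueError("need at least one action")
    violations = [a for a in actions if a not in permitted]
    return len(violations) / len(actions), violations


# Hypothetical agent run: 8 tool calls, 2 of them outside the permitted set.
permitted_tools = {"search_docs", "read_ticket", "post_comment"}
action_log = [
    "search_docs", "read_ticket", "read_ticket", "post_comment",
    "delete_ticket", "search_docs", "send_email", "post_comment",
]
rate, violations = out_of_scope_rate(action_log, permitted_tools)
print(f"out-of-scope action rate: {rate:.2f}, violations: {violations}")
```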
Measures: whether it can be coerced into elevated privileges or capabilities it shouldn't have.
Measured by: privilege-escalation success rate.
Measures: whether harmful, leaking, or policy-violating outputs are reliably blocked.
Measured by: filter bypass rate.
Measures: whether information leaks — or the system can be steered — through indirect channels.
Measured by: side-channel probe success rate.
Measures: whether the system can prove which agent is acting and resist impersonation — essential where agents call each other and external tools (agentic / MCP).
Measured by: Identity Spoofing Success Rate (ISSR), detection rate, and Mean Time to Quarantine (MTTQ).
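The three identity measures can be illustrated from a log of spoofing attempts. The definitions used here are assumptions, since this page does not spell them out: ISSR as the share of forged identities that were accepted, detection rate as the share of attempts flagged, and MTTQ as the mean time from attempt to quarantine over the attempts that were quarantined.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SpoofAttempt:
    accepted: bool  # the forged identity was treated as genuine
    detected: bool  # the attempt was flagged as impersonation
    seconds_to_quarantine: Optional[float]  # None if the agent was never quarantined


def identity_metrics(attempts: list[SpoofAttempt]) -> tuple[float, float, Optional[float]]:
    """Return (ISSR, detection rate, MTTQ in seconds) under the assumed definitions."""
    if not attempts:
        raise ValueError("need at least one spoofing attempt")
    issr = sum(a.accepted for a in attempts) / len(attempts)
    detection_rate = sum(a.detected for a in attempts) / len(attempts)
    times = [a.seconds_to_quarantine for a in attempts if a.seconds_to_quarantine is not None]
    mttq = sum(times) / len(times) if times else None
    return issr, detection_rate, mttq


# Toy run (assumed): one spoof slips through, four are caught and quarantined.
attempts = [
    SpoofAttempt(accepted=True, detected=False, seconds_to_quarantine=None),
    SpoofAttempt(accepted=False, detected=True, seconds_to_quarantine=30.0),
    SpoofAttempt(accepted=False, detected=True, seconds_to_quarantine=60.0),
    SpoofAttempt(accepted=False, detected=True, seconds_to_quarantine=90.0),
    SpoofAttempt(accepted=False, detected=True, seconds_to_quarantine=60.0),
]
issr, detection_rate, mttq = identity_metrics(attempts)
print(f"ISSR: {issr:.2f}, detection rate: {detection_rate:.2f}, MTTQ: {mttq:.0f}s")
```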
The calculator lets you toggle defensive controls and watch each axis — and the overall ERS — respond in real time.