Axis 1 — Robustness (Ro)
- Ro-1 Adversarial Input Resistance
- Ro-2 Distribution Shift Resilience
- Ro-3 Output Consistency
- Ro-4 Poisoning Attack Resistance
Three layers, composed into a single Effective Risk Score — without losing the dimensions that matter.
IN PLAIN TERMS
AITBM scores an AI system in four moves: rate the system on 21 checks grouped into five areas (the IVP); adjust for how risky the deployment is, meaning how autonomous it is, how exposed, and how far a failure could spread (the ORP); discount for how stale or incomplete the evidence is (the ACI); then combine all three into a single 0–10 Effective Risk Score (ERS). The sections below define each layer, and every formula comes with a plain-language reading.
A strong result in one lens cannot hide a weak result in another.
Asks: how strong is the system itself?
LAYER 2 · THE DEPLOYMENT
Asks: how risky is where and how it's used?
LAYER 3 · THE EVIDENCE
Asks: how much can we trust what we know?
LAYER 1
21 sub-metrics across five axes. Each is scored 0–4 against a fully specified five-level rubric, so the score reflects the system, not the assessor. Architecture-specific weights apply — agentic and MCP systems weight Containment more heavily; RAG systems weight Privacy.
The axis carrying AITBM's agentic-systems coverage.
The axes aren't a wish-list. Each one names a distinct category of AI failure that the others can't account for, drawn from the trust properties established in AI-assurance literature and the threat taxonomies AITBM maps to (MITRE ATLAS, OWASP, NIST AI RMF).
They are kept separate on purpose: robustness, fairness, and accuracy provably cannot all be maximized at once, so collapsing them into one number would hide a trade-off the reviewer needs to see.
A candidate property only becomes one of the 21 sub-metrics if it passes four tests — which is why there are 21 and not 50.
Each axis then carries an architecture-specific weight — agentic and MCP systems weight Containment most heavily, RAG systems weight Privacy — so the same 21 criteria adapt to what actually matters for the system in front of you.
Every sub-metric is scored against five fully specified levels. There are no partial rubrics and no discretionary bands — the criteria are operationally concrete so two assessors land on the same number. Scores are recorded to two decimals (0.00–4.00).
See what each metric across all five axes measures and how it's tested — plus how the shared five-level rubric is applied, with a fully worked example.
LAYER 2
Intrinsic security cannot tell a high-stakes deployment from a trivial one. ORP captures the deployment context through four dimensions, producing a Compound Risk Multiplier (CRM).
Independent decision authority
Exposure to untrusted inputs
Maximum downstream impact
Difficulty of fixing once found
CRM step table, by the count of simultaneously elevated dimensions (score > 0.75): 1.00 for zero or one, 1.15 for two, 1.35 for three, 1.60 for four, with 1.75 as the absolute framework cap. A system that is at once highly autonomous, highly exposed, and on a critical cascade path compounds beyond what a weighted sum captures, and operational risk can never be fully nullified by intrinsic strength.
LAYER 3
An assessment is only as trustworthy as its evidence is fresh. The ACI applies a Beta-Binomial-informed temporal decay, with tier-specific decay rates: high-risk systems decay faster and re-assess sooner.
Decay model
ACI(t) = ACI₀ × e^(−λt)
λ is tier-specific — larger for Tier I (high-risk) systems, smaller for Tier III.
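As a rough illustration of the decay model above, the sketch below evaluates ACI(t) and inverts it to find when confidence falls to a chosen level. The decay rate, the time unit (days), and the 0.45 trigger level are assumptions made for the example, not values stated by the framework.

```python
import math

def aci_at(aci0: float, lam: float, t_days: float) -> float:
    """ACI(t) = ACI0 * e^(-lambda * t), with t assumed to be in days."""
    return aci0 * math.exp(-lam * t_days)

def days_until(aci0: float, level: float, lam: float) -> float:
    """Solve ACI0 * e^(-lambda * t) = level for t."""
    if not 0 < level <= aci0:
        raise ValueError("level must be in (0, aci0]")
    return math.log(aci0 / level) / lam

# Hypothetical decay rate: the value that halves confidence every 30 days.
LAMBDA_EXAMPLE = math.log(2) / 30

print(round(aci_at(0.90, LAMBDA_EXAMPLE, 30), 4))
print(round(days_until(0.90, 0.45, LAMBDA_EXAMPLE), 1))
```

A larger λ shortens the time to any given level, which is how a higher-risk tier ends up being re-assessed sooner.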
Re-assessment triggers
The three layers compose into a single 0–10 score, with a residual risk floor that cannot be zeroed out.
ERS = α + (1 − α) × f(IVP, ORP, ACI) where α = 0.15
α (alpha) is the residual-risk floor — the share of risk that remains even when every control is in place and the system scores perfectly on intrinsic security. In the formula it pins the bottom of the scale: the α term is the risk that can't be removed, and the (1 − α) term is the portion that strong controls can actually reduce. Setting α > 0 is what stops a confident assessment from ever claiming zero risk.
It reflects reality
AI systems are probabilistic and can behave emergently. Some irreducible risk always survives — a zero would be a false guarantee.
0.15 is calibrated, not arbitrary
The value is fixed against the Finbot validation case so the worked example reproduces its known result — and it stays consistent across every assessment.
It's a deliberate default
~15% residual risk is a conservative, defensible floor: high enough to be honest about uncertainty, low enough that real mitigation still moves the score. It's earmarked for ongoing sensitivity validation.
A worked illustration: a system with perfect intrinsic security still floors at α — about 1.5 on the 0–10 scale — never 0.
ERS ≥ 9.0
High-Risk → Tier I assessment (full depth, aggressive decay).
ERS 6.0 – 8.9
Medium-Risk → Tier II assessment (abbreviated IVP, standard decay).
ERS < 6.0
Lower-Risk → Tier III assessment (simplified scoring, relaxed decay).
Tiered pathways match assessment depth to system risk. Tier I covers all 21 sub-metrics; Tier II evaluates at the axis level; Tier III uses a simplified profile for internal, lower-stakes systems.
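The tier bands above can be read as a simple threshold lookup. This is a minimal sketch; treating any score from 6.0 up to (but not including) 9.0 as Tier II is an assumption that closes the gap between the published 8.9 and 9.0 boundaries.

```python
def assessment_tier(ers: float) -> str:
    """Map an Effective Risk Score (0-10) to its assessment tier."""
    if not 0.0 <= ers <= 10.0:
        raise ValueError("ERS must lie in [0, 10]")
    if ers >= 9.0:
        return "Tier I"    # High-Risk: full depth, aggressive decay
    if ers >= 6.0:
        return "Tier II"   # Medium-Risk: abbreviated IVP, standard decay
    return "Tier III"      # Lower-Risk: simplified scoring, relaxed decay

for score in (9.4, 7.3, 1.5):
    print(score, assessment_tier(score))
```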
WHY THESE FORMULAS
Every score in AITBM is computed, not judged. Each formula below is paired with a plain-language reading of what it does and what each variable means, plus the design reasoning for its shape: why a geometric mean here, a hard floor there, a conservative minimum somewhere else.
Axis = Σ(wᵢ × SubMetricᵢ) / Σ(wᵢ)
Basis. A weighted average of an axis's sub-metric scores. Weights reflect what matters for the architecture, and dividing by the sum of weights redistributes any non-applicable sub-metric instead of dragging the axis toward zero.
Variables. SubMetricᵢ sub-metric score (0–1); wᵢ its fixed weight for the architecture class (LLM / Classifier / Agentic); denominator the sum of active weights (axis weights total 1.00).
A weighted arithmetic mean fits sub-metrics on a shared 0–1 scale where strength in one can partly offset weakness in another within the axis; that intra-axis compensation is acceptable, which is precisely why the five axes themselves are never averaged together. Normalising by Σwᵢ keeps the score architecture-invariant — removing a non-applicable sub-metric never changes the achievable maximum — and weights fixed by tier and architecture (not chosen by the assessor) make two assessors compute the same number.
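The renormalisation described above can be sketched as follows. The weights and scores are hypothetical, inputs are assumed to be already on the 0–1 scale, and `None` is used here as an illustrative marker for a non-applicable sub-metric.

```python
from typing import Optional

def axis_score(scores: dict[str, Optional[float]],
               weights: dict[str, float]) -> float:
    """Weighted mean over applicable sub-metrics (None = not applicable)."""
    active = {k: s for k, s in scores.items() if s is not None}
    total_weight = sum(weights[k] for k in active)
    if total_weight == 0:
        raise ValueError("no applicable sub-metrics on this axis")
    # Dividing by the sum of *active* weights redistributes the missing weight.
    return sum(weights[k] * s for k, s in active.items()) / total_weight

# Hypothetical weights for the four Robustness sub-metrics (sum to 1.00).
weights = {"Ro-1": 0.30, "Ro-2": 0.25, "Ro-3": 0.25, "Ro-4": 0.20}
scores = {"Ro-1": 0.75, "Ro-2": 0.50, "Ro-3": 1.00, "Ro-4": None}

print(round(axis_score(scores, weights), 4))
```

Because the denominator shrinks along with the numerator, an axis whose remaining sub-metrics all score 1.0 still reaches 1.0 after one is marked non-applicable.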
IVP = (Ro, Fa, Tr, Pr, Cn)
Basis. The Layer-1 result is a five-number profile, read like a radar chart — not a single score.
Variables. Ro, Fa, Tr, Pr, Cn the five weighted axis scores (0–1; higher = stronger intrinsic assurance).
Accuracy, adversarial robustness, and fairness provably cannot be maximised at once, so any scalar that fuses the axes necessarily hides a trade-off the reader needs to see. Keeping IVP a vector preserves that signal; a single number is derived only when unavoidable, as the architecture-weighted projection W_ivp · IVP inside the ERS — and even then the vector stays primary.
N_elevated = count{ Aa, As, Cp, Rf : score > 0.75 }
CRM: 0–1 → 1.00, 2 → 1.15, 3 → 1.35, 4 → 1.60
Basis. Count how many operational dimensions are simultaneously elevated (above 0.75); the more there are, the more risk compounds beyond what a simple average shows. Two or more also raise a mandatory Compound Risk Alert.
Variables. Aa, As, Cp, Rf the four ORP dimensions; 0.75 the elevated threshold; N_elevated the count (0–4) → CRM (1.00–1.60, 1.75 absolute cap).
A weighted sum is additive and treats risks as substitutable; counting isolates the interaction effect — not how high any one dimension is (the weighted sum already captures that) but how many are high at once, which is what compounds. The CRM is a bounded, super-additive step function of the count, published as a table so the correction stays auditable and reproducible.
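A minimal sketch of the count-then-lookup step described above. The mapping of zero or one elevated dimension to 1.00 is inferred from the stated 1.00–1.60 range, and the dimension scores are invented for illustration.

```python
ELEVATED_THRESHOLD = 0.75
# Step table: count of elevated dimensions -> Compound Risk Multiplier.
CRM_TABLE = {0: 1.00, 1: 1.00, 2: 1.15, 3: 1.35, 4: 1.60}

def compound_risk(orp: dict[str, float]) -> tuple[int, float, bool]:
    """Return (N_elevated, CRM, compound_risk_alert) for the four ORP dimensions."""
    n_elevated = sum(1 for score in orp.values() if score > ELEVATED_THRESHOLD)
    # Two or more elevated dimensions raise the mandatory Compound Risk Alert.
    return n_elevated, CRM_TABLE[n_elevated], n_elevated >= 2

# Hypothetical scores; 0.75 itself is not elevated (strictly greater than).
orp = {"Aa": 0.90, "As": 0.80, "Cp": 0.75, "Rf": 0.40}
print(compound_risk(orp))
```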
ORP_effective = (W_orp · ORP) × CRM
Basis. Combine the four dimensions under the tier weights, then amplify by the compounding multiplier — the single operational score the ERS consumes.
Variables. W_orp · ORP the tier-weighted sum of the four dimensions; CRM the compound multiplier.
Separating the linear part (the weighted sum) from the interaction part (the multiplier) keeps each interpretable: the sum answers how much operational risk on average, the multiplier how strongly those risks reinforce one another. Applying CRM multiplicatively makes it a proportional surcharge that means the same thing whether the base score is high or low.
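To make the "proportional surcharge" reading concrete, here is a small sketch of ORP_effective. The equal tier weights are hypothetical placeholders, and the CRM is passed in as a precomputed value.

```python
def orp_effective(orp: dict[str, float],
                  weights: dict[str, float],
                  crm: float) -> float:
    """ORP_effective = (W_orp . ORP) * CRM."""
    weighted_sum = sum(weights[dim] * orp[dim] for dim in orp)
    return weighted_sum * crm

# Hypothetical equal tier weights; the real values depend on the tier.
weights = {"Aa": 0.25, "As": 0.25, "Cp": 0.25, "Rf": 0.25}
high = {"Aa": 0.90, "As": 0.80, "Cp": 0.50, "Rf": 0.40}   # weighted sum 0.65
low = {"Aa": 0.40, "As": 0.40, "Cp": 0.40, "Rf": 0.40}    # weighted sum 0.40

# The same CRM of 1.15 adds 15% to either base score.
print(round(orp_effective(high, weights, 1.15), 4))
print(round(orp_effective(low, weights, 1.15), 4))
```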
Ec = Base_Coverage × Independence × Fidelity
Basis. Evaluation quality is the product of how much was tested, how independent the tester was, and how production-like the environment was — a weakness in any one drags the whole score down.
Variables. Base_Coverage fraction of applicable sub-metrics tested; Independence 0.60 self / 0.80 internal / 1.00 external; Fidelity 0.70 dev / 0.85 staging / 0.95 verified / 1.00 production.
A product encodes necessity, not trade-off: self-assessment (0.60) caps Ec at 0.60 even with full coverage in production. It is the same weakest-link logic the ACI geometric mean uses one level up, applied here to the inputs of a single component — and the multipliers are discrete, evidence-anchored levels, so the result is reproducible rather than a judgment call.
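The capping behaviour can be checked with a short sketch built on the multiplier levels listed above. The function name and the example counts are illustrative only.

```python
INDEPENDENCE = {"self": 0.60, "internal": 0.80, "external": 1.00}
FIDELITY = {"dev": 0.70, "staging": 0.85, "verified": 0.95, "production": 1.00}

def evaluation_coverage(tested: int, applicable: int,
                        independence: str, fidelity: str) -> float:
    """Ec = Base_Coverage * Independence * Fidelity."""
    if applicable <= 0 or not 0 <= tested <= applicable:
        raise ValueError("need 0 <= tested <= applicable and applicable > 0")
    base_coverage = tested / applicable
    return base_coverage * INDEPENDENCE[independence] * FIDELITY[fidelity]

# Full coverage in production, but self-assessed: capped at 0.60.
print(round(evaluation_coverage(21, 21, "self", "production"), 4))
# Partial coverage, external assessor, staging environment.
print(round(evaluation_coverage(18, 21, "external", "staging"), 4))
```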
Tf = min(T_calendar, C_event, C_monitor, C_evidence)
T_calendar = e^(−λ_eff · Δt_days)
Basis. Freshness is the most pessimistic of a calendar decay term and three event ceilings — a base-model swap or a monitoring blackout caps it regardless of the calendar. The lowest cap wins.
Variables. T_calendar time decay; C_event / C_monitor / C_evidence caps for change events, monitoring gaps, unresolved drift; λ_eff = λ_tier × M_TDI × M_threat; Δt_days days since sign-off.
The caps are independent invalidating conditions, so min() is the conservative, non-compensatory combinator — a strong calendar score cannot paper over a model swap. Exponential decay gives a constant, interpretable half-life per tier (Tier 1 ≈ 30 days … Tier 4 ≈ 365 days), and a multiplicative λ_eff lets drift and threat accelerate ageing proportionally — a system drifting under active threat ages far faster.
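The sketch below combines the calendar decay with the event ceilings. It derives λ_tier from a half-life via λ = ln 2 / half-life (the standard relation for exponential decay); the modifier and cap values are invented for the example.

```python
import math

def freshness(half_life_days: float, dt_days: float,
              m_tdi: float = 1.0, m_threat: float = 1.0,
              c_event: float = 1.0, c_monitor: float = 1.0,
              c_evidence: float = 1.0) -> float:
    """Tf = min(T_calendar, C_event, C_monitor, C_evidence)."""
    lam_tier = math.log(2) / half_life_days      # decay rate from half-life
    lam_eff = lam_tier * m_tdi * m_threat        # drift and threat speed up ageing
    t_calendar = math.exp(-lam_eff * dt_days)
    return min(t_calendar, c_event, c_monitor, c_evidence)

# 30-day half-life, 30 days since sign-off, no modifiers or caps: half the freshness.
print(round(freshness(30, 30), 4))
# Five days old, but a hypothetical event cap of 0.30 wins over the calendar term.
print(round(freshness(30, 5, c_event=0.30), 4))
# Doubling the effective decay rate quarters the freshness at one half-life.
print(round(freshness(30, 30, m_tdi=2.0), 4))
```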
TDI = 0.25·CSD + 0.30·BOD + 0.20·DRD + 0.15·TCD + 0.10·MGD
Basis. One 0–1 number for how far the system has drifted from what was assessed, weighting observed behavioural change highest and monitoring gaps lowest. It feeds the modifier that accelerates freshness decay.
Variables. CSD config, BOD behaviour, DRD data/retrieval, TCD threat/control, MGD monitoring gap (each 0–1).
A weighted sum is right here — unlike the CRM — because small drifts across several categories should accumulate into a moderate index. Fixed weights summing to 1.00 keep it on [0, 1] and reproducible, and the ordering (BOD > CSD > DRD > TCD > MGD) encodes that observed behavioural change is the strongest evidence an assessment is stale, while a monitoring gap is a weaker, indirect signal.
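A short sketch of the TDI using the published weights. The component readings are hypothetical, chosen to show several small drifts accumulating into a moderate index.

```python
TDI_WEIGHTS = {"CSD": 0.25, "BOD": 0.30, "DRD": 0.20, "TCD": 0.15, "MGD": 0.10}

def temporal_drift_index(components: dict[str, float]) -> float:
    """TDI = 0.25*CSD + 0.30*BOD + 0.20*DRD + 0.15*TCD + 0.10*MGD."""
    for name, value in components.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1]")
    return sum(TDI_WEIGHTS[name] * components[name] for name in TDI_WEIGHTS)

# Hypothetical readings: moderate behavioural drift, minor drift elsewhere.
drift = {"CSD": 0.20, "BOD": 0.50, "DRD": 0.10, "TCD": 0.00, "MGD": 0.40}
print(round(temporal_drift_index(drift), 4))
```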
ACI = (Pc × Ec × Tf)^(1/3)
Basis. Assurance is the geometric mean of the three evidence dimensions — if any one is near zero, overall confidence is near zero. Unknown provenance cannot be out-tested.
Variables. Pc provenance, Ec evaluation coverage, Tf freshness (each 0–1). Weights are fixed at ⅓ each (general form Pc^w₁ · Ec^w₂ · Tf^w₃ with w₁ + w₂ + w₃ = 1).
The geometric mean is correct for jointly necessary, non-substitutable prerequisites: it is zero whenever any factor is zero, it rewards a balanced profile over a lopsided one, and on [0, 1] it always sits at or below the arithmetic mean — the conservative choice. Equal, tier-independent weights are deliberate: deployment criticality is already carried inside Tf (its decay constant and caps), so re-weighting by tier would double-count it and break cross-assessment comparability.
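The properties claimed for the geometric mean are easy to check numerically. The sketch below uses invented Pc, Ec, and Tf values and compares the result against the arithmetic mean.

```python
def aci(pc: float, ec: float, tf: float) -> float:
    """ACI = (Pc * Ec * Tf)^(1/3), the equal-weight geometric mean."""
    for value in (pc, ec, tf):
        if not 0.0 <= value <= 1.0:
            raise ValueError("Pc, Ec and Tf must each lie in [0, 1]")
    return (pc * ec * tf) ** (1 / 3)

lopsided = (0.9, 0.6, 0.3)   # arithmetic mean 0.6
balanced = (0.6, 0.6, 0.6)   # arithmetic mean 0.6

print(round(aci(*lopsided), 4))   # falls below the arithmetic mean
print(round(aci(*balanced), 4))   # equals the arithmetic mean
print(aci(1.0, 1.0, 0.0))         # any zero factor zeroes the index
```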
ERS = min(10, ORP_eff × [ α + (1−α)(1 − W_ivp·IVP) ] × (1 / ACI) × S )
where α = 0.15 and S = 10
Basis. Start from operational risk, reduce it by intrinsic security but only down to a floor (never to zero), inflate it for low assurance confidence, scale to 0–10, and cap at 10. ORP sets the stakes, IVP mitigates, ACI adjusts for how much is actually known.
Variables. ORP_eff operational score; W_ivp·IVP intrinsic-security scalar (0–1); α = 0.15 residual floor; 1/ACI epistemic inflation; S = 10 scale; min(·,10) hard cap.
The bracket α + (1−α)(1 − W_ivp·IVP) maps intrinsic security onto the [α, 1] interval — 1.0 at zero intrinsic security (full risk), α = 0.15 at perfect (15% irreducible) — encoding that intrinsic quality reduces but cannot eliminate operational risk; a zero would be a false guarantee for a probabilistic system. Dividing by ACI makes opacity inflate the score proportionally rather than be ignored. The hard cap keeps ERS bounded and stops a near-zero ACI from blowing up — which signals "assessment invalid, refresh it", not "infinitely risky". α = 0.15 is calibrated against the Finbot anchor and earmarked for sensitivity validation.
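The composition can be traced with a small sketch. The input values are hypothetical, and returning the cap when ACI is zero is an assumption made here to avoid a division by zero; the source only says the cap stops a near-zero ACI from blowing up.

```python
def effective_risk_score(orp_eff: float, ivp_scalar: float, aci: float,
                         alpha: float = 0.15, scale: float = 10.0) -> float:
    """ERS = min(10, ORP_eff * [alpha + (1 - alpha)(1 - W_ivp.IVP)] * (1/ACI) * S)."""
    if aci <= 0:
        return scale  # assumption: treat zero confidence as hitting the hard cap
    bracket = alpha + (1 - alpha) * (1 - ivp_scalar)   # lies in [alpha, 1]
    return min(scale, orp_eff * bracket * (1 / aci) * scale)

# Perfect intrinsic security with ORP_eff = 1 and ACI = 1 still floors at 1.5.
print(round(effective_risk_score(1.0, 1.0, 1.0), 4))
# Moderate intrinsic security, with low confidence inflating the score.
print(round(effective_risk_score(0.7475, 0.6, 0.5), 4))
# Near-zero ACI hits the hard cap instead of growing without bound.
print(effective_risk_score(0.7475, 0.6, 0.001))
```

Note that the 1.5 floor in the first case holds for ORP_eff = 1 and ACI = 1; other operational or confidence values scale it proportionally.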
BBD(t) = D_JS( P_baseline ‖ P_current(t) )
Basis. The statistical distance between how the system behaved when assessed and how it behaves now. Crossing thresholds (0.15 / 0.35 / 0.60) escalates from alert to mandatory reassessment to automated quarantine.
Variables. P_baseline behaviour at assessment (decision patterns, tool-call targets, output intent, memory access); P_current(t) the same distributions now; D_JS the Jensen–Shannon divergence.
Jensen–Shannon divergence is chosen over Kullback–Leibler for three reasons: it is symmetric (neither distribution is privileged as "true"), it is bounded on [0, 1] when computed with base-2 logarithms, so fixed thresholds stay stable and comparable across systems, and it stays finite even when one distribution assigns zero probability to an event the other allows. That case is common when a new tool-call target or novel output appears, where KL divergence would be infinite.
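A minimal pure-Python sketch of the divergence and the threshold escalation follows. The distributions are invented, and treating a threshold as crossed when the value is greater than or equal to it is an assumption.

```python
import math

def _kl_bits(p: list[float], q: list[float]) -> float:
    """KL(p || q) in bits; terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p: list[float], q: list[float]) -> float:
    """Jensen-Shannon divergence with base-2 logs, bounded on [0, 1]."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]   # mixture is > 0 wherever p or q is
    return 0.5 * _kl_bits(p, m) + 0.5 * _kl_bits(q, m)

def escalation(bbd: float) -> str:
    """Map a BBD value to the response levels named in the text."""
    if bbd >= 0.60:
        return "automated quarantine"
    if bbd >= 0.35:
        return "mandatory reassessment"
    if bbd >= 0.15:
        return "alert"
    return "within baseline"

# Hypothetical tool-call target distributions; the third target is new.
baseline = [0.7, 0.3, 0.0]
current = [0.5, 0.3, 0.2]
bbd = js_divergence(baseline, current)
print(round(bbd, 4), escalation(bbd))
```

The zero-probability entry in `baseline` would make KL(current ‖ baseline) infinite, while the Jensen–Shannon value stays finite because each distribution is only compared against the mixture.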