The AITBM framework

Three layers, composed into a single Effective Risk Score — without losing the dimensions that matter.

IN PLAIN TERMS

AITBM scores an AI system in four moves: (1) rate the system on 21 checks grouped into five areas (the IVP); (2) adjust for how risky the deployment is, meaning how autonomous it is, how exposed, how far a failure could spread, and how hard it is to fix (the ORP); (3) discount for how stale or incomplete the evidence is (the ACI); (4) combine all three into a single 0–10 Effective Risk Score (ERS). The sections below define each layer, and every formula comes with a plain-language reading.

THE MODEL AT A GLANCE

Three lenses, combined into one score

A strong result in one lens cannot hide a weak result in another.

LAYER 1 · THE SYSTEM

IVP

Intrinsic Vulnerability Profile

Asks: how strong is the system itself?

LAYER 2 · THE DEPLOYMENT

ORP

Operational Risk Posture

Asks: how risky is where and how it's used?

LAYER 3 · THE EVIDENCE

ACI

Assurance Confidence Index

Asks: how much can we trust what we know?

ERS

Effective Risk Score — one number, 0–10

LAYER 1

IVP — Intrinsic Vulnerability Profile

21 sub-metrics across five axes. Each is scored 0–4 against a fully specified five-level rubric, so the score reflects the system, not the assessor. Architecture-specific weights apply — agentic and MCP systems weight Containment more heavily; RAG systems weight Privacy.

Axis 1 — Robustness (Ro)

Ro-1 Adversarial Input Resistance
Ro-2 Distribution Shift Resilience
Ro-3 Output Consistency
Ro-4 Poisoning Attack Resistance

Axis 2 — Fairness (Fa)

Fa-1 Demographic Parity
Fa-2 Calibration Consistency
Fa-3 Representation Bias
Fa-4 Counterfactual Fairness

Axis 3 — Transparency (Tr)

Tr-1 Explainability Depth
Tr-2 Confidence Calibration
Tr-3 Audit Trail Completeness
Tr-4 Model Lineage Disclosure

Axis 4 — Privacy (Pr)

Pr-1 Training Data Leakage Risk
Pr-2 Inference Attack Resistance
Pr-3 Data Minimization Compliance
Pr-4 Re-identification Risk

Axis 5 — Containment (Cn)

The axis carrying AITBM's agentic-systems coverage.

Cn-1 Scope Enforcement
Cn-2 Escalation Prevention
Cn-3 Output Filtering Robustness
Cn-4 Side-Channel Resistance
Cn-5 Agent Identity Integrity

Why these five axes?

The axes aren't a wish-list. Each one names a distinct category of AI failure that the others can't account for, drawn from the trust properties established in AI-assurance literature and the threat taxonomies AITBM maps to (MITRE ATLAS, OWASP, NIST AI RMF).

Robustness — does it hold up under adversarial or shifting inputs? (attack-resistance failures)
Fairness — does it treat groups equitably? (discrimination & harm failures)
Transparency — can a human understand and audit its decisions? (accountability failures)
Privacy — does it protect the data it touches? (data-exposure failures)
Containment — does it stay within its authorized scope and identity? (control failures, most acute in agentic systems)

They are kept separate on purpose: robustness, fairness, and accuracy provably cannot all be maximized at once, so collapsing them into one number would hide a trade-off the reviewer needs to see.

How each criterion earns its place

A candidate property only becomes one of the 21 sub-metrics if it passes four tests — which is why there are 21 and not 50.

1. Measurable. A concrete test or signal produces evidence, not an assessor's opinion.
2. Independently scorable. It maps cleanly onto five distinguishable, operationally concrete 0–4 levels.
3. Non-overlapping. It isn't already captured by another sub-metric, so nothing is double-counted.
4. Threat-anchored. It maps to a documented threat or control in a standard AITBM aligns to.

Each axis then carries an architecture-specific weight — agentic and MCP systems weight Containment most heavily, RAG systems weight Privacy — so the same 21 criteria adapt to what actually matters for the system in front of you.

Five-level scoring rubrics

Every sub-metric is scored against five fully specified levels. There are no partial rubrics and no discretionary bands — the criteria are operationally concrete so two assessors land on the same number. Scores are recorded to two decimals (0.00–4.00).

0.00: absent
1.00: minimal
2.00: partial
3.00: substantial
4.00: comprehensive

LAYER 2

ORP — Operational Risk Posture

Intrinsic security cannot tell a high-stakes deployment from a trivial one. ORP captures the deployment context through four dimensions, producing a Compound Risk Multiplier (CRM).

Aa — Autonomy Amplification

Independent decision authority

As — Attack Surface Exposure

Exposure to untrusted inputs

Cp — Cascade Potential

Maximum downstream impact

Rf — Remediation Feasibility

Difficulty of fixing once found

CRM step table, by the count of simultaneously elevated dimensions (score > 0.75): 1.00 for zero or one, 1.15 for two, 1.35 for three, 1.60 for four, with 1.75 as the absolute framework cap. A system that is at once highly autonomous, highly exposed, and on a critical cascade path compounds beyond what a weighted sum captures, and operational risk can never be fully nullified by intrinsic strength.

LAYER 3

ACI — Assurance Confidence Index

An assessment is only as trustworthy as its evidence is fresh. The ACI applies a Beta-Binomial-informed temporal decay, with tier-specific decay rates: high-risk systems decay faster and re-assess sooner.

Decay model

ACI(t) = ACI₀ × e^(−λt)

λ is tier-specific — larger for Tier I (high-risk) systems, smaller for Tier III.

Re-assessment triggers

ACI < 0.70 — Warning: schedule re-assessment.
ACI < 0.50 — Critical: evidence materially stale.
ACI < 0.30 — Invalid: score no longer defensible.
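
The decay model and the three triggers above can be sketched in a few lines of Python. This is a minimal illustration, not the framework's reference implementation. The function names are hypothetical, and the decay rate is an assumed value derived from a 30-day half-life (the Tier 1 figure quoted in the Temporal Freshness notes), since this section does not list λ per tier.

```python
import math

# Trigger thresholds from the list above, most severe first.
THRESHOLDS = [(0.30, "invalid"), (0.50, "critical"), (0.70, "warning")]


def aci_at(aci0: float, lam: float, t_days: float) -> float:
    """ACI(t) = ACI0 * exp(-lam * t)."""
    return aci0 * math.exp(-lam * t_days)


def aci_status(aci: float) -> str:
    """Return the most severe trigger that the current ACI falls under."""
    for limit, label in THRESHOLDS:
        if aci < limit:
            return label
    return "ok"


def days_until(aci0: float, lam: float, threshold: float) -> float:
    """Days until ACI decays to `threshold`; 0 if it is already there."""
    if aci0 <= threshold:
        return 0.0
    return math.log(aci0 / threshold) / lam


# Assumed rate: a 30-day half-life gives lam = ln(2) / 30.
lam = math.log(2) / 30
status_after_30_days = aci_status(aci_at(0.90, lam, 30))  # ACI has halved to 0.45
```

Solving the decay equation for t, as days_until does, turns each threshold into a re-assessment date that can be scheduled on the day of sign-off.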

ERS — the Effective Risk Score

The three layers compose into a single 0–10 score, with a residual risk floor that cannot be zeroed out.

ERS = min(10, ORP_eff × [ α + (1 − α)(1 − W_ivp·IVP) ] × (1 / ACI) × S )      where α = 0.15, S = 10

Why α = 0.15?

α (alpha) is the residual-risk floor — the share of risk that remains even when every control is in place and the system scores perfectly on intrinsic security. In the formula it pins the bottom of the scale: the α term is the risk that can't be removed, and the (1 − α) term is the portion that strong controls can actually reduce. Setting α > 0 is what stops a confident assessment from ever claiming zero risk.

It reflects reality

AI systems are probabilistic and can behave emergently. Some irreducible risk always survives — a zero would be a false guarantee.

0.15 is calibrated, not arbitrary

The value is fixed against the Finbot validation case so the worked example reproduces its known result — and it stays consistent across every assessment.

It's a deliberate default

~15% residual risk is a conservative, defensible floor: high enough to be honest about uncertainty, low enough that real mitigation still moves the score. It's earmarked for ongoing sensitivity validation.

A worked illustration: with perfect intrinsic security (W_ivp·IVP = 1) the intrinsic-security term in the full ERS formula reduces to α, so at ORP_eff = 1 and ACI = 1 the score is α × 10 = 1.5 on the 0–10 scale, never 0.

ERS ≥ 9.0

High-Risk → Tier I assessment (full depth, aggressive decay).

ERS ≥ 6.0 and < 9.0

Medium-Risk → Tier II assessment (abbreviated IVP, standard decay).

ERS < 6.0

Lower-Risk → Tier III assessment (simplified scoring, relaxed decay).

Tiered pathways match assessment depth to system risk. Tier I covers all 21 sub-metrics; Tier II evaluates at the axis level; Tier III uses a simplified profile for internal, lower-stakes systems.
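
As a small sketch, the band-to-tier mapping can be written as a lookup. The function name is hypothetical, and the handling of scores between 8.9 and 9.0 is an assumption: they are treated here as Tier II.

```python
def assessment_tier(ers: float) -> str:
    """Map a 0-10 Effective Risk Score to its assessment tier."""
    if not 0.0 <= ers <= 10.0:
        raise ValueError("ERS must lie on the 0-10 scale")
    if ers >= 9.0:
        return "Tier I"    # full depth, aggressive decay
    if ers >= 6.0:
        return "Tier II"   # abbreviated IVP, standard decay
    return "Tier III"      # simplified scoring, relaxed decay
```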

WHY THESE FORMULAS

Formula basis & variables

Every score in AITBM is computed, not judged. Each formula below is paired with a plain-language reading of what it does and what each variable means, plus the design reasoning for its shape: why a geometric mean here, a hard floor there, a conservative minimum somewhere else. Each Why this form note gives the deeper mathematical justification.

Axis score — weighted mean

Layer 1 · IVP

Axis = Σ(wᵢ × SubMetricᵢ) / Σ(wᵢ)

Basis. A weighted average of an axis's sub-metric scores. Weights reflect what matters for the architecture, and dividing by the sum of active weights redistributes the weight of any non-applicable sub-metric across the remaining ones instead of dragging the axis toward zero.

Variables. SubMetricᵢ: sub-metric score (0–1); wᵢ: its fixed weight for the architecture class (LLM / Classifier / Agentic); denominator: the sum of active weights (axis weights total 1.00).

Why this form

A weighted arithmetic mean fits sub-metrics on a shared 0–1 scale where strength in one can partly offset weakness in another within the axis; that intra-axis compensation is acceptable, which is precisely why the five axes themselves are never averaged together. Normalising by Σwᵢ keeps the score architecture-invariant — removing a non-applicable sub-metric never changes the achievable maximum — and weights fixed by tier and architecture (not chosen by the assessor) make two assessors compute the same number.
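
A minimal Python sketch of the weighted mean, showing how a non-applicable sub-metric drops out of both numerator and denominator. The weights and scores are made up for illustration. The sketch also assumes sub-metric scores arrive already on the 0–1 scale, since the step from the 0–4 rubric to 0–1 is not spelled out here.

```python
from typing import Mapping, Optional


def axis_score(scores: Mapping[str, Optional[float]],
               weights: Mapping[str, float]) -> float:
    """Weighted mean over applicable sub-metrics; None marks non-applicable."""
    active = {k: s for k, s in scores.items() if s is not None}
    total_weight = sum(weights[k] for k in active)
    if total_weight == 0:
        raise ValueError("no applicable sub-metrics on this axis")
    return sum(weights[k] * s for k, s in active.items()) / total_weight


# Hypothetical Robustness weights and scores; Ro-4 is treated as non-applicable.
ro_weights = {"Ro-1": 0.3, "Ro-2": 0.3, "Ro-3": 0.2, "Ro-4": 0.2}
ro_scores = {"Ro-1": 0.75, "Ro-2": 0.50, "Ro-3": 1.00, "Ro-4": None}
ro = axis_score(ro_scores, ro_weights)  # 0.575 / 0.8, about 0.719
```

Scoring Ro-4 as 0 instead of marking it non-applicable would give 0.575, which is the drag toward zero that dividing by the active weights avoids.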

IVP — output vector

Layer 1 · IVP

IVP = (Ro, Fa, Tr, Pr, Cn)

Basis. The Layer-1 result is a five-number profile, read like a radar chart — not a single score.

Variables. Ro, Fa, Tr, Pr, Cn: the five weighted axis scores (0–1; higher = stronger intrinsic assurance).

Why this form

Accuracy, adversarial robustness, and fairness provably cannot be maximised at once, so any scalar that fuses the axes necessarily hides a trade-off the reader needs to see. Keeping IVP a vector preserves that signal; a single number is derived only when unavoidable, as the architecture-weighted projection W_ivp · IVP inside the ERS — and even then the vector stays primary.
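
The sketch below keeps the profile as a vector and derives the scalar W_ivp · IVP only on request. It is illustrative: the class and function names, the example profile, and the Containment-heavy weights are all hypothetical, and the weights are assumed to sum to 1 so the projection stays on 0–1.

```python
from typing import NamedTuple, Sequence


class IVP(NamedTuple):
    """Five weighted axis scores, each on 0-1 (higher = stronger)."""
    ro: float
    fa: float
    tr: float
    pr: float
    cn: float


def project(ivp: IVP, w_ivp: Sequence[float]) -> float:
    """Return the scalar W_ivp . IVP; weights are assumed to sum to 1."""
    if len(w_ivp) != len(ivp):
        raise ValueError("need one weight per axis")
    if abs(sum(w_ivp) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(w * x for w, x in zip(w_ivp, ivp))


def weakest_axis(ivp: IVP) -> str:
    """Name of the lowest-scoring axis, which the scalar alone would hide."""
    return min(ivp._fields, key=lambda name: getattr(ivp, name))


profile = IVP(ro=0.8, fa=0.6, tr=0.7, pr=0.9, cn=0.5)
agentic_weights = (0.20, 0.10, 0.15, 0.15, 0.40)  # hypothetical values
scalar = project(profile, agentic_weights)        # about 0.66
```

The scalar of about 0.66 gives no hint that Containment sits at 0.5; weakest_axis(profile) recovers that from the vector.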

Compound Risk Multiplier

Layer 2 · ORP

N_elevated = count{ Aa, As, Cp, Rf : score > 0.75 }
CRM:  0–1 → 1.00   2 → 1.15   3 → 1.35   4 → 1.60

Basis. Count how many operational dimensions are simultaneously elevated (above 0.75); the more there are, the more risk compounds beyond what a simple average shows. Two or more also raise a mandatory Compound Risk Alert.

Variables. Aa, As, Cp, Rf: the four ORP dimensions; 0.75: the elevated threshold; N_elevated: the count (0–4), which maps to CRM (1.00–1.60, with 1.75 as the absolute cap).

Why this form

A weighted sum is additive and treats risks as substitutable; counting isolates the interaction effect — not how high any one dimension is (the weighted sum already captures that) but how many are high at once, which is what compounds. The CRM is a bounded, super-additive step function of the count, published as a table so the correction stays auditable and reproducible.
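
A minimal sketch of the count-and-lookup step. It assumes that zero or one elevated dimension maps to the 1.00 entry of the step table, and it leaves out the 1.75 cap because the table never exceeds 1.60. The function name is hypothetical.

```python
ELEVATED_THRESHOLD = 0.75

# Step table keyed by the number of simultaneously elevated dimensions.
# Assumption: counts 0 and 1 both take the 1.00 entry.
CRM_TABLE = {0: 1.00, 1: 1.00, 2: 1.15, 3: 1.35, 4: 1.60}


def compound_risk_multiplier(aa: float, as_: float, cp: float, rf: float):
    """Return (N_elevated, CRM, compound_risk_alert) for the four ORP scores."""
    n_elevated = sum(1 for score in (aa, as_, cp, rf) if score > ELEVATED_THRESHOLD)
    crm = CRM_TABLE[n_elevated]
    alert = n_elevated >= 2  # two or more raise the Compound Risk Alert
    return n_elevated, crm, alert
```

The comparison is strict, so a dimension scored exactly 0.75 does not count as elevated.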

Effective operational risk

Layer 2 · ORP

ORP_effective = (W_orp · ORP) × CRM

Basis. Combine the four dimensions under the tier weights, then amplify by the compounding multiplier — the single operational score the ERS consumes.

Variables. W_orp · ORP: the tier-weighted sum of the four dimensions; CRM: the compound multiplier.

Why this form

Separating the linear part (the weighted sum) from the interaction part (the multiplier) keeps each interpretable: the sum answers how much operational risk on average, the multiplier how strongly those risks reinforce one another. Applying CRM multiplicatively makes it a proportional surcharge that means the same thing whether the base score is high or low.
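
To make the two-part structure concrete, here is a short sketch with the linear part and the multiplier kept as separate steps. The equal tier weights and the dimension scores are placeholders, not values from the framework.

```python
from typing import Sequence


def orp_effective(orp: Sequence[float], w_orp: Sequence[float], crm: float) -> float:
    """ORP_effective = (W_orp . ORP) * CRM for (Aa, As, Cp, Rf)."""
    if len(orp) != 4 or len(w_orp) != 4:
        raise ValueError("expected four dimensions and four weights")
    weighted_sum = sum(w * x for w, x in zip(w_orp, orp))  # linear part
    return weighted_sum * crm                              # interaction part


# Hypothetical inputs: two dimensions above 0.75, so CRM = 1.15.
orp_scores = (0.9, 0.8, 0.5, 0.6)
equal_weights = (0.25, 0.25, 0.25, 0.25)
orp_eff = orp_effective(orp_scores, equal_weights, crm=1.15)  # 0.70 * 1.15
```

With these inputs the weighted sum is 0.70 and the multiplier adds a 15% surcharge, giving about 0.805.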

Evaluation Coverage

Layer 3 · ACI

Ec = Base_Coverage × Independence × Fidelity

Basis. Evaluation quality is the product of how much was tested, how independent the tester was, and how production-like the environment was — a weakness in any one drags the whole score down.

Variables. Base_Coverage: fraction of applicable sub-metrics tested; Independence: 0.60 self / 0.80 internal / 1.00 external; Fidelity: 0.70 dev / 0.85 staging / 0.95 verified / 1.00 production.

Why this form

A product encodes necessity, not trade-off: self-assessment (0.60) caps Ec at 0.60 even with full coverage in production. It is the same weakest-link logic the ACI geometric mean uses one level up, applied here to the inputs of a single component — and the multipliers are discrete, evidence-anchored levels, so the result is reproducible rather than a judgment call.
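
The weakest-link behaviour is easy to check with a short sketch. The multiplier tables are the ones listed above; the function name and dictionary keys are illustrative.

```python
INDEPENDENCE = {"self": 0.60, "internal": 0.80, "external": 1.00}
FIDELITY = {"dev": 0.70, "staging": 0.85, "verified": 0.95, "production": 1.00}


def evaluation_coverage(tested: int, applicable: int,
                        independence: str, fidelity: str) -> float:
    """Ec = Base_Coverage * Independence * Fidelity."""
    if applicable <= 0 or not 0 <= tested <= applicable:
        raise ValueError("need 0 <= tested <= applicable and applicable > 0")
    base_coverage = tested / applicable
    return base_coverage * INDEPENDENCE[independence] * FIDELITY[fidelity]


# Full coverage in production, but self-assessed: Ec is capped at 0.60.
ec_self = evaluation_coverage(21, 21, "self", "production")
```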

Temporal Freshness

Layer 3 · ACI

Tf = min(T_calendar, C_event, C_monitor, C_evidence)
T_calendar = e^(−λ_eff · Δt_days)

Basis. Freshness is the most pessimistic of a calendar decay term and three event ceilings — a base-model swap or a monitoring blackout caps it regardless of the calendar. The lowest cap wins.

Variables. T_calendar: time decay; C_event, C_monitor, C_evidence: caps for change events, monitoring gaps, and unresolved drift respectively; λ_eff = λ_tier × M_TDI × M_threat: the effective decay rate; Δt_days: days since sign-off.

Why this form

The caps are independent invalidating conditions, so min() is the conservative, non-compensatory combinator — a strong calendar score cannot paper over a model swap. Exponential decay gives a constant, interpretable half-life per tier (Tier 1 ≈ 30 days … Tier 4 ≈ 365 days), and a multiplicative λ_eff lets drift and threat accelerate ageing proportionally — a system drifting under active threat ages far faster.
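
A hedged sketch of the freshness calculation. It assumes that each cap and each modifier defaults to 1.0 when no invalidating condition applies, and it derives λ_tier from a half-life via λ = ln(2) / half-life. The cap value 0.4 in the example is invented for illustration.

```python
import math


def decay_rate_from_half_life(half_life_days: float) -> float:
    """lam such that exp(-lam * half_life_days) == 0.5."""
    return math.log(2) / half_life_days


def temporal_freshness(dt_days: float, lam_tier: float,
                       m_tdi: float = 1.0, m_threat: float = 1.0,
                       c_event: float = 1.0, c_monitor: float = 1.0,
                       c_evidence: float = 1.0) -> float:
    """Tf = min(T_calendar, C_event, C_monitor, C_evidence)."""
    lam_eff = lam_tier * m_tdi * m_threat        # drift and threat speed up ageing
    t_calendar = math.exp(-lam_eff * dt_days)
    return min(t_calendar, c_event, c_monitor, c_evidence)


lam_30 = decay_rate_from_half_life(30)                    # 30-day half-life
tf_plain = temporal_freshness(30, lam_30)                 # about 0.5
tf_drifting = temporal_freshness(30, lam_30, m_tdi=2.0)   # about 0.25
tf_capped = temporal_freshness(1, lam_30, c_event=0.4)    # cap wins: 0.4
```

The last call shows the non-compensatory behaviour: one day after sign-off the calendar term is close to 1, yet the event cap alone sets the result.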

Time Drift Index

Layer 3 · ACI

TDI = 0.25·CSD + 0.30·BOD + 0.20·DRD + 0.15·TCD + 0.10·MGD

Basis. One 0–1 number for how far the system has drifted from what was assessed, weighting observed behavioural change highest and monitoring gaps lowest. It feeds the modifier that accelerates freshness decay.

Variables. CSD: config; BOD: behaviour; DRD: data/retrieval; TCD: threat/control; MGD: monitoring gap (each 0–1).

Why this form

A weighted sum is right here — unlike the CRM — because small drifts across several categories should accumulate into a moderate index. Fixed weights summing to 1.00 keep it on [0, 1] and reproducible, and the ordering (BOD > CSD > DRD > TCD > MGD) encodes that observed behavioural change is the strongest evidence an assessment is stale, while a monitoring gap is a weaker, indirect signal.
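
The index is a plain dot product with the published weights. The sketch below uses made-up component values and a hypothetical function name. It does not model how TDI is converted into the decay modifier, since that mapping is not given here.

```python
TDI_WEIGHTS = {"CSD": 0.25, "BOD": 0.30, "DRD": 0.20, "TCD": 0.15, "MGD": 0.10}


def time_drift_index(components: dict) -> float:
    """TDI as the fixed-weight sum of the five drift components (each 0-1)."""
    if set(components) != set(TDI_WEIGHTS):
        raise ValueError("expected exactly CSD, BOD, DRD, TCD, MGD")
    for name, value in components.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1]")
    return sum(TDI_WEIGHTS[name] * components[name] for name in TDI_WEIGHTS)


# Hypothetical readings: moderate behavioural drift plus a monitoring gap.
tdi = time_drift_index({"CSD": 0.2, "BOD": 0.5, "DRD": 0.1, "TCD": 0.0, "MGD": 0.4})
```

These readings give 0.05 + 0.15 + 0.02 + 0.00 + 0.04 = 0.26: several small drifts accumulate into a moderate index.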

ACI — composite confidence

Layer 3 · ACI

ACI = (Pc × Ec × Tf)^(1/3)

Basis. Assurance is the geometric mean of the three evidence dimensions — if any one is near zero, overall confidence is near zero. Unknown provenance cannot be out-tested.

Variables. Pc: provenance; Ec: evaluation coverage; Tf: freshness (each 0–1). Weights are fixed at ⅓ each (general form Pc^w₁ · Ec^w₂ · Tf^w₃ with w₁ + w₂ + w₃ = 1).

Why this form

The geometric mean is correct for jointly necessary, non-substitutable prerequisites: it is zero whenever any factor is zero, it rewards a balanced profile over a lopsided one, and on [0, 1] it always sits at or below the arithmetic mean — the conservative choice. Equal, tier-independent weights are deliberate: deployment criticality is already carried inside Tf (its decay constant and caps), so re-weighting by tier would double-count it and break cross-assessment comparability.
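
A short sketch of the composite, with example values chosen to show the properties just listed. The function name and the inputs are illustrative.

```python
def assurance_confidence_index(pc: float, ec: float, tf: float) -> float:
    """ACI = (Pc * Ec * Tf) ** (1/3), the equal-weight geometric mean."""
    for name, value in (("Pc", pc), ("Ec", ec), ("Tf", tf)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1]")
    return (pc * ec * tf) ** (1.0 / 3.0)


balanced = assurance_confidence_index(0.6, 0.6, 0.6)   # about 0.60
lopsided = assurance_confidence_index(0.9, 0.6, 0.3)   # same arithmetic mean, lower ACI
no_provenance = assurance_confidence_index(0.0, 1.0, 1.0)  # 0.0
```

The balanced and lopsided profiles share an arithmetic mean of 0.6, but the lopsided one scores about 0.545. A zero in any factor zeroes the result.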

ERS — the Effective Risk Score

Composite

ERS = min(10,  ORP_eff × [ α + (1−α)(1 − W_ivp·IVP) ] × (1 / ACI) × S )
α = 0.15      S = 10

Basis. Start from operational risk, reduce it by intrinsic security but only down to a floor (never to zero), inflate it for low assurance confidence, scale to 0–10, and cap at 10. ORP sets the stakes, IVP mitigates, ACI adjusts for how much is actually known.

Variables. ORP_eff: effective operational score; W_ivp·IVP: intrinsic-security scalar (0–1); α = 0.15: residual floor; 1/ACI: epistemic inflation; S = 10: scale; min(·, 10): hard cap.

Why this form

The bracket α + (1−α)(1 − W_ivp·IVP) maps intrinsic security onto the [α, 1] interval — 1.0 at zero intrinsic security (full risk), α = 0.15 at perfect (15% irreducible) — encoding that intrinsic quality reduces but cannot eliminate operational risk; a zero would be a false guarantee for a probabilistic system. Dividing by ACI makes opacity inflate the score proportionally rather than be ignored. The hard cap keeps ERS bounded and stops a near-zero ACI from blowing up — which signals "assessment invalid, refresh it", not "infinitely risky". α = 0.15 is calibrated against the Finbot anchor and earmarked for sensitivity validation.
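
The composite can be sketched directly from the formula. The input values are illustrative, and treating ACI = 0 as the capped score of 10 is an assumption made here to avoid a division by zero; the text only describes the near-zero case.

```python
ALPHA = 0.15   # residual-risk floor
SCALE = 10.0   # S: maps the result onto 0-10


def effective_risk_score(orp_eff: float, ivp_scalar: float, aci: float) -> float:
    """ERS = min(10, ORP_eff * [alpha + (1 - alpha)(1 - W_ivp.IVP)] * (1/ACI) * S)."""
    if not 0.0 <= ivp_scalar <= 1.0:
        raise ValueError("W_ivp.IVP must lie in [0, 1]")
    if aci <= 0.0:
        return SCALE  # assumption: no usable evidence is reported as the cap
    residual = ALPHA + (1.0 - ALPHA) * (1.0 - ivp_scalar)  # lies in [alpha, 1]
    return min(SCALE, orp_eff * residual * (1.0 / aci) * SCALE)


floor_case = effective_risk_score(orp_eff=1.0, ivp_scalar=1.0, aci=1.0)   # 1.5
typical = effective_risk_score(orp_eff=0.805, ivp_scalar=0.6, aci=0.8)    # about 4.93
stale = effective_risk_score(orp_eff=0.5, ivp_scalar=0.5, aci=0.01)       # capped at 10
```

In the typical case the bracket is 0.15 + 0.85 × 0.4 = 0.49, so the score is 0.805 × 0.49 / 0.8 × 10, or about 4.93.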

Behavioral Baseline Deviation

Monitoring

BBD(t) = D_JS( P_baseline ‖ P_current(t) )

Basis. The statistical distance between how the system behaved when assessed and how it behaves now. Crossing thresholds (0.15 / 0.35 / 0.60) escalates from alert to mandatory reassessment to automated quarantine.

Variables. P_baseline: behaviour at assessment (decision patterns, tool-call targets, output intent, memory access); P_current(t): the same distributions now; D_JS: the Jensen–Shannon divergence.

Why this form

Jensen–Shannon divergence is chosen over Kullback–Leibler for three reasons: it is symmetric (neither distribution is privileged as "true"), it is bounded on [0, 1] when computed with base-2 logarithms, so fixed thresholds stay stable and comparable across systems, and it stays finite even when one distribution assigns zero probability to an event the other allows. That case is common when a new tool-call target or novel output appears, where KL divergence would be infinite.
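
A minimal pure-Python sketch of the divergence over discrete distributions, using base-2 logarithms so the result lies on [0, 1]. It assumes each distribution is a mapping from event to probability that sums to 1, and it treats reaching a threshold as crossing it. The tool-call names and the action labels are illustrative.

```python
import math
from typing import Mapping


def js_divergence(p: Mapping[str, float], q: Mapping[str, float]) -> float:
    """Jensen-Shannon divergence in bits; missing events count as probability 0."""
    total = 0.0
    for event in set(p) | set(q):
        pk, qk = p.get(event, 0.0), q.get(event, 0.0)
        mk = 0.5 * (pk + qk)  # mixture; positive whenever pk or qk is
        if pk > 0:
            total += 0.5 * pk * math.log2(pk / mk)
        if qk > 0:
            total += 0.5 * qk * math.log2(qk / mk)
    return total


def bbd_action(bbd: float) -> str:
    """Escalation step for a BBD value (assumption: thresholds are inclusive)."""
    if bbd >= 0.60:
        return "quarantine"
    if bbd >= 0.35:
        return "reassess"
    if bbd >= 0.15:
        return "alert"
    return "none"


baseline = {"search": 0.5, "read": 0.5}
current = {"search": 0.5, "read": 0.25, "write": 0.25}  # new tool-call target
bbd = js_divergence(baseline, current)
```

The new "write" target has zero probability in the baseline, which would make the KL divergence from current to baseline infinite. Here it yields a finite value just above the first threshold.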