
Introduction, Requirements, and Use Cases

Week 1: Tuesday 24th March
Introduction to AI model risk

  • AI model risk as a new discipline
    • Rapid adoption of AI in financial services
    • Explainability and transparency requirements
    • Principles of quantitative measurement and reporting of AI model risk based on rigorous statistical tests
  • Changes compared to traditional risk management
    • Conventional model risk management (MRM) vs. AI model risk management
      • Periodic validation vs. continuous assurance
      • Traditional backtesting vs. new validation techniques for LLMs
      • The evolving requirements for formal reporting
    • Conventional operational risk (OpRisk) vs. AI model risk
      • Challenges of adopting OpRisk metrics for AI model risk
      • Ethical and social implications of using AI in HR and other regulated contexts
    • Conventional model stress testing vs. red-team testing
      • Non-adversarial vs. adversarial stress testing approaches for AI model risk
      • Prompt injection and jailbreak testing
      • Robustness to input shifts and edge cases
  • Reporting requirements for AI model risk
    • Regulatory
      • Transparency and explainability documentation
      • Bias and fairness assessment reporting
      • Model inventory and governance oversight
    • Internal
      • The importance of quantitative risk metrics based on rigorous statistical tests
      • Continuous monitoring dashboards
      • Drift and performance degradation tracking
      • Use-case specific quantitative risk metrics and KPIs

AI workflow types

  • Assistant workflows (multi-turn chat)
    • Context management across multiple chats and conversation turns
    • Maintaining consistency and coherence
    • Handling clarification requests and corrections
  • Generation workflows (text output)
    • Structured vs. free-form text generation
    • Template-based and conditional generation
    • Quality and style consistency
  • Comprehension workflows (text input, data output)
    • Information extraction and structuring
    • Classification and categorisation
    • Data validation and quality checks (see the sketch below)
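
A minimal Python sketch of data validation in a comprehension workflow; the field names (trade_date, notional, counterparty) are assumed for the example and would be replaced by the fields of the actual use case.

    from datetime import datetime

    def validate_extraction(fields: dict) -> list[str]:
        # Basic data validation for a comprehension workflow: check types, formats
        # and ranges of model-extracted fields before accepting them downstream.
        errors = []
        try:
            datetime.strptime(fields.get("trade_date", ""), "%Y-%m-%d")
        except ValueError:
            errors.append("trade_date is not a valid YYYY-MM-DD date")
        try:
            if float(fields.get("notional", "")) <= 0:
                errors.append("notional must be positive")
        except ValueError:
            errors.append("notional is not numeric")
        if not fields.get("counterparty"):
            errors.append("counterparty is missing")
        return errors

    print(validate_extraction({"trade_date": "2026-03-24", "notional": "10000000", "counterparty": "Bank A"}))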

Selected use cases

  • Rating using numerical and category-based scales
    • Numeric scales (e.g., 1-5, 1-10, 0-100)
    • Category-based scales (e.g., poor/fair/good/excellent)
    • Likert scales (degree of agreement)
    • Binary classifications (pass/fail, yes/no)
  • Ranking using pointwise, pairwise, setwise, and listwise approaches (see the sketch after this list)
    • Reliability vs cost trade-offs
    • Computational cost and latency considerations
    • Bias mitigation requirements for each approach
  • Complex document analysis using rulebooks
    • Security prospectuses
      • Extracting terms and conditions
      • Identifying risk factors and disclosures
      • Assessing the effect of legal caveats
    • Regulatory requirements and guidelines
      • Compliance checking against rulebooks
      • Interpretation of ambiguous requirements
    • Contracts
      • Key clause identification
      • Obligation and liability extraction
    • RFP and RFI questionnaires and responses
      • Requirement matching and scoring
      • Gap analysis and compliance validation
  • Data entry from free-form text
    • Trade confirmations
      • Structured field extraction (dates, amounts, counterparties)
      • Validation against expected formats and ranges
    • Free-form emails and chats with trades and market quotes
      • Field recognition by context and position
      • Handling ambiguous semantic structure
      • Handling incomplete information
    • Detecting template-generated inputs to improve reliability
      • Pattern recognition for template-generated formats
      • Leveraging template recognition for pre-approval and accuracy improvements
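
A minimal Python sketch of the pairwise ranking approach; judge_prefers stands in for a real LLM judge call, and each pair is compared in both presentation orders to dilute position bias.

    from itertools import combinations

    def judge_prefers(item_a: str, item_b: str) -> bool:
        # Stand-in for an LLM judge call; here we simply prefer the shorter answer
        # so that the sketch runs end to end.
        return len(item_a) <= len(item_b)

    def pairwise_rank(items: list[str]) -> list[str]:
        # Round-robin tournament: each item is compared with every other item,
        # in both presentation orders.
        wins = {item: 0 for item in items}
        for a, b in combinations(items, 2):
            wins[a if judge_prefers(a, b) else b] += 1
            wins[b if judge_prefers(b, a) else a] += 1
        # More pairwise wins means a higher rank.
        return sorted(items, key=lambda item: wins[item], reverse=True)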

Practical Exercise: Building and testing AI-based workflows

The participants will build and test several AI-based multistep workflows.

Note: No coding required. The exercise will be performed using an online playground.


Week 2: Tuesday 31st March
Quantitative Management of AI Model Risk

Measuring AI model risk

  • Statistical analysis of multiple runs
    • Sample size and statistical power vs. cost
    • Specialized distribution metrics for AI – not just mean and variance
    • Confidence intervals and risk reporting (see the sketch after this list)
  • Techniques and challenges of run randomisation
    • Temperature and sampling parameter control
    • Seed randomisation and reproducibility
    • Preamble randomisation to avoid memorisation and as a seed alternative
  • Systematic vs random errors
    • Aleatoric uncertainty (inherent randomness in LLMs)
    • Epistemic uncertainty (model capability and knowledge limitations)
    • Aleatoric-epistemic decomposition in AI error analysis
  • Dealing with rare errors and thinking tangents
    • Detection methods and reporting for low-frequency errors
    • Fast vs. thinking model differences in error patterns
    • Monitoring for unexpected reasoning paths
  • Judge models
    • Independent and comparative scoring
    • Chain-of-thought prompting for judge models
    • Judge model bias detection and mitigation
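
One way to sketch the multiple-run statistics above in Python; score_once stands in for a single randomised model run, and the 99% interval uses a normal approximation.

    import random
    import statistics

    def score_once(seed: int) -> float:
        # Stand-in for one randomised model run (e.g., a judge score); replace with a real call.
        random.seed(seed)
        return random.gauss(3.5, 0.4)

    def run_statistics(n_runs: int = 30, z: float = 2.576) -> dict:
        # Collect one score per randomised run.
        scores = [score_once(seed) for seed in range(n_runs)]
        mean = statistics.mean(scores)
        stderr = statistics.stdev(scores) / n_runs ** 0.5
        # z = 2.576 gives an approximate 99% confidence interval under a normal approximation.
        return {"mean": mean, "ci_99": (mean - z * stderr, mean + z * stderr), "n": n_runs}

    print(run_statistics())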

Quantitative metrics by workflow type

  • Measuring rating stability
    • Inter-run variance and consistency metrics
    • Scale calibration and score distribution analysis
    • Central tendency and range of scores
  • Measuring ranking stability
    • Rank correlation metrics (Kendall’s tau, Spearman’s rho; see the sketch after this list)
    • Position bias detection and mitigation
    • Agreement rates across multiple runs
  • Measuring reliability of decision graph navigation for complex document analysis
    • Decision path consistency across runs
    • Node- and graph-level accuracy metrics
    • Error propagation through decision graph nodes
  • Measuring reliability of data entry for multiple-choice, numerical and other field types
    • Accuracy reporting by field type (multiple-choice, numerical, date)
    • Detection and mitigation of hallucinated optional fields
    • Detection and mitigation of deviations from the output format
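
A short Python sketch of ranking stability between two randomised runs, using Kendall's tau and Spearman's rho from scipy.stats; the two rankings are hypothetical.

    from scipy.stats import kendalltau, spearmanr

    def ranking_stability(run_a: list[str], run_b: list[str]) -> dict:
        # Express the second run's ranking as ranks relative to the first run's order.
        ranks_a = list(range(len(run_a)))
        ranks_b = [run_b.index(item) for item in run_a]
        tau, _ = kendalltau(ranks_a, ranks_b)
        rho, _ = spearmanr(ranks_a, ranks_b)
        return {"kendall_tau": tau, "spearman_rho": rho}

    # Two hypothetical rankings of the same candidates from two randomised runs.
    print(ranking_stability(["A", "B", "C", "D"], ["A", "C", "B", "D"]))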

Practical Exercise: Measuring AI model risk

The participants will perform quantitative measurement of
AI model risk in the workflows they built.

Note: No coding required. The exercise will be performed using an online playground.


Week 3: Tuesday 7th April
Mitigation of Psychological Effects and Cognitive Biases in AI Models

Psychological effects

  • Thinking fast and slow for AI
    • System 1 (fast, intuitive) vs System 2 (slow, deliberate) thinking in LLMs
    • Failures in cognitive load optimisation
    • Switching between System 1 and System 2 in fast models vs. advanced/thinking models
    • Chain-of-thought to engage System 2 reasoning
  • Semantic illusions
    • Failures in familiarity detection
    • Misleading question structure
    • Surface-level vs. deep comprehension testing
    • Model susceptibility to deliberate semantic illusions
  • Framing effects
    • Positive vs. negative framing (e.g., rate of success vs rate of failure)
    • Influencing risk-averse vs. risk-seeking behaviour in AI
    • Stress testing with logically equivalent reformulations (see the sketch after this list)
  • Priming effects
    • Influence of unrelated context on responses
    • Legal and compliance implications in HR and other regulated contexts
    • Mitigation through context randomisation
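
A minimal Python sketch of a framing stress test; rate_risk stands in for a model call, and the two prompts are logically equivalent reformulations of the same fact.

    def rate_risk(prompt: str) -> float:
        # Stand-in for a model call returning a 0-100 risk score; replace with a real call.
        return 60.0 if "fail" in prompt.lower() else 45.0

    def framing_gap(success_rate: float) -> float:
        # The two prompts are logically equivalent: 95% success is a 5% failure rate.
        positive = f"The strategy succeeded in {success_rate:.0f}% of backtests. Rate its risk (0-100)."
        negative = f"The strategy failed in {100 - success_rate:.0f}% of backtests. Rate its risk (0-100)."
        return abs(rate_risk(positive) - rate_risk(negative))

    print(framing_gap(95.0))  # a materially non-zero gap indicates framing sensitivity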

Cognitive biases

  • Confirmation bias, sycophancy, desire to please
    • Guessing and meeting user assumptions and expectations at the expense of accuracy
    • Seeking information that confirms priors
    • Advocating for perceived user interests
  • Informational anchoring
    • Misinterpretation or over-reliance on initially presented information
    • Numeric anchors affecting quantitative outputs
    • Testing with varied anchor values
  • Priming-induced anchoring
    • When to expect priming-induced anchoring effects
    • Detection and mitigation strategies
    • Meeting legal and compliance requirements in HR and other regulated contexts
  • Central tendency
    • Avoiding extreme scores on rating scales in favour of midrange values
    • Reduction in ranking stability due to the variable degree of central tendency
    • Few-shot and other methods to reduce and stabilise effects of central tendency
  • Position bias
    • Favouring first- or last-presented items in multi-item evaluations
    • Position swap testing methodology and metrics (see the sketch below)
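
A short Python sketch of position swap testing; pick_best stands in for a model call, and the flip rate measures how often the outcome depends on presentation order.

    from itertools import combinations

    def pick_best(option_a: str, option_b: str) -> str:
        # Stand-in for a model call choosing the better option; replace with a real call.
        return option_a  # an always-first policy illustrates maximal position bias

    def position_flip_rate(options: list[str]) -> float:
        pairs = list(combinations(options, 2))
        flips = 0
        for a, b in pairs:
            # Evaluate the pair in both presentation orders.
            if pick_best(a, b) != pick_best(b, a):
                flips += 1
        # 0.0 means order never changes the outcome; 1.0 means it always does.
        return flips / len(pairs)

    print(position_flip_rate(["candidate 1", "candidate 2", "candidate 3"]))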

Practical Exercise: Identifying and mitigating cognitive biases

The participants will identify and mitigate cognitive biases
affecting the workflows they built.

Note: No coding required. The exercise will be performed using an online playground.


Week 4: Tuesday 14th April
Improving Reliability of AI-Based Workflows

Key causes of uncertainty

  • Aleatoric vs epistemic uncertainty
    • Inherent data randomness vs. model knowledge limitations
    • Measurement approaches and mitigation strategies for each type (see the sketch after this list)
  • Hallucinations due to the lack of grounding
    • Importance of grounding from external knowledge sources
    • Assuming facts learned from common patterns in training data
    • Detecting confident but incorrect responses
  • Psychological effects and cognitive biases
    • Impact on reliability and consistency metrics
    • Systematic vs. random bias-induced error patterns
  • Thinking tangents
    • Unexpected reasoning paths in complex rules
    • Prevention, detection and correction of undesirable thinking tangents
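
One heuristic Python sketch of aleatoric-epistemic decomposition: within-prompt variance across repeated runs proxies sampling noise, while variance of the means across logically equivalent paraphrases proxies epistemic sensitivity; answer_score stands in for a real model call.

    import statistics

    def answer_score(prompt: str, run: int) -> float:
        # Stand-in for one randomised model run returning a numeric answer; replace with a real call.
        base = {"paraphrase A": 3.2, "paraphrase B": 3.6, "paraphrase C": 3.4}[prompt]
        return base + 0.1 * ((run * 7919) % 5 - 2)  # deterministic pseudo-noise so the sketch runs

    def variance_decomposition(paraphrases: list[str], runs_per_prompt: int = 10) -> dict:
        # Law of total variance: within-prompt spread proxies aleatoric (sampling) noise,
        # between-paraphrase spread of the means proxies epistemic sensitivity to wording.
        per_prompt = [[answer_score(p, r) for r in range(runs_per_prompt)] for p in paraphrases]
        within = statistics.mean(statistics.pvariance(scores) for scores in per_prompt)
        between = statistics.pvariance([statistics.mean(scores) for scores in per_prompt])
        return {"aleatoric_proxy": within, "epistemic_proxy": between}

    print(variance_decomposition(["paraphrase A", "paraphrase B", "paraphrase C"]))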

Mitigation by prompt and workflow design

  • Challenger models
    • Using alternative models for validation
    • Cross-model consistency checks
    • Identifying model-specific biases and errors
  • Effective grounding
    • Using retrieval-augmented generation (RAG) effectively
    • Best practices for using and creating model context protocol (MCP) servers
    • Using conventional (non-MCP) knowledge bases
    • Citation and web search source tracking
  • Multistep workflows and decision graphs (rulebooks)
    • Breaking complex tasks into manageable steps using rulebooks
    • Conditional logic and branching paths
    • Error detection and recovery mechanisms for complex rulebooks
  • Dynamic few-shot
    • Selecting relevant examples at runtime
    • Similarity ranking-based example retrieval (reverse lookup; see the sketch after this list)
    • Identifying and addressing gaps in curated few-shot examples
  • Corrective few-shot
    • Learning from mistakes and failure cases
    • Negative examples and counterexamples
    • Iterative refinement and improvement
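
A minimal Python sketch of similarity ranking-based example retrieval for dynamic few-shot prompting; embed stands in for a real embedding model, and the curated examples are hypothetical.

    import math

    def embed(text: str) -> list[float]:
        # Stand-in for an embedding model: a crude character-frequency vector so the sketch runs.
        return [text.lower().count(c) for c in "abcdefghijklmnopqrstuvwxyz"]

    def cosine(u: list[float], v: list[float]) -> float:
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm if norm else 0.0

    def select_examples(query: str, curated: list[dict], k: int = 3) -> list[dict]:
        # Rank curated few-shot examples by similarity to the incoming query (reverse lookup).
        q = embed(query)
        scored = sorted(curated, key=lambda ex: cosine(q, embed(ex["input"])), reverse=True)
        return scored[:k]

    examples = [{"input": "FX swap trade confirmation", "output": "..."},
                {"input": "Equity option quote in chat", "output": "..."},
                {"input": "Bond prospectus risk factors", "output": "..."}]
    print(select_examples("confirm EUR/USD swap trade", examples, k=2))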

Mitigation by Monte Carlo

  • Random sampling across multiple randomised AI model runs as a powerful way to improve AI workflow reliability
    • Seed and prefix-based randomisation
    • Achieving statistical confidence at a predefined threshold (e.g., 99% confidence)
  • Voting across multiple runs for multiple-choice outputs (see the sketch after this list)
    • Majority voting and consensus mechanisms
    • Weighted voting based on confidence scores
    • Handling approximate ties and low-confidence cases
  • Crowdsourcing across multiple runs for numerical and other continuous-scale outputs
    • Effective aggregation techniques in the presence of outliers
    • Distribution-agnostic confidence intervals under epistemic and aleatoric uncertainty
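
A short Python sketch of Monte Carlo aggregation: majority voting for a multiple-choice output, and a median with a distribution-agnostic bootstrap confidence interval for a numerical output; the per-run answers are assumed to come from randomised runs.

    import random
    import statistics
    from collections import Counter

    def majority_vote(answers: list[str]) -> tuple[str, float]:
        # Consensus answer across runs, with its vote share as a simple confidence proxy.
        counts = Counter(answers)
        answer, votes = counts.most_common(1)[0]
        return answer, votes / len(answers)

    def median_with_bootstrap_ci(values: list[float], level: float = 0.99, n_boot: int = 2000):
        # The median is robust to outlier runs; the bootstrap interval makes no distributional assumption.
        medians = sorted(statistics.median(random.choices(values, k=len(values))) for _ in range(n_boot))
        lo = medians[int((1 - level) / 2 * n_boot)]
        hi = medians[int((1 + level) / 2 * n_boot) - 1]
        return statistics.median(values), (lo, hi)

    print(majority_vote(["B", "B", "A", "B", "C"]))
    print(median_with_bootstrap_ci([101.2, 100.8, 101.0, 135.0, 100.9, 101.1]))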

Practical Exercise: Using voting and crowdsourcing to improve reliability

The participants will use voting and crowdsourcing to improve the reliability of the workflows they built.

Note: No coding required. The exercise will be performed using an online playground.

Discount Structure

  • Early bird discount: 10% until 27th February 2026
