
Introduction, Requirements, and Use Cases

Week 1: Tuesday 24th March
Introduction to AI model risk

  • AI model risk as a new discipline
    • Rapid adoption of AI in financial services
    • Explainability and transparency requirements
    • Principles of quantitative measurement and reporting of AI model risk based on rigorous statistical tests
  • Changes compared to traditional risk management
    • Conventional model risk management (MRM) vs. AI model risk management
      • Periodic validation vs. continuous assurance
      • Traditional backtesting vs. new validation techniques for LLMs
      • The evolving requirements for formal reporting
    • Conventional operational risk (OpRisk) vs. AI model risk
      • Challenges of adopting OpRisk metrics for AI model risk
      • Ethical and social implications of using AI in HR and other regulated contexts
    • Conventional model stress testing vs. red-team testing
      • Non-adversarial vs. adversarial stress testing approaches for AI model risk
      • Prompt injection and jailbreak testing
      • Robustness to input shifts and edge cases
  • Reporting requirements for AI model risk
    • Regulatory
      • Transparency and explainability documentation
      • Bias and fairness assessment reporting
      • Model inventory and governance oversight
    • Internal
      • The importance of quantitative risk metrics based on rigorous statistical tests
      • Continuous monitoring dashboards
      • Drift and performance degradation tracking
      • Use-case specific quantitative risk metrics and KPIs

AI workflow types

  • Assistant workflows (multi-turn chat)
    • Context management across multiple chats and conversation turns
    • Maintaining consistency and coherence
    • Handling clarification requests and corrections
  • Generation workflows (text output)
    • Structured vs. free-form text generation
    • Template-based and conditional generation
    • Quality and style consistency
  • Comprehension workflows (text input, data output)
    • Information extraction and structuring
    • Classification and categorisation
    • Data validation and quality checks (see the sketch below)
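
A minimal Python sketch of data validation in a comprehension workflow; the field names (trade_date, notional, counterparty) are assumed for the example and would be replaced by the fields of the actual use case.

    from datetime import datetime

    def validate_extraction(fields: dict) -> list[str]:
        # Basic data validation for a comprehension workflow: check types, formats
        # and ranges of model-extracted fields before accepting them downstream.
        errors = []
        try:
            datetime.strptime(fields.get("trade_date", ""), "%Y-%m-%d")
        except ValueError:
            errors.append("trade_date is not a valid YYYY-MM-DD date")
        try:
            if float(fields.get("notional", "")) <= 0:
                errors.append("notional must be positive")
        except ValueError:
            errors.append("notional is not numeric")
        if not fields.get("counterparty"):
            errors.append("counterparty is missing")
        return errors

    print(validate_extraction({"trade_date": "2026-03-24", "notional": "10000000", "counterparty": "Bank A"}))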

Selected use cases

  • Rating using numerical and category-based scales
    • Numeric scales (e.g., 1-5, 1-10, 0-100)
    • Category-based scales (e.g., poor/fair/good/excellent)
    • Likert scales (degree of agreement)
    • Binary classifications (pass/fail, yes/no)
  • Ranking using pointwise, pairwise, setwise, and listwise approaches (see the sketch after this list)
    • Reliability vs cost trade-offs
    • Computational cost and latency considerations
    • Bias mitigation requirements for each approach
  • Complex document analysis using rulebooks
    • Security prospectuses
      • Extracting terms and conditions
      • Identifying risk factors and disclosures
      • Assessing the effect of legal caveats
    • Regulatory requirements and guidelines
      • Compliance checking against rulebooks
      • Interpretation of ambiguous requirements
    • Contracts
      • Key clause identification
      • Obligation and liability extraction
    • RFP and RFI questionnaires and responses
      • Requirement matching and scoring
      • Gap analysis and compliance validation
  • Data entry from free-form text
    • Trade confirmations
      • Structured field extraction (dates, amounts, counterparties)
      • Validation against expected formats and ranges
    • Free-form emails and chats with trades and market quotes
      • Field recognition by context and position
      • Handling ambiguous semantic structure
      • Handling incomplete information
    • Detecting template-generated inputs to improve reliability
      • Pattern recognition for template-generated formats
      • Leveraging template recognition for pre-approval and accuracy improvements
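
A minimal Python sketch of the pairwise ranking approach; judge_prefers stands in for a real LLM judge call, and each pair is compared in both presentation orders to dilute position bias.

    from itertools import combinations

    def judge_prefers(item_a: str, item_b: str) -> bool:
        # Stand-in for an LLM judge call; here we simply prefer the shorter answer
        # so that the sketch runs end to end.
        return len(item_a) <= len(item_b)

    def pairwise_rank(items: list[str]) -> list[str]:
        # Round-robin tournament: each item is compared with every other item,
        # in both presentation orders.
        wins = {item: 0 for item in items}
        for a, b in combinations(items, 2):
            wins[a if judge_prefers(a, b) else b] += 1
            wins[b if judge_prefers(b, a) else a] += 1
        # More pairwise wins means a higher rank.
        return sorted(items, key=lambda item: wins[item], reverse=True)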

Practical Exercise: Building and testing AI-based workflows

The participants will build and test several AI-based multistep workflows.

Note: No coding required. The exercise will be performed using an online playground.


Week 2: Tuesday 31st March
Quantitative Management of AI Model Risk

Measuring AI model risk

  • Statistical analysis of multiple runs
    • Sample size and statistical power vs. cost
    • Specialized distribution metrics for AI – not just mean and variance
    • Confidence intervals and risk reporting (see the sketch after this list)
  • Techniques and challenges of run randomisation
    • Temperature and sampling parameter control
    • Seed randomisation and reproducibility
    • Preamble randomisation to avoid memorisation and as a seed alternative
  • Systematic vs random errors
    • Aleatoric uncertainty (inherent randomness in LLMs)
    • Epistemic uncertainty (model capability and knowledge limitations)
    • Aleatoric-epistemic decomposition in AI error analysis
  • Dealing with rare errors and thinking tangents
    • Detection methods and reporting for low-frequency errors
    • Fast vs. thinking model differences in error patterns
    • Monitoring for unexpected reasoning paths
  • Judge models
    • Independent and comparative scoring
    • Chain-of-thought prompting for judge models
    • Judge model bias detection and mitigation
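
One way to sketch the multiple-run statistics above in Python; score_once stands in for a single randomised model run, and the 99% interval uses a normal approximation.

    import random
    import statistics

    def score_once(seed: int) -> float:
        # Stand-in for one randomised model run (e.g., a judge score); replace with a real call.
        random.seed(seed)
        return random.gauss(3.5, 0.4)

    def run_statistics(n_runs: int = 30, z: float = 2.576) -> dict:
        # Collect one score per randomised run.
        scores = [score_once(seed) for seed in range(n_runs)]
        mean = statistics.mean(scores)
        stderr = statistics.stdev(scores) / n_runs ** 0.5
        # z = 2.576 gives an approximate 99% confidence interval under a normal approximation.
        return {"mean": mean, "ci_99": (mean - z * stderr, mean + z * stderr), "n": n_runs}

    print(run_statistics())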

Quantitative metrics by workflow type

  • Measuring rating stability
    • Inter-run variance and consistency metrics
    • Scale calibration and score distribution analysis
    • Central tendency and range of scores
  • Measuring ranking stability
    • Rank correlation metrics (Kendall’s tau, Spearman’s rho; see the sketch after this list)
    • Position bias detection and mitigation
    • Agreement rates across multiple runs
  • Measuring reliability of decision graph navigation for complex document analysis
    • Decision path consistency across runs
    • Node- and graph-level accuracy metrics
    • Error propagation through decision graph nodes
  • Measuring reliability of data entry for multiple-choice, numerical and other field types
    • Accuracy reporting by field type (multiple-choice, numerical, date)
    • Detection and mitigation of hallucinated optional fields
    • Detection and mitigation of deviations from the output format
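
A short Python sketch of ranking stability between two randomised runs, using Kendall's tau and Spearman's rho from scipy.stats; the two rankings are hypothetical.

    from scipy.stats import kendalltau, spearmanr

    def ranking_stability(run_a: list[str], run_b: list[str]) -> dict:
        # Express the second run's ranking as ranks relative to the first run's order.
        ranks_a = list(range(len(run_a)))
        ranks_b = [run_b.index(item) for item in run_a]
        tau, _ = kendalltau(ranks_a, ranks_b)
        rho, _ = spearmanr(ranks_a, ranks_b)
        return {"kendall_tau": tau, "spearman_rho": rho}

    # Two hypothetical rankings of the same candidates from two randomised runs.
    print(ranking_stability(["A", "B", "C", "D"], ["A", "C", "B", "D"]))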

Practical Exercise: Measuring AI model risk

The participants will perform quantitative measurement of
AI model risk in the workflows they built.

Note: No coding required. The exercise will be performed using an online playground.


Week 3: Tuesday 7th April
Mitigation of Psychological Effects and Cognitive Biases in AI Models

Psychological effects

  • Thinking fast and slow for AI
    • System 1 (fast, intuitive) vs System 2 (slow, deliberate) thinking in LLMs
    • Failures in cognitive load optimisation
    • Switching between System 1 and System 2 in fast models vs. advanced/thinking models
    • Chain-of-thought to engage System 2 reasoning
  • Semantic illusions
    • Failures in familiarity detection
    • Misleading question structure
    • Surface-level vs. deep comprehension testing
    • Model susceptibility to deliberate semantic illusions
  • Framing effects
    • Positive vs. negative framing (e.g., rate of success vs rate of failure)
    • Influencing risk-averse vs. risk-seeking behaviour in AI
    • Stress testing with logically equivalent reformulations (see the sketch after this list)
  • Priming effects
    • Influence of unrelated context on responses
    • Legal and compliance implications in HR and other regulated contexts
    • Mitigation through context randomisation
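
A minimal Python sketch of a framing stress test; rate_risk stands in for a model call, and the two prompts are logically equivalent reformulations of the same fact.

    def rate_risk(prompt: str) -> float:
        # Stand-in for a model call returning a 0-100 risk score; replace with a real call.
        return 60.0 if "fail" in prompt.lower() else 45.0

    def framing_gap(success_rate: float) -> float:
        # The two prompts are logically equivalent: 95% success is a 5% failure rate.
        positive = f"The strategy succeeded in {success_rate:.0f}% of backtests. Rate its risk (0-100)."
        negative = f"The strategy failed in {100 - success_rate:.0f}% of backtests. Rate its risk (0-100)."
        return abs(rate_risk(positive) - rate_risk(negative))

    print(framing_gap(95.0))  # a materially non-zero gap indicates framing sensitivity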

Cognitive biases

  • Confirmation bias, sycophancy, desire to please
    • Guessing and meeting user assumptions and expectations at the expense of accuracy
    • Seeking information that confirms priors
    • Advocating for perceived user interests
  • Informational anchoring
    • Misinterpretation or over-reliance on initially presented information
    • Numeric anchors affecting quantitative outputs
    • Testing with varied anchor values
  • Priming-induced anchoring
    • When to expect priming-induced anchoring effects
    • Detection and mitigation strategies
    • Meeting legal and compliance requirements in HR and other regulated contexts
  • Central tendency
    • Avoiding extreme scores on rating scales in favour of midrange values
    • Reduction in ranking stability due to the variable degree of central tendency
    • Few-shot and other methods to reduce and stabilise effects of central tendency
  • Position bias
    • Favouring first- or last-presented items in multi-item evaluations
    • Position swap testing methodology and metrics (see the sketch below)
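
A short Python sketch of position swap testing; pick_best stands in for a model call, and the flip rate measures how often the outcome depends on presentation order.

    from itertools import combinations

    def pick_best(option_a: str, option_b: str) -> str:
        # Stand-in for a model call choosing the better option; replace with a real call.
        return option_a  # an always-first policy illustrates maximal position bias

    def position_flip_rate(options: list[str]) -> float:
        pairs = list(combinations(options, 2))
        flips = 0
        for a, b in pairs:
            # Evaluate the pair in both presentation orders.
            if pick_best(a, b) != pick_best(b, a):
                flips += 1
        # 0.0 means order never changes the outcome; 1.0 means it always does.
        return flips / len(pairs)

    print(position_flip_rate(["candidate 1", "candidate 2", "candidate 3"]))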

Practical Exercise: Identifying and mitigating cognitive biases

The participants will identify and mitigate cognitive biases
affecting the workflows they built.

Note: No coding required. The exercise will be performed using an online playground.


Week 4: Tuesday 14th April
Improving Reliability of AI-Based Workflows

Key causes of uncertainty

  • Aleatoric vs epistemic uncertainty
    • Inherent data randomness vs. model knowledge limitations
    • Measurement approaches and mitigation strategies for each type (see the sketch after this list)
  • Hallucinations due to the lack of grounding
    • Importance of grounding from external knowledge sources
    • Assuming facts learned from common patterns in training data
    • Detecting confident but incorrect responses
  • Psychological effects and cognitive biases
    • Impact on reliability and consistency metrics
    • Systematic vs. random bias-induced error patterns
  • Thinking tangents
    • Unexpected reasoning paths in complex rules
    • Prevention, detection and correction of undesirable thinking tangents
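
One heuristic Python sketch of aleatoric-epistemic decomposition: within-prompt variance across repeated runs proxies sampling noise, while variance of the means across logically equivalent paraphrases proxies epistemic sensitivity; answer_score stands in for a real model call.

    import statistics

    def answer_score(prompt: str, run: int) -> float:
        # Stand-in for one randomised model run returning a numeric answer; replace with a real call.
        base = {"paraphrase A": 3.2, "paraphrase B": 3.6, "paraphrase C": 3.4}[prompt]
        return base + 0.1 * ((run * 7919) % 5 - 2)  # deterministic pseudo-noise so the sketch runs

    def variance_decomposition(paraphrases: list[str], runs_per_prompt: int = 10) -> dict:
        # Law of total variance: within-prompt spread proxies aleatoric (sampling) noise,
        # between-paraphrase spread of the means proxies epistemic sensitivity to wording.
        per_prompt = [[answer_score(p, r) for r in range(runs_per_prompt)] for p in paraphrases]
        within = statistics.mean(statistics.pvariance(scores) for scores in per_prompt)
        between = statistics.pvariance([statistics.mean(scores) for scores in per_prompt])
        return {"aleatoric_proxy": within, "epistemic_proxy": between}

    print(variance_decomposition(["paraphrase A", "paraphrase B", "paraphrase C"]))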

Mitigation by prompt and workflow design

  • Challenger models
    • Using alternative models for validation
    • Cross-model consistency checks
    • Identifying model-specific biases and errors
  • Effective grounding
    • Using retrieval-augmented generation (RAG) effectively
    • Best practices for using and creating model context protocol (MCP) servers
    • Using conventional (non-MCP) knowledge bases
    • Citation and web search source tracking
  • Multistep workflows and decision graphs (rulebooks)
    • Breaking complex tasks into manageable steps using rulebooks
    • Conditional logic and branching paths
    • Error detection and recovery mechanisms for complex rulebooks
  • Dynamic few-shot
    • Selecting relevant examples at runtime
    • Similarity ranking-based example retrieval (reverse lookup; see the sketch after this list)
    • Identifying and addressing gaps in curated few-shot examples
  • Corrective few-shot
    • Learning from mistakes and failure cases
    • Negative examples and counterexamples
    • Iterative refinement and improvement
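
A minimal Python sketch of similarity ranking-based example retrieval for dynamic few-shot prompting; embed stands in for a real embedding model, and the curated examples are hypothetical.

    import math

    def embed(text: str) -> list[float]:
        # Stand-in for an embedding model: a crude character-frequency vector so the sketch runs.
        return [text.lower().count(c) for c in "abcdefghijklmnopqrstuvwxyz"]

    def cosine(u: list[float], v: list[float]) -> float:
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm if norm else 0.0

    def select_examples(query: str, curated: list[dict], k: int = 3) -> list[dict]:
        # Rank curated few-shot examples by similarity to the incoming query (reverse lookup).
        q = embed(query)
        scored = sorted(curated, key=lambda ex: cosine(q, embed(ex["input"])), reverse=True)
        return scored[:k]

    examples = [{"input": "FX swap trade confirmation", "output": "..."},
                {"input": "Equity option quote in chat", "output": "..."},
                {"input": "Bond prospectus risk factors", "output": "..."}]
    print(select_examples("confirm EUR/USD swap trade", examples, k=2))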

Mitigation by Monte Carlo

  • Random sampling across multiple randomised AI model runs as a powerful way to improve AI workflow reliability
    • Seed and prefix-based randomisation
    • Achieving statistical confidence at a predefined threshold (e.g., 99% confidence)
  • Voting across multiple runs for multiple-choice outputs (see the sketch after this list)
    • Majority voting and consensus mechanisms
    • Weighted voting based on confidence scores
    • Handling approximate ties and low-confidence cases
  • Crowdsourcing across multiple runs for numerical and other continuous-scale outputs
    • Effective aggregation techniques in the presence of outliers
    • Distribution-agnostic confidence intervals under epistemic and aleatoric uncertainty
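
A short Python sketch of Monte Carlo aggregation: majority voting for a multiple-choice output, and a median with a distribution-agnostic bootstrap confidence interval for a numerical output; the per-run answers are assumed to come from randomised runs.

    import random
    import statistics
    from collections import Counter

    def majority_vote(answers: list[str]) -> tuple[str, float]:
        # Consensus answer across runs, with its vote share as a simple confidence proxy.
        counts = Counter(answers)
        answer, votes = counts.most_common(1)[0]
        return answer, votes / len(answers)

    def median_with_bootstrap_ci(values: list[float], level: float = 0.99, n_boot: int = 2000):
        # The median is robust to outlier runs; the bootstrap interval makes no distributional assumption.
        medians = sorted(statistics.median(random.choices(values, k=len(values))) for _ in range(n_boot))
        lo = medians[int((1 - level) / 2 * n_boot)]
        hi = medians[int((1 + level) / 2 * n_boot) - 1]
        return statistics.median(values), (lo, hi)

    print(majority_vote(["B", "B", "A", "B", "C"]))
    print(median_with_bootstrap_ci([101.2, 100.8, 101.0, 135.0, 100.9, 101.1]))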

Practical Exercise: Using voting and crowdsourcing to improve reliability

The participants will use voting and crowdsourcing to improve the reliability of the workflows they built.

Note: No coding required. The exercise will be performed using an online playground.

Discount Structure

  • Early bird discount: 10% until 27th February 2026
