Introduction, Requirements, and Use Cases
Week 1: Tuesday 24th March
Introduction to AI model risk
- AI model risk as a new discipline
- Rapid adoption of AI in financial services
- Explainability and transparency requirements
- Principles of quantitative measurement and reporting of AI model risk based on rigorous statistical tests
- Changes compared to traditional risk management
- Conventional model risk management (MRM) vs. AI model management
- Periodic validation vs. continuous assurance
- Traditional backtesting vs. new validation techniques for LLMs
- The evolving requirements for formal reporting
- Conventional operational risk (OpRisk) vs. AI model risk
- Challenges of adopting OpRisk metrics for AI model risk
- Ethical and social implications of using AI in HR and other regulated contexts
- Conventional model stress testing vs. red-team testing
- Non-adversarial vs. adversarial stress testing approaches for AI model risk
- Prompt injection and jailbreak testing
- Robustness to input shifts and edge cases
- Reporting requirements for AI model risk
- Regulatory
- Transparency and explainability documentation
- Bias and fairness assessment reporting
- Model inventory and governance oversight
- Internal
- The importance of quantitative risk metrics based on rigorous statistical tests
- Continuous monitoring dashboards
- Drift and performance degradation tracking
- Use-case specific quantitative risk metrics and KPIs
AI workflow types
- Assistant workflows (multi-turn chat)
- Context management across multiple chats and conversation turns
- Maintaining consistency and coherence
- Handling clarification requests and corrections
- Generation workflows (text output)
- Structured vs. free-form text generation
- Template-based and conditional generation
- Quality and style consistency
- Comprehension workflows (text input, data output)
- Information extraction and structuring
- Classification and categorisation
- Data validation and quality checks
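As a concrete illustration of the comprehension items above, the sketch below validates a model's structured output before it enters downstream systems. It is a minimal sketch, not a reference implementation: the field names, allowed labels, and JSON shape are illustrative assumptions, and the model call itself is assumed to have already happened.

```python
import json

# Illustrative schema for a comprehension workflow: the model classifies a
# client message and extracts a numeric amount. Field names are hypothetical.
ALLOWED_LABELS = {"trade_confirmation", "market_quote", "other"}

def validate_extraction(raw_json: str) -> dict:
    """Parse and validate a model's structured output before downstream use."""
    record = json.loads(raw_json)  # raises ValueError on malformed JSON
    errors = []
    label = record.get("label")
    if label not in ALLOWED_LABELS:
        errors.append(f"unexpected label: {label!r}")
    amount = record.get("amount")
    if amount is not None and not isinstance(amount, (int, float)):
        errors.append(f"amount is not numeric: {amount!r}")
    if errors:
        raise ValueError("; ".join(errors))
    return record

# Example with an illustrative model response
print(validate_extraction('{"label": "market_quote", "amount": 101.25}'))
```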
Selected use cases
- Rating using numerical and category-based scales
- Numeric scales (e.g., 1-5, 1-10, 0-100)
- Category-based scales (e.g., poor/fair/good/excellent)
- Likert scales (degree of agreement)
- Binary classifications (pass/fail, yes/no)
- Ranking using pointwise, pairwise, setwise, and listwise approaches
- Reliability vs cost trade-offs
- Computational cost and latency considerations
- Bias mitigation requirements for each approach
- Complex document analysis using rulebooks
- Security prospectuses
- Extracting terms and conditions
- Identifying risk factors and disclosures
- Assessing the effect of legal caveats
- Regulatory requirements and guidelines
- Compliance checking against rulebooks
- Interpretation of ambiguous requirements
- Contracts
- Key clause identification
- Obligation and liability extraction
- RFP and RFI questionnaires and responses
- Requirement matching and scoring
- Gap analysis and compliance validation
- Data entry from free-form text
- Trade confirmations
- Structured field extraction (dates, amounts, counterparties)
- Validation against expected formats and ranges
- Free-form emails and chats with trades and market quotes
- Field recognition by context and position
- Handling ambiguous semantic structure
- Handling incomplete information
- Detecting template-generated inputs to improve reliability
- Pattern recognition for template-generated formats
- Leveraging template recognition for pre-approval and accuracy improvements
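A minimal sketch of the "validation against expected formats and ranges" step for trade-confirmation data entry. It assumes the fields have already been extracted into a dictionary; the field names, date format, counterparty list, and notional cap are illustrative assumptions rather than course-prescribed values.

```python
from datetime import datetime
from decimal import Decimal, InvalidOperation

# Hypothetical expectations for extracted trade-confirmation fields.
KNOWN_COUNTERPARTIES = {"BANK_A", "BANK_B", "BROKER_C"}  # illustrative list
MAX_NOTIONAL = Decimal("1e9")                            # illustrative cap

def validate_trade_fields(fields: dict) -> list[str]:
    """Return a list of validation issues for one extracted trade record."""
    issues = []
    try:
        datetime.strptime(fields.get("trade_date", ""), "%Y-%m-%d")
    except ValueError:
        issues.append(f"trade_date not in YYYY-MM-DD format: {fields.get('trade_date')!r}")
    try:
        notional = Decimal(str(fields.get("notional", "")))
        if not Decimal("0") < notional <= MAX_NOTIONAL:
            issues.append(f"notional outside expected range: {notional}")
    except InvalidOperation:
        issues.append(f"notional is not a number: {fields.get('notional')!r}")
    if fields.get("counterparty") not in KNOWN_COUNTERPARTIES:
        issues.append(f"unknown counterparty: {fields.get('counterparty')!r}")
    return issues

# Example with an illustrative extracted record
print(validate_trade_fields({"trade_date": "2025-03-24", "notional": "2500000", "counterparty": "BANK_A"}))
```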
Practical Exercise: Building and testing AI-based workflows
The participants will build and test several AI-based multistep workflows.
Note: No coding required. The exercise will be performed using an online playground.
Week 2: Tuesday 31st March
Quantitative Management of AI Model Risk
Measuring AI model risk
- Statistical analysis of multiple runs
- Sample size and statistical power vs. cost
- Specialized distribution metrics for AI – not just mean and variance
- Confidence intervals and risk reporting (see the sketch below)
- Techniques and challenges of run randomisation
- Temperature and sampling parameter control
- Seed randomisation and reproducibility
- Preamble randomisation to avoid memorisation and as a seed alternative
- Systematic vs random errors
- Aleatoric uncertainty (inherent randomness in LLMs)
- Epistemic uncertainty (model capability and knowledge limitations)
- Aleatoric-epistemic decomposition in AI error analysis
- Dealing with rare errors and thinking tangents
- Detection methods and reporting for low-frequency errors
- Fast vs. thinking model differences in error patterns
- Monitoring for unexpected reasoning paths
- Judge models
- Independent and comparative scoring
- Chain-of-thought prompting for judge models
- Judge model bias detection and mitigation
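The sketch below illustrates the multi-run statistical analysis and confidence-interval reporting referenced earlier in this list. A percentile bootstrap over scores from repeated randomised runs is one distribution-agnostic choice, not necessarily the only method covered; the run scores shown are illustrative.

```python
import random
import statistics

def bootstrap_ci(scores, confidence=0.95, n_boot=10_000, seed=0):
    """Percentile-bootstrap confidence interval for the mean score across runs."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_boot):
        resample = [rng.choice(scores) for _ in scores]
        means.append(statistics.fmean(resample))
    means.sort()
    lo = means[int((1 - confidence) / 2 * n_boot)]
    hi = means[int((1 + confidence) / 2 * n_boot) - 1]
    return statistics.fmean(scores), (lo, hi)

# Illustrative scores from 20 randomised runs of the same workflow step
run_scores = [0.82, 0.79, 0.85, 0.80, 0.78, 0.84, 0.81, 0.83, 0.77, 0.86,
              0.80, 0.82, 0.79, 0.84, 0.81, 0.83, 0.80, 0.85, 0.78, 0.82]
mean, (lo, hi) = bootstrap_ci(run_scores)
print(f"mean={mean:.3f}, 95% CI=({lo:.3f}, {hi:.3f})")
```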
Quantitative metrics by workflow type
- Measuring rating stability
- Inter-run variance and consistency metrics
- Scale calibration and score distribution analysis
- Central tendency and range of scores
- Measuring ranking stability
- Rank correlation metrics (Kendall’s tau, Spearman’s rho), illustrated in the sketch below
- Position bias detection and mitigation
- Agreement rates across multiple runs
- Measuring reliability of decision graph navigation for complex document analysis
- Decision path consistency across runs
- Node- and graph-level accuracy metrics
- Error propagation through decision graph nodes
- Measuring reliability of data entry for multiple-choice, numerical and other field types
- Accuracy reporting by field type (multiple-choice, numerical, date)
- Detection and mitigation of hallucinated optional fields
- Detection and mitigation of deviations from the output format
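The sketch below illustrates the ranking-stability metrics referenced earlier in this list, computing average pairwise Kendall's tau and Spearman's rho across repeated runs. It assumes every run ranks the same candidate set and uses scipy.stats as one common implementation choice; the example ranks are illustrative.

```python
from itertools import combinations
from scipy.stats import kendalltau, spearmanr

def ranking_stability(rankings):
    """Average pairwise rank correlation across repeated runs.

    `rankings` is a list of runs; each run assigns a rank position to the
    same candidates, listed in the same candidate order.
    """
    taus, rhos = [], []
    for a, b in combinations(rankings, 2):
        tau, _ = kendalltau(a, b)
        rho, _ = spearmanr(a, b)
        taus.append(tau)
        rhos.append(rho)
    return sum(taus) / len(taus), sum(rhos) / len(rhos)

# Illustrative ranks for 5 candidates over 3 randomised runs
runs = [
    [1, 2, 3, 4, 5],
    [1, 3, 2, 4, 5],
    [2, 1, 3, 5, 4],
]
tau, rho = ranking_stability(runs)
print(f"mean Kendall tau={tau:.2f}, mean Spearman rho={rho:.2f}")
```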
Practical Exercise: Measuring AI model risk
The participants will perform quantitative measurement of AI model risk in the workflows they built.
Note: No coding required. The exercise will be performed using an online playground.
Week 3: Tuesday 7th April
Mitigation of Psychological Effects and Cognitive Biases in AI Models
Psychological effects
- Thinking fast and slow for AI
- System 1 (fast, intuitive) vs System 2 (slow, deliberate) thinking in LLMs
- Failures in cognitive load optimisation
- Switching between System 1 and System 2 in fast models vs. advanced/thinking models
- Chain-of-thought to engage System 2 reasoning
- Semantic illusions
- Failures in familiarity detection
- Misleading question structure
- Surface-level vs. deep comprehension testing
- Model susceptibility to deliberate semantic illusions
- Framing effects
- Positive vs. negative framing (e.g., rate of success vs rate of failure)
- Influencing risk-averse vs. risk-seeking behaviour in AI
- Stress testing with logically equivalent reformulations (see the sketch below)
- Priming effects
- Influence of unrelated context on responses
- Legal and compliance implications in HR and other regulated contexts
- Mitigation through context randomisation
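The sketch below illustrates the framing-effect stress test referenced earlier in this list: the same items are scored under logically equivalent positive and negative framings, and the mean framing gap is compared against a tolerance. Collecting the paired scores from the model is outside the sketch, and the tolerance is an illustrative assumption.

```python
import statistics

def framing_gap(positive_scores, negative_scores, tolerance=0.05):
    """Compare scores for logically equivalent positively / negatively framed prompts.

    The two lists are paired per item (same underlying question, reworded).
    Returns the mean gap and whether it exceeds an illustrative tolerance.
    """
    gaps = [p - n for p, n in zip(positive_scores, negative_scores, strict=True)]
    mean_gap = statistics.fmean(gaps)
    return mean_gap, abs(mean_gap) > tolerance

# Illustrative paired scores (e.g., "rate of success" vs. "rate of failure" phrasing,
# with the negative framing mapped back onto the same scale)
pos = [0.80, 0.75, 0.82, 0.78]
neg = [0.72, 0.70, 0.74, 0.71]
gap, flagged = framing_gap(pos, neg)
print(f"mean framing gap={gap:.3f}, exceeds tolerance: {flagged}")
```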
Cognitive biases
- Confirmation bias, sycophancy, desire to please
- Guessing and meeting user assumptions and expectations at the expense of accuracy
- Seeking information that confirms priors
- Advocating for the perceived user interests
- Informational anchoring
- Misinterpretation or over-reliance on initially presented information
- Numeric anchors affecting quantitative outputs
- Testing with varied anchor values
- Priming-induced anchoring
- When to expect priming-induced anchoring effects
- Detection and mitigation strategies
- Meeting legal and compliance requirements in HR and other regulated contexts
- Central tendency
- Avoiding extreme scores on rating scales in favor of midrange values
- Reduction in ranking stability due to varying degrees of central tendency across runs
- Few-shot and other methods to reduce and stabilize effects of central tendency
- Position bias
- Favoring first-presented or last-presented items when evaluating multiple candidates
- Position swap testing methodology and metrics
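A minimal sketch of the position-swap testing listed above for position bias: each pairwise comparison is judged in both presentation orders, and the rate of order-consistent verdicts is reported. The judge calls themselves are outside the sketch; the verdicts shown are illustrative.

```python
def position_swap_consistency(original_winners, swapped_winners):
    """Rate at which a pairwise judge picks the same item when A/B order is swapped.

    `original_winners[i]` and `swapped_winners[i]` each name the winning underlying
    candidate (not the presentation slot) for pair i, with the presentation order
    reversed in the second run.
    """
    agreements = sum(o == s for o, s in zip(original_winners, swapped_winners, strict=True))
    return agreements / len(original_winners)

# Illustrative verdicts for 6 candidate pairs judged in both orders
original = ["A", "B", "A", "A", "B", "A"]
swapped  = ["A", "B", "B", "A", "B", "A"]
print(f"order-consistency rate: {position_swap_consistency(original, swapped):.2f}")
```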
Practical Exercise: Identifying and mitigating cognitive biases
The participants will identify and mitigate cognitive biases affecting the workflows they built.
Note: No coding required. The exercise will be performed using an online playground.
Week 4: Tuesday 14th April
Improving Reliability of AI-Based Workflows
Key causes of uncertainty
- Aleatoric vs epistemic uncertainty
- Inherent data randomness vs. model knowledge limitations
- Measurement approaches and mitigation strategies for each type (see the sketch below)
- Hallucinations due to the lack of grounding
- Importance of grounding from external knowledge sources
- Assuming facts learned from common patterns in training data
- Detecting confident but incorrect responses
- Psychological effects and cognitive biases
- Impact on reliability and consistency metrics
- Systematic vs. random bias-induced error patterns
- Thinking tangents
- Unexpected reasoning paths in complex rules
- Prevention, detection and correction of undesirable thinking tangents
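The sketch below gives one way to make the aleatoric vs. epistemic distinction referenced earlier in this list concrete: within-prompt variance across randomised runs as an aleatoric proxy, and variance across semantically equivalent reformulations as an epistemic proxy. This decomposition is an illustrative framing rather than a formal identity, and the scores shown are illustrative.

```python
import statistics

def variance_decomposition(score_matrix):
    """Split score variability into within-prompt and between-prompt components.

    `score_matrix[i][j]` is the score of run j for reformulation i of the same task.
    The averaged within-prompt variance proxies aleatoric noise; the variance of
    per-reformulation means proxies epistemic sensitivity to phrasing.
    """
    within = statistics.fmean(statistics.pvariance(row) for row in score_matrix)
    between = statistics.pvariance([statistics.fmean(row) for row in score_matrix])
    return within, between

# Illustrative scores: 3 reformulations of one task, 4 randomised runs each
scores = [
    [0.81, 0.79, 0.83, 0.80],
    [0.70, 0.72, 0.69, 0.71],
    [0.78, 0.80, 0.77, 0.79],
]
aleatoric_proxy, epistemic_proxy = variance_decomposition(scores)
print(f"within-prompt variance={aleatoric_proxy:.4f}, between-prompt variance={epistemic_proxy:.4f}")
```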
Mitigation by prompt and workflow design
- Challenger models
- Using alternative models for validation
- Cross-model consistency checks
- Identifying model-specific biases and errors
- Effective grounding
- Using retrieval-augmented generation (RAG) effectively
- Best practices for using and creating model context protocol (MCP) servers
- Using conventional (non-MCP) knowledge bases
- Citation and web search source tracking
- Multistep workflows and decision graphs (rulebooks)
- Breaking complex tasks into manageable steps using rulebooks
- Conditional logic and branching paths
- Error detection and recovery mechanisms for complex rulebooks
- Dynamic few-shot
- Selecting relevant examples at runtime (see the sketch below)
- Similarity ranking-based example retrieval (reverse lookup)
- Identifying and addressing gaps in curated few-shot examples
- Corrective few-shot
- Learning from mistakes and failure cases
- Negative examples and counterexamples
- Iterative refinement and improvement
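The sketch below illustrates the dynamic few-shot selection referenced earlier in this list: the most similar curated examples are retrieved at runtime for inclusion in the prompt. A bag-of-words cosine similarity stands in for whatever embedding model a real deployment would use, and the example pool is illustrative.

```python
import math
from collections import Counter

def cosine_similarity(a: str, b: str) -> float:
    """Bag-of-words cosine similarity; a stand-in for embedding-based similarity."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0

def select_few_shot(query: str, example_pool: list[dict], k: int = 2) -> list[dict]:
    """Pick the k curated examples most similar to the incoming query."""
    return sorted(example_pool, key=lambda ex: cosine_similarity(query, ex["input"]), reverse=True)[:k]

# Illustrative curated examples for a trade data-entry workflow
pool = [
    {"input": "Please confirm we bought 5m EUR/USD at 1.0845", "output": {"side": "buy"}},
    {"input": "We sold 10m GBP/USD at 1.2710, value spot", "output": {"side": "sell"}},
    {"input": "Quote request: 2y USD swap, 25m notional", "output": {"side": "quote"}},
]
query = "Confirming purchase of 3m EUR/USD at 1.0832"
for ex in select_few_shot(query, pool):
    print(ex["input"])
```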
Mitigation by Monte Carlo
- Random sampling across multiple randomised AI model runs as a powerful way to improve AI workflow reliability
- Seed and prefix-based randomisation
- Achieving statistical confidence at a predefined threshold (e.g., 99% confidence)
- Voting across multiple runs for multiple-choice outputs
- Majority voting and consensus mechanisms
- Weighted voting based on confidence scores
- Handling approximate ties and low-confidence cases
- Crowdsourcing across multiple runs for numerical and other continuous-scale outputs
- Effective aggregation techniques in the presence of outliers
- Distribution-agnostic confidence intervals under epistemic and aleatoric uncertainty
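A minimal sketch of the Monte Carlo mitigations above: majority voting over multiple-choice outputs and median aggregation for numerical outputs across randomised runs. The margin-based tie handling and the choice of the median are illustrative options among those listed, not the only ones.

```python
import statistics
from collections import Counter

def majority_vote(answers, min_margin=2):
    """Majority vote across runs; flags low-confidence cases with a small winning margin."""
    counts = Counter(answers).most_common()
    winner, top = counts[0]
    runner_up = counts[1][1] if len(counts) > 1 else 0
    return winner, (top - runner_up) >= min_margin

def robust_aggregate(values):
    """Median aggregation for continuous outputs; resistant to outlier runs."""
    return statistics.median(values)

# Illustrative outputs from 7 randomised runs of the same workflow step
choice_runs = ["B", "B", "A", "B", "C", "B", "A"]
numeric_runs = [101.2, 100.9, 101.4, 150.0, 101.1, 100.8, 101.3]  # one outlier run
winner, confident = majority_vote(choice_runs)
print(f"voted answer={winner}, clear margin={confident}")
print(f"aggregated value={robust_aggregate(numeric_runs)}")
```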
Practical Exercise: Using voting and crowdsourcing to improve reliability
The participants will use voting and crowdsourcing to improve reliability of the workflows they built.
Note: No coding required. The exercise will be performed using an online playground.