Generative AI testing scenarios

From theory to practice: A companion resource to Software for the generative age: Rethinking quality assurance for AI.

How to test three common generative AI deployments

Generative AI testing requires a different approach than traditional software QA—but knowing the principles and knowing how to apply them are two different things. This guide takes the core techniques from probabilistic QA and shows what they look like when applied to three systems organizations are actively building right now.

The three deployments covered here represent distinct archetypes: conversational systems that generate responses in real time, transformation systems that process a fixed input into a derived output, and AI-assisted decision systems where a generative AI layer sits on top of a deterministic rules engine. Each has a different primary failure mode, and each demands a different testing emphasis. Find the one that most closely matches what you’re building and use it as a starting point for your own testing approach.

1. Conversational systems

A customer service chatbot for a financial services provider.

The application

A chatbot handling support inquiries (billing questions, policy lookups, complaints) for an insurance company or bank.

The challenge

The chatbot needs to be accurate, consistent, and safe across an enormous range of inputs it wasn’t specifically trained for. The same billing question phrased by ten different customers might produce ten different responses. Some of those responses could be wrong. Some could make commitments the company can’t honour. A frustrated customer handled poorly doesn’t just remain a dissatisfied customer—they become a complaint about the chatbot itself. Standard automated tests checking for exact response strings are useless here.

How to test it

Define the behavioural envelope
Before development begins, define what this chatbot is and isn’t authorized to do: what topics it can discuss, what questions must always escalate to a human agent, what tone is acceptable and what isn’t, and what it must never claim it can do.

Build a grounded test suite
Use two sources: real historical customer queries (anonymized) and synthetically generated variants that cover question types the system will encounter but your history may underrepresent. Synthetic variants let you deliberately probe unusual phrasings, emotionally charged questions, and requests the system isn’t authorized to handle. Together they give you a corpus that’s grounded in reality and adversarially complete.

Run evaluators against every response in that suite
Relevance: does this response address what the customer actually asked? Tone: is it neutral to positive, or does it read as cold or dismissive? Factual accuracy: does it accurately represent the company’s actual policies? Safety: does it make any commitment or claim the company hasn’t authorized?

Run property tests across question variants
If your chatbot handles refund policy questions, generate fifty ways of asking “can I get a refund?” and verify that none of them produce responses that contradict each other or contradict the written policy.

Red team specifically for unauthorized commitments
An adversarial tester’s job is to find the phrasing that makes the chatbot say ”yes, we can refund that" when the correct answer is “that’s not something I can authorize, but I can connect you with someone who can.” This isn’t a hypothetical risk. Air Canada’s chatbot made a commitment to a customer about bereavement fares that the airline then refused to honour. A tribunal held Air Canada responsible regardless. Companies are liable for what their chatbots promise.

Monitor production continuously
Track response quality scores over time. When scores drop, investigate before customers start complaining.

2. Transformation systems

An AI document summarization tool for government caseworkers.

The application

An AI tool that helps caseworkers process lengthy application documents (eligibility assessments, supporting evidence, policy references) by summarizing key information before a caseworker makes a decision.

The challenge

Missing an eligibility condition in a summary, or misrepresenting what a document says, can affect whether a real person receives a benefit they’re entitled to. Summaries will vary in structure and phrasing across runs even for identical source documents. The question of what counts as a “correct” summary is genuinely hard to answer.

How to test it

Establish ground truth benchmarks
Before the tool goes anywhere near a caseworker, have policy experts and experienced caseworkers review a representative sample of documents and define what a good summary must include: which facts are mandatory, what level of detail is required, what the tool must never omit or fabricate. These become the evaluation criteria. This step should come first — everything else depends on having this definition in place.

Test for faithfulness, not consistency
Verify that every claim in a summary is supported by the source document, catching both fabrications and distortions. A summary that varies in phrasing across two runs is fine. One that introduces a claim not in the source, omits a key eligibility condition, or drops information from later sections of a long document is not. Faithfulness tends to degrade toward the end of long outputs—test specifically with documents that bury critical conditions deep in the text.

Run equity evaluations
Ask whether summaries of applications from different demographic groups, written in different registers of language, are consistently complete and accurate. A tool that does well on formal, well-structured applications but poorly on plain-language ones is introducing bias into the process, even if nobody designed it that way. LSE researchers found that gender-related bias emerged unprompted in AI summarization of real social care case notes: men’s health needs were framed as more complex than women’s despite similar circumstances. To test for this, take matched pairs of applications describing identical circumstances but written in different registers or from different demographic groups and compare the completeness and accuracy of the summaries. Systematic differences are a signal worth investigating before deployment.

Test edge cases explicitly
Very long documents, documents with internally conflicting information, and documents that reference legislation the model may not have seen are the cases most likely to produce unreliable summaries—and the most likely to affect critical decisions.

Verify the human handoff
The AI summarizes. The caseworker reads the summary and makes the decision. Testing should verify that this handoff works correctly. That means checking that the tool’s output language positions it as an aid rather than a conclusion: a summary that says “the applicant states they meet the income threshold” is appropriately hedged; one that says “the applicant meets the income threshold” is making an assertion the tool isn’t authorized to make. It also means verifying that when the tool is uncertain, that uncertainty is visible. A summary that silently omits a section it couldn’t parse is more dangerous than one that flags the gap.

3. Decision systems

An AI-powered citizen eligibility and guidance tool.

The application

A provincial ministry deploys an AI tool that helps citizens understand whether they might qualify for programs (child care subsidies, housing assistance, business licensing) by answering questions in plain language, rather than sending people through rigid eligibility wizards.

The challenge

When a citizen service gives different people different answers to the same question, that’s not just a quality problem. It’s a fairness problem with potential legal consequences. If a citizen received guidance that led them to not apply for a benefit they were entitled to, the ministry needs to be able to explain why that happened. The AI also sits alongside legacy systems that must stay deterministic. The actual eligibility determination runs on a rules engine that produces the same output every time. The boundary between what the AI is permitted to say, and what it must defer to the rules engine for, is itself a testing concern.

How to test it

For Canadian provincial deployments, the NIST AI Risk Management Framework has become the most widely adopted structure for governing AI in public services. The testing activities below address its core questions: what risks were identified, what was done to evaluate them, and what triggers a review. Testing against it doesn’t mean producing a compliance document—it means having evidence that you did the work.

Define and test the boundary
Define the boundary between the AI guidance layer and the deterministic rules engine before development starts, and test that it holds under adversarial conditions. An adversarial tester’s job here is to find the phrasing that gets the AI to make an eligibility determination it isn’t authorized to make, to say “yes, you qualify” instead of “based on what you’ve described, you may want to apply and the system will assess your eligibility.” This boundary test needs to run before launch and again after any model update, because changes to the underlying model can shift where that line sits.

Run equity evaluations across linguistic and demographic variation
Take a set of scenarios where the correct guidance is known, then ask those questions in multiple forms: plain English, formal language, simplified language, and translated versions where the service operates multilingually. Where gaps appear, they’re not tone problems—they’re equity failures that affect whether people can access services they’re entitled to. Document what you tested and what the results were, because if a citizen later claims they received inadequate guidance, the ministry needs to be able to show this was evaluated.

Run property tests for policy consistency
If a citizen’s circumstances qualify them for a benefit, every phrasing of their question should produce guidance that at minimum doesn’t discourage them from applying. Test hundreds of variants and flag any that contradict the correct guidance.

Audit logging is a testing requirement, not an infrastructure detail
Government services need to be able to reconstruct what the system told a citizen and when. Test that logging captures what it needs to capture, that it doesn’t capture what it shouldn’t (PII that wasn’t necessary to log), and that the logs are actually readable and searchable after the fact. A log that exists but can’t be practically queried doesn’t satisfy accountability requirements.

When human evaluators independently assess the same outputs your AI produces, they should agree with its assessments at least 85% of the time before deployment is considered production-ready.

Humans aren’t a fallback. They’re part of the design.

Every scenario above sits at the intersection of automated evaluation and human judgment. Automation handles volume—it can score thousands of outputs against defined criteria efficiently and consistently. But human judgment is irreplaceable for the failures that matter most: a response that is technically accurate but practically misleading, a summary that includes all the required facts but obscures the most critical one, guidance that is literally correct but would reasonably lead a citizen to the wrong conclusion.

Keeping humans meaningfully involved isn’t a concession to the limits of AI. In every one of these deployments, it’s the architecture. The chatbot escalates edge cases. The summarization tool assists the caseworker, who makes the decision. The eligibility tool answers questions but defers actual determinations to a rules engine with auditability built in. Testing should verify that these handoffs work correctly—not just that the AI component performs well in isolation, but that the whole system, including the humans in it, functions as intended.

When human evaluators independently assess the same outputs a system produces, they should agree with the system’s judgments at least 85% of the time before a deployment is considered production-ready. Consistent disagreement below that threshold isn’t a sign that the reviewers are being too strict. It’s a sign that the quality definitions need revisiting before the system goes anywhere near the people who’ll be using it.