Software for the generative age: Rethinking Quality Assurance for AI

The first three articles in this series established the core problem: generative AI is probabilistic by design, and that changes how software gets built, designed, and deployed. This article is about what it does to the people responsible for making sure software works.

The instinct that follows is understandable: if you can’t predict exactly what a system will produce, you probably can’t hold it to the same quality standards as one you can. Accepting variability means accepting looser expectations. The testing bar gets lower.

That instinct is wrong. And acting on it is one of the more consequential mistakes an organization can make when deploying generative AI.

Probabilistic systems don’t need less rigorous quality assurance. They need more of it, applied earlier, maintained longer, and involving more people than traditional testing ever required. The bar doesn’t lower. The work required to meet it changes fundamentally.

What breaks and why

Traditional Quality Assurance (QA) rests on one thing: that identical inputs produce identical outputs. That assumption is so foundational that most of the discipline is built on top of it without ever stating it explicitly. Write a test, define an expected value, run the system, check whether they match. If they do, the system works. If they don’t, something is broken.

Generative AI violates that assumption by design, and the problems stack up.

Outputs vary between runs even when the system is working correctly, which means there’s no clean expected value to test against and no deterministic chain of logic to trace when something goes wrong. A government chatbot asked the same eligibility question twice might respond with a numbered list the first time and a conversational paragraph the second. Both could be perfectly correct. A test expecting a specific string fails both times. Not because the system is broken, but because the test was built for a different kind of system. The space of acceptable responses is a range, sometimes a very wide one, and when a bad output does occur, recreating the exact conditions that caused it is difficult for the same reason: the non-determinism that makes outputs vary is the same non-determinism that makes failures hard to pin down.

The system also changes under you in ways traditional software doesn’t. Models get updated. The same prompt against a newer version of the underlying model may produce different results. Tests that passed last month may fail today not because anything broke, but because the model changed. A function doesn’t quietly start behaving differently because someone updated a dependency. A model can.

And the system can confidently produce false information. Generative AI hallucinates, generating plausible-sounding responses that are factually wrong, sometimes with no signal that anything is amiss. Unlike a traditional system that either returns a result or throws an error, a generative system will fill a gap in its knowledge with something that sounds right. Testing needs to catch this. Catching it requires ground truth. Assembling that ground truth for complex domains (policy, law, healthcare, eligibility) requires subject matter experts before a line of code is written.

None of this means testing is impossible. It means the methods have to change, and the organization has to change with them.

The hardest problem isn’t technical

In traditional QA, the test oracle (the authoritative definition of correct behaviour) comes from a specification. Someone writes down what the system should do, and QA tests whether it does that. The specification might be incomplete or poorly written, but it exists, and it can be handed to a testing team.

For generative AI, that specification doesn’t exist in the same form. The expected output isn’t a value; it’s a range. Defining that range—what the system must always do, what it must never do, what counts as a factual error versus an acceptable variation in phrasing, what tone is appropriate, when a response is good enough—requires bringing together people who don’t typically sit in the same room during a testing conversation.

Product managers need to define what the system is for. Policy owners need to define what it’s allowed to say. Legal teams need to define what it can’t commit to. Subject matter experts need to define what accuracy looks like in their domain. And the eventual users of the system, whether caseworkers, citizens, or customers, need to validate that the definition of “good enough” actually serves them.

People sitting around a boardroom table, working and collaborating, signifying all the different knowledge bases required to define quality criteria.

Getting all of those people aligned on quality criteria before the first prompt is written is harder than writing a test suite. That’s what determines whether the test suite means anything. A QA process built on poorly defined quality criteria is just automated noise: expensive, time-consuming, and giving false confidence that something has been properly evaluated.

What generative AI demands of QA teams is a seat at the table much earlier in the process, with the authority to ask “what does good actually look like here?” and the organizational backing to get a real answer.

Generative AI testing scenarios cover—A QA Testing Guide from OXD

Want to see how this works in practice?

Our scenarios guide has practical testing approaches for three common generative AI deployments.

Get the guide

From asserting correctness to evaluating quality

Once the quality criteria exist, the testing practice itself can shift in seven concrete directions.

Evaluators instead of assertions

Replace string-match assertions with automated evaluators that score outputs across multiple dimensions: relevance, factual accuracy, tone, presence of sensitive data, and consistency across runs. Some evaluators are deterministic, such as a regex that checks for personally identifiable information in a response. Others are AI-powered: a separate model that judges whether a response is on-topic or contradicts a policy document. Neither is perfect, but together they cover ground that exact-match testing can’t touch.

Property testing instead of case-by-case coverage

Instead of testing one input against one expected output, define a rule that must hold across a large number of generated inputs, then verify it against hundreds of variants. If your system handles benefits eligibility questions, one property might be “the response must never contradict the policy documentation.” You don’t care exactly what the response says, only that it stays within that bound. When it doesn’t, you have a real failure to investigate.

Benchmarking instead of point-in-time gates

With generative AI, quality is better understood as a moving baseline than a pass/fail gate. Establish performance metrics before launch, measure them continuously, and treat meaningful drops as regressions even when no individual test fails in a traditional sense. If a model update causes response relevance scores to drop from 87% to 74%, that’s worth investigating even if no alert is fired.

Continuous monitoring as part of QA practice

Production behaviour is data. Logging and analyzing outputs in the real world, flagging responses that fall outside established bounds, and feeding those cases back into the test suite is not a post-launch afterthought. It’s where a meaningful portion of your quality signal comes from. Generative systems don’t fail in the same ways during controlled testing as they do when real users get creative with them.

Red teaming for adversarial coverage

Standard test suites are designed by the same team that built the system. They tend to probe the cases that the team thought of. Red teaming brings in testers whose explicit job is to break the system: extracting sensitive information, making it contradict itself, finding the phrasing that causes it to behave in ways it shouldn’t. For any system that interacts with the public or handles sensitive information, red teaming isn’t optional.

Validating reasoning, not just outputs

For high-stakes applications, the output isn’t the only thing that needs to be trustworthy. The reasoning that produced it does too. A summary that reaches the right conclusion through a flawed chain of inference is not a reliable summary. A caseworker or citizen who reads it can’t tell the difference. Testing for reasoning quality means checking whether the steps that led to a conclusion are sound, not just whether the conclusion passes a surface evaluation. This is harder to automate than output scoring, which is part of why it belongs in the human review tier.

Human review where automation reaches its limits

Automated evaluators are good at scale and consistency. They’re less good at catching a response that is technically accurate but practically useless, or guidance that is literally correct but would reasonably lead someone to the wrong conclusion. Structuring human review into the QA process, with clear criteria, consistent sampling, and a feedback loop back into the evaluator suite, is what bridges that gap. The goal isn’t to review everything. It’s to review the right things. A useful benchmark: when human evaluators independently assess the same outputs the system produced, they should agree with the system’s judgments (for example 85% of the time) before a deployment is considered production-ready. Consistent disagreement below that threshold is a signal that the system’s quality definitions need revisiting, not just its outputs.

We tested whether AI could coach a non-expert through accessibility QA in real time. Read the experiment.

What this means in practice

The role of a QA team changes more than the tools do. Writing assertion scripts becomes a smaller part of the job. The larger part is the work that happens before any test is written: facilitating the conversations that produce quality definitions, getting product managers, policy owners, legal teams, and subject matter experts aligned on what good looks like, and then building the evaluator suites, benchmarks, and monitoring infrastructure that test against those definitions continuously.

The organizations that get this right bring QA into the room at the beginning, not to write tests, but to ask the question that makes every subsequent test meaningful. That’s a different kind of work. It requires more judgment, more collaboration, and more sustained attention than traditional testing.

The goal hasn’t changed: software that behaves the way it’s supposed to, and that users can trust. What it takes to get there has—and so has the cost of getting it wrong.

Want to see how this works in practice?

Our scenarios guide has practical testing approaches for three common generative AI deployments.

Get the guide

Have you read the first three articles in our series yet?

Software for the generative age: From precision to probability discusses how generative AI is moving software from deterministic systems to non-deterministic ones.
Software for the generative age: Designing for non-determinism explores how to shift from predictable design to systems that embrace uncertainty.
Software for the generative age: Building around non-deterministic systems reveals the practical realities of building around non-deterministic systems.

What breaks and why

The hardest problem isn’t technical

Want to see how this works in practice?

From asserting correctness to evaluating quality

Evaluators instead of assertions

Property testing instead of case-by-case coverage

Benchmarking instead of point-in-time gates

Continuous monitoring as part of QA practice

Red teaming for adversarial coverage

Validating reasoning, not just outputs

Human review where automation reaches its limits

What this means in practice

Want to see how this works in practice?

Have you read the first three articles in our series yet?

Subscribe for insights, career opportunities, and events. No spam, of course.

You might also like