How do I actually test my AI application?

Testing and optimising an LLM-based application feels more like medical diagnosis and treatment than traditional debugging. Problems are often difficult to find, and their causes usually cannot be determined with absolute certainty.

The development and use of AI applications, especially those based on large language models (LLMs), present us with completely new challenges in testing. The methods we are accustomed to from traditional software development are only of limited use here. Why is that?

Different from traditional software development

In classic software development, we work with a limited range of inputs, from which outputs are generated in a regular and traceable manner.

Errors are usually obvious: the program does not run, throws an error message or delivers clearly incorrect results. If the output is wrong, it can only be because the code that generates it is faulty, or because we did not anticipate that input. We correct errors in the code, and we can block unexpected (and unwanted) input if necessary.

It therefore makes sense to treat errors independently of each other. Fixing one error may affect existing functionality, but that usually comes down to changes in shared functions, data structures or the like; it does not mean that the cases are inextricably linked.

In LLM-based application development, the world looks different. Errors are not obvious; instead of an error message, a plausible response is generated. This means that we first have to find the error. Even with simple applications, the range of possible inputs is practically unlimited and often cannot be restricted. Furthermore, there are no selective, isolated interventions. Everything we give the LLM at a given moment – instructions, context, user input – acts as a whole. Improving one case can worsen twenty others.

The complexity explodes in the integration and system testing of agent systems, where errors can propagate in unforeseen ways. And finally, instructions are not always followed: LLMs sometimes interpret them idiosyncratically.

What are the consequences?

When testing and optimising LLM-based applications, it does not make sense to consider individual cases in isolation. If we try to improve quality in specific areas by refining the instructions, we still need to test extensively – including cases that already worked: regular regression testing of the entire system and its subsystems (if any) is virtually mandatory.
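
As an illustration, here is a minimal sketch of such a regression run in Python. The function `generate_answer` is a hypothetical wrapper around the application under test, and the JSON test-file layout is an assumption, not a prescribed format:

```python
import json


def generate_answer(question: str) -> str:
    """Hypothetical wrapper around the LLM application under test."""
    raise NotImplementedError


def run_regression(test_file: str) -> list[str]:
    """Run the full test set and report which previously passing cases broke."""
    with open(test_file, encoding="utf-8") as f:
        cases = json.load(f)  # assumed: [{"id": ..., "question": ..., "expected": ...}, ...]

    failed = []
    for case in cases:
        answer = generate_answer(case["question"])
        # Deliberately crude check; free-form outputs need a real evaluation step.
        if case["expected"].lower() not in answer.lower():
            failed.append(case["id"])

    print(f"{len(cases) - len(failed)}/{len(cases)} cases passed")
    return failed
```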

Instead of relying on binary pass/fail results, testing processes for LLMs must be based on statistical metrics such as precision, recall or F1 scores. These metrics show how well the system performs on average – but they can also mask weaknesses in rare or particularly critical cases. A pragmatic approach is to define thresholds for metrics (e.g., ‘95% of responses must be correct’) while also providing for manual reviews of high-impact cases (e.g., medical advice).
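
To make this concrete, here is a small sketch of how such metrics and a threshold gate could be computed from per-case binary judgements. The data layout and the gate on overall accuracy are assumptions; the 95% figure is the example value from above:

```python
def precision_recall_f1(results: list[tuple[bool, bool]]) -> tuple[float, float, float]:
    """results holds (expected, predicted) pairs, one binary judgement per test case."""
    tp = sum(1 for exp, pred in results if exp and pred)
    fp = sum(1 for exp, pred in results if not exp and pred)
    fn = sum(1 for exp, pred in results if exp and not pred)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1


ACCURACY_THRESHOLD = 0.95  # the example threshold mentioned above


def passes_release_gate(results: list[tuple[bool, bool]]) -> bool:
    """Aggregate gate only; high-impact cases still need a separate manual review."""
    accuracy = sum(1 for exp, pred in results if exp == pred) / len(results)
    return accuracy >= ACCURACY_THRESHOLD
```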

First, we need to find the errors: we need extensive test data with known correct results. But even then, it is not trivial to determine whether the generated output matches the test data. Only in the simplest yes/no or multiple-choice outputs are matches or deviations obvious; in tasks such as summarising texts, chat responses, etc., evaluating the test results is often a challenge in itself that requires the support of a language model.
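
One common pattern for free-form outputs is to let a second language model grade the generated answer against the reference (‘LLM as judge’). A rough sketch, assuming the OpenAI Python client; the prompt, model name and five-point scale are placeholders, not recommendations:

```python
from openai import OpenAI

client = OpenAI()  # assumes an OPENAI_API_KEY in the environment

JUDGE_PROMPT = (
    "You are grading a generated summary against a reference summary.\n"
    "Reference: {reference}\n"
    "Candidate: {candidate}\n"
    "Answer with a single digit from 1 (unusable) to 5 (fully equivalent)."
)


def judge_summary(reference: str, candidate: str, model: str = "gpt-4o-mini") -> int:
    """Ask a second model how well the candidate matches the reference."""
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(reference=reference, candidate=candidate),
        }],
    )
    return int(response.choices[0].message.content.strip()[0])
```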

Some critical questions remain

  • There is an urgent need for analysis and insight: Can we determine where the LLM frequently deviates from its instructions? Where might instructions conflict with each other?
  • What happens if we suddenly get worse results with more instructions?
  • Test data must not only be extensive; it must also cover complex cases, e.g. multi-turn interactions (conversations with several consecutive questions and answers) – see the sketch after this list.
  • And even much larger test sets will only ever contain the cases (with many variations) that we have anticipated. What about the cases we cannot anticipate?
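
The sketch referred to above: one possible shape for a multi-turn test case, with purely illustrative field names:

```python
# A multi-turn test case records the whole conversation, because the quality of a
# later answer depends on everything that was said before it.
multi_turn_case = {
    "id": "returns-policy-003",  # illustrative example
    "turns": [
        {"user": "Can I return a product after 30 days?",
         "expected_points": ["30-day limit", "exceptions for defective items"]},
        {"user": "And what if it was a gift?",
         "expected_points": ["gift receipt", "store credit"]},
    ],
}
```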

In addition to extensive automated testing, ‘on-the-job training’ is also required – learning from real-life usage scenarios.
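
One pragmatic building block for this is to capture real interactions (with the necessary consent and anonymisation) and feed conspicuous ones back into the test set. A rough sketch with illustrative names and file format:

```python
import json
from datetime import datetime, timezone

LOG_FILE = "production_traces.jsonl"  # illustrative path


def log_interaction(question: str, answer: str, feedback: str | None = None) -> None:
    """Record a real interaction so it can be reviewed later and, if it exposes
    a weakness, promoted into the regression test set."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "question": question,
        "answer": answer,
        "feedback": feedback,  # e.g. a thumbs up/down signal from the UI
    }
    with open(LOG_FILE, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```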

Conclusions

Will we have to test software in the future in the same way that new medicines are approved – with ‘preclinical’ and ‘clinical trials’ and ‘approval’?

Even if we don't want to go that far, we need graduated evaluation scales rather than binary verdicts, along with comprehensive test cases. We need regression tests and a clear idea of how to recognise errors as such.

And we must abandon the notion that we can test an application once, release it and then run it unchanged. Instead, we need new methods and tools for continuous quality assurance and performance monitoring.