

Taming Your LLM Application

This article sums up a talk I'm giving in Kaohsiung at the Taiwan Hackerhouse Meetup on Dec 9th. If you're interested in attending, you can sign up here.

When building LLM applications, teams often jump straight to complex evaluations, reaching for tools like RAGAS or another LLM as a judge. While these sophisticated approaches have their place, I've found that starting with simple, measurable metrics leads to more reliable systems that improve steadily over time.

Five Levels of LLM Applications

I've noticed that teams tend to progress through five levels as they build more reliable language model applications.

  1. Structured Outputs - Move from raw text to validated data structures
  2. Prioritizing Iteration - Use cheap metrics like recall/MRR to ensure you're nailing down the basics (a quick sketch of these metrics follows this list)
  3. Fuzzing - Use synthetic data to systematically test for edge cases
  4. Segmentation - Understand the weak points of your model
  5. LLM Judges - Use an LLM as a judge to evaluate subjective aspects
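
To make level 2 concrete before we go any further, here's a minimal sketch of recall@k and MRR. It's plain Python with no LLM calls; the function names and document IDs are placeholders I've made up for illustration.

```python
# Illustrative sketch of level 2's "cheap metrics": recall@k and MRR over
# retrieval results. Plain Python, no LLM calls; names and IDs are made up.

def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the relevant documents that show up in the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


def mrr(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """Reciprocal rank of the first relevant document (0.0 if none appear)."""
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant_ids:
            return 1 / rank
    return 0.0


# Example: the only relevant document ("d7") is ranked second.
print(recall_at_k(["d3", "d7", "d1"], {"d7"}, k=2))  # 1.0
print(mrr(["d3", "d7", "d1"], {"d7"}))               # 0.5
```

Because these run in milliseconds against a fixed dataset, you can put them in CI and iterate on retrieval quickly, which is the whole point of prioritizing iteration.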

Let's explore each level in more detail and see how each one builds on the last. We'll use instructor in these examples since that's what I'm most familiar with, but the concepts apply to other tools as well.
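
As a first taste, here's a minimal sketch of level 1 with instructor. It assumes a recent instructor release that exposes `instructor.from_openai`, an OpenAI API key in your environment, and a placeholder `UserDetail` schema and model name; treat it as a sketch rather than the exact code from the talk.

```python
# Minimal sketch of level 1: ask for a validated Pydantic model instead of
# parsing raw text. Assumes OPENAI_API_KEY is set; the UserDetail schema,
# prompt, and model name are illustrative.
import instructor
from openai import OpenAI
from pydantic import BaseModel


class UserDetail(BaseModel):
    name: str
    age: int


# Patch the OpenAI client so completions are parsed and validated into UserDetail.
client = instructor.from_openai(OpenAI())

user = client.chat.completions.create(
    model="gpt-4o-mini",
    response_model=UserDetail,
    messages=[{"role": "user", "content": "Extract: Jason is 25 years old."}],
)

print(user)  # UserDetail(name='Jason', age=25)
```

The point is that downstream code works with typed, validated data rather than free-form text, which is what makes the later levels measurable.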