Taming Your LLM Application
This is an article that sums up a talk I'm giving in Kaohsiung at the Taiwan Hackerhouse Meetup on Dec 9th. If you're interested in attending, you can sign up here.
When building LLM applications, teams often jump straight to complex evaluations, reaching for tools like RAGAS or using another LLM as a judge. While these sophisticated approaches have their place, I've found that starting with simple, measurable metrics leads to more reliable systems that improve steadily over time.
Four Levels of LLM Applications
I think there are four levels that teams should progress through as they build more reliable language model applications.
- Structured Outputs - Moving from raw text to validated data structures
- Prioritizing Iteration - Using cheap metrics like recall and MRR to nail down the basics (see the sketch after this list)
- Fuzzing - Using synthetic data to systematically test for edge cases
- LLM Judges - Using an LLM as a judge to evaluate subjective aspects of outputs
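To make the second level concrete, here's a minimal sketch of the kind of cheap retrieval metrics I mean. The function names and the toy document IDs are just illustrative, not part of any library:

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that show up in the top-k results."""
    if not relevant:
        return 0.0
    top_k = retrieved[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant) / len(relevant)

def mrr(retrieved: list[str], relevant: set[str]) -> float:
    """Reciprocal rank of the first relevant document, 0 if none is retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

# Example: scoring one retrieval run against a small hand-labelled set
print(recall_at_k(["a", "c", "b"], {"a", "b"}, k=2))  # 0.5
print(mrr(["c", "a", "b"], {"a", "b"}))               # 0.5
```

These metrics run in microseconds, which is exactly why they're good for fast iteration loops.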
Let's explore each level in more detail and see how they fit into a progression. We'll use instructor in these examples since that's what I'm most familiar with, but the concepts can be applied to other tools as well.
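As a preview, here's roughly what the first level (structured outputs) looks like with instructor. This is a minimal sketch, assuming an OpenAI backend; the `UserInfo` schema and the model name are placeholder choices for illustration:

```python
import instructor
from openai import OpenAI
from pydantic import BaseModel

# Wrap the OpenAI client so responses are parsed and validated
# against a Pydantic model instead of returned as raw text.
client = instructor.from_openai(OpenAI())

class UserInfo(BaseModel):
    name: str
    age: int

user = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    response_model=UserInfo,  # instructor validates the output against this schema
    messages=[{"role": "user", "content": "John Doe is 30 years old."}],
)
print(user.name, user.age)  # a validated Python object, not a string to parse
```

Once outputs are typed like this, every later level gets easier: you can compute metrics over fields, fuzz against the schema, and give judges something concrete to grade.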