Taming Your LLM Application

This article sums up a talk I'm giving at the Taiwan Hackerhouse Meetup in Kaohsiung on Dec 9th. If you're interested in attending, you can sign up here.

When building LLM applications, teams often jump straight to complex evaluations - using tools like RAGAS or another LLM as a judge. While these sophisticated approaches have their place, I've found that starting with simple, measurable metrics leads to more reliable systems that improve steadily over time.

Four levels of LLM Applications

I think there are four levels that teams should progress through as they build more reliable language model applications.

  1. Structured Outputs - Moving from raw text to validated data structures
  2. Prioritizing Iteration - Using cheap metrics like recall/MRR to ensure you're nailing the basics
  3. Fuzzing - Using synthetic data to systematically test for edge cases
  4. LLM Judges - Using an LLM as a judge to evaluate subjective aspects

Let's explore each level in more detail and see how they fit into a progression. We'll use instructor in these examples since that's what I'm most familiar with, but the concepts can be applied to other tools as well.
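To make level one concrete, here's a minimal sketch of the idea behind structured outputs. instructor does this with Pydantic models passed as a `response_model`; this stdlib-only version (with an invented `SupportTicket` schema) just shows the shape of parsing a model's reply into a typed, validated object instead of raw text.

```python
from dataclasses import dataclass

# The idea behind structured outputs: parse the model's reply into a typed,
# validated object instead of passing raw text downstream. instructor does
# this with Pydantic models; this stdlib-only sketch shows the shape of it.

@dataclass
class SupportTicket:  # illustrative schema, not from the article
    category: str
    urgency: int  # 1 (low) to 5 (high)
    summary: str

def parse_ticket(payload: dict) -> SupportTicket:
    ticket = SupportTicket(**payload)
    if ticket.category not in {"billing", "bug", "feature"}:
        raise ValueError(f"unknown category: {ticket.category}")
    if not 1 <= ticket.urgency <= 5:
        raise ValueError(f"urgency out of range: {ticket.urgency}")
    return ticket

# A well-formed "LLM response" parses into a typed object...
ticket = parse_ticket({"category": "billing", "urgency": 4, "summary": "Charged twice"})

# ...and a malformed one fails loudly instead of corrupting downstream code.
try:
    parse_ticket({"category": "billing", "urgency": 9, "summary": "???"})
except ValueError as e:
    print("rejected:", e)
```

The payoff is that every later level (metrics, fuzzing, judges) operates on validated fields rather than brittle string parsing.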

Are your eval improvements just pure chance?

A step that's often missed when benchmarking retrieval methods is checking whether a performance difference is real or just random chance. Without this crucial step, you might invest in a new system that outperformed your old one purely by luck.

In this article, we'll use a simple case study to demonstrate how to answer this question, introducing a new library called indomee (a playful nod to both "in-domain" evaluation and the beloved instant noodle brand in Southeast Asia) that makes this analysis significantly easier.

We'll do so in three steps:

  1. First we'll simulate some fake data using numpy
  2. Then we'll demonstrate how to do bootstrapping using nothing but numpy before visualising the results with matplotlib
  3. Finally we'll perform a paired t-test to determine if the differences are statistically significant
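The three steps above can be sketched with nothing but numpy. The recall rates (0.70 vs 0.74), sample size, and seed below are invented for illustration; a real paired t-test p-value would come from `scipy.stats.ttest_rel`, so here we only compute the t-statistic by hand.

```python
import numpy as np

rng = np.random.default_rng(42)
n_queries = 200

# Step 1: simulate per-query recall for two retrieval methods.
# The underlying rates (0.70 vs 0.74) are made up for this sketch.
old = rng.binomial(1, 0.70, size=n_queries).astype(float)
new = rng.binomial(1, 0.74, size=n_queries).astype(float)

# Step 2: bootstrap the difference in mean recall. We resample query
# indices with replacement so each resample keeps the pairing intact.
n_boot = 10_000
idx = rng.integers(0, n_queries, size=(n_boot, n_queries))
diffs = new[idx].mean(axis=1) - old[idx].mean(axis=1)
ci_low, ci_high = np.percentile(diffs, [2.5, 97.5])
print(f"mean diff: {diffs.mean():.3f}, 95% CI: [{ci_low:.3f}, {ci_high:.3f}]")

# Step 3: the paired t-statistic, computed by hand (scipy.stats.ttest_rel
# does the same and also returns a p-value).
d = new - old
t = d.mean() / (d.std(ddof=1) / np.sqrt(n_queries))
print(f"paired t-statistic: {t:.2f}")

# If the 95% CI includes 0, the "improvement" could plausibly be chance.
```

Plotting the `diffs` histogram with matplotlib makes the same point visually: you want the bulk of the bootstrap distribution clear of zero.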

What Makes Good Documentation

Over the past year, we've grown instructor's documentation to over 60,000 lines of content. This means for every line of code in our library, we've written 5 lines of documentation. Through this process, I've realized that great documentation isn't just about explaining features - it's about demonstrating value.

Why User Intent matters the most for Synthetic Data

Introduction

I've generated millions of tokens worth of synthetic data over the last few weeks, and I've learned something surprising: everyone talks about using different personas or complex question structures when creating synthetic data, but they're missing what really matters.

The most important thing is actually understanding why users are asking their questions in the first place - their intent.

Let's explore this concept using Peek, an AI personal finance bot, as our case study.

By examining how synthetic data generation evolves from basic documentation-based approaches to intent-driven synthesis, we'll see why focusing on user intent produces more valuable training data.
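One lightweight way to make intent explicit is to attach the underlying user goal to every synthetic question, rather than generating questions from documentation alone. The intents and questions below are invented examples for a personal-finance bot like Peek, not real Peek data.

```python
from dataclasses import dataclass

# Attach the user's goal to each synthetic question so coverage is
# measured over intents, not just over documentation pages.
@dataclass
class SyntheticQuestion:
    intent: str    # why the user is asking (invented labels)
    question: str  # what they literally type

# Doc-based generation tends to produce questions like the first one;
# intent-driven generation also covers the goals behind them.
examples = [
    SyntheticQuestion("understand a feature", "What does the spending chart show?"),
    SyntheticQuestion("reduce spending", "Why was my food spend so high last month?"),
    SyntheticQuestion("plan ahead", "Can I afford a $300 purchase this week?"),
]

covered_intents = sorted({q.intent for q in examples})
print(covered_intents)
```

Tagging questions this way lets you audit which intents your synthetic dataset under-represents before you spend tokens generating more of the same.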

You're probably using LLMs wrongly

In this post, I'll share three simple strategies that have transformed how I work with language models.

  1. Treat working with language models as an iterative process where you improve the quality of your outputs over time.
  2. Collect good examples that you can use as references for future prompts.
  3. Regularly review your prompts and examples to understand what works and what doesn't.

Most complaints I hear about language models are about hallucinations or bad outputs. These aren't issues with the technology itself. It's usually because we're not using these models the right way. Think about the last time you hired someone new. You didn't expect them to nail everything perfectly on day one.

The same principle applies to language models.

Setting Up My MacBook for ML Development: A Living Guide

This is a living document that I'll be updating over the next few weeks as I continue setting up and optimizing my new MacBook development environment. While I followed most of Eugene Yan's excellent minimal setup guide, I made some specific choices based on my workflow needs.

Core Development Setup

I kept things minimal for my general development environment, following most of Eugene's recommendations but focusing on just the essentials:

Write Stupid Evals

Evals should start simple and get progressively more complex. Starting simple matters because the goal is to build a habit of writing these simple assertions early. By doing so, we can turn our vibes into objective metrics, which lets us compare different approaches easily and make data-driven decisions about what works and what doesn't.

Don't overthink it and really just use an assert statement at the start.
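In practice, a "stupid eval" can be as small as this. The `extract_amount` function is a stand-in for whatever your LLM pipeline does; here it's a plain function so the sketch runs on its own, and the test strings are invented.

```python
import re

def extract_amount(text: str) -> float:
    """Toy stand-in for an LLM extraction step."""
    match = re.search(r"\$([\d,]+(?:\.\d+)?)", text)
    if match is None:
        raise ValueError("no amount found")
    return float(match.group(1).replace(",", ""))

# The eval itself: hard-coded inputs, hard assertions, no framework.
# Run it on every change to the pipeline.
assert extract_amount("The invoice total is $1,250.50") == 1250.50
assert extract_amount("Refund of $40 issued") == 40.0
print("all stupid evals passed")
```

When an assertion starts failing, that's your signal to dig in; when they all pass trivially, that's your signal to write harder ones.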

There's a famous story about a pottery teacher who divided their class into two groups. The first group would be graded solely on quantity - how many pieces they could produce. The second group would be graded on quality - they just needed to produce one perfect piece. When grading time came, something interesting happened: the best pieces all came from the quantity group. While the quality group got stuck theorizing about perfection, the quantity group learned through iterative practice.

Is RAG dead?

What is RAG?

RAG is a fancy way of stuffing additional information into the prompt of a language model. By giving the model more information, we can get responses that are contextually relevant to what we need. But don't language models already have access to all of the world's information?

Imagine you're starting a new job. Would you rather:

  1. Have access to all of Wikipedia and hope the information you need is somewhere in there
  2. Have your company's specific documentation, procedures, and guidelines
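The "stuffing" described above can be sketched in a few lines. The documents and the word-overlap scoring below are deliberately naive and invented; real systems retrieve with embeddings or a search index, but the prompt assembly looks the same.

```python
# Toy corpus standing in for "your company's specific documentation".
company_docs = [
    "Expense reports must be filed within 30 days of purchase.",
    "The VPN is required for all access to internal dashboards.",
    "New hires receive laptop provisioning within their first week.",
]

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Score docs by word overlap with the query; real systems use embeddings."""
    q = set(query.lower().split())
    scored = sorted(docs, key=lambda d: len(q & set(d.lower().split())), reverse=True)
    return scored[:k]

def build_prompt(query: str) -> str:
    # The essence of RAG: prepend retrieved context to the user's question.
    context = "\n".join(retrieve(query, company_docs))
    return f"Answer using this context:\n{context}\n\nQuestion: {query}"

print(build_prompt("When must expense reports be filed?"))
```

The model never needed all of Wikipedia; it needed the two most relevant internal documents placed directly in front of the question.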

Synthetic Data is no Free Lunch

I spent some time playing with a new framework called Dria recently that uses LLMs to generate synthetic data. I couldn't get it to work but I did spend some time digging through their source code, and I thought I'd share some of my thoughts on the topic.

Over the past few weeks, I've generated a few million tokens of synthetic data for some projects. I'm still figuring out the best way to do it but I think it's definitely taught me that it's no free lunch. You do need to spend some time thinking about how to generate the data that you want.

The Premise

An example

When I first started generating synthetic data for question-answering systems, I thought it would be straightforward - all I had to do was ask a language model to generate a few thousand questions that a user might ask.

You're probably not doing experiments right

I recently started working as a research engineer, and it's been a significant mindset shift in how I approach my work. It's tricky to run experiments with LLMs efficiently and accurately, and after months of trial and error, I've found that there are three key factors that make the biggest difference:

  1. Being clear about what you're varying
  2. Investing time to build out some infrastructure
  3. Doing some simple sensitivity analysis
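Being clear about what you're varying can be as simple as enumerating the full grid up front instead of editing constants by hand between runs. The parameter names and values below are placeholders, not recommendations.

```python
import itertools

# Make the experimental variables explicit in one place. Every run is a
# row in this grid, so nothing varies by accident between experiments.
grid = {
    "model": ["model-a", "model-b"],
    "chunk_size": [256, 512],
    "top_k": [3, 5],
}

configs = [
    dict(zip(grid, values))
    for values in itertools.product(*grid.values())
]

print(f"{len(configs)} runs")  # 2 * 2 * 2 = 8
for cfg in configs[:2]:
    print(cfg)
```

Writing the grid down also feeds the other two factors: infrastructure can loop over `configs`, and sensitivity analysis is just comparing results along one axis of the grid while holding the others fixed.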

Let's see how each of these can make a difference in your experimental workflow.