
Evals

Are your eval improvements just pure chance?

A step that's often missed when benchmarking retrieval methods is determining whether any performance difference is due to random chance. Without this step, you might invest in a new system that outperformed your old one by pure luck.

If you're comparing retrieval methods, you'll often want to know whether the improvements you're seeing are real or just noise.

In this article, we'll use a simple case study to demonstrate how to answer this question, introducing a new library called indomee (a playful nod to both "in-domain" evaluation and the beloved instant noodle brand in Southeast Asia) that makes this analysis significantly easier.

We'll do so in three steps, with a rough code sketch after the list:

  1. First we'll simulate some fake data using numpy
  2. Then we'll demonstrate how to do bootstrapping using nothing but numpy before visualising the results with matplotlib
  3. Finally we'll perform a paired t-test to determine if the differences are statistically significant
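As a preview, here's a minimal sketch of those three steps using nothing but numpy, matplotlib and scipy (indomee is meant to make this kind of analysis easier, but the sketch below sticks to the raw tools). The per-query scores, sample sizes and metric are invented purely for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(42)

# 1. Simulate fake per-query recall scores for two retrieval methods.
#    The means and spread are arbitrary - this is stand-in data, not a real benchmark.
n_queries = 200
method_a = np.clip(rng.normal(0.72, 0.10, n_queries), 0, 1)
method_b = np.clip(rng.normal(0.75, 0.10, n_queries), 0, 1)

# 2. Bootstrap the mean score of each method by resampling queries with replacement,
#    then visualise how much the two distributions overlap.
n_boot = 1_000
boot_a = np.array([rng.choice(method_a, n_queries, replace=True).mean() for _ in range(n_boot)])
boot_b = np.array([rng.choice(method_b, n_queries, replace=True).mean() for _ in range(n_boot)])

plt.hist(boot_a, bins=30, alpha=0.6, label="method A")
plt.hist(boot_b, bins=30, alpha=0.6, label="method B")
plt.xlabel("bootstrapped mean recall")
plt.legend()
plt.show()

# 3. Paired t-test on the per-query scores (paired because both methods answer the same queries).
t_stat, p_value = stats.ttest_rel(method_a, method_b)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```

If the bootstrapped distributions barely overlap and the p-value is small, the improvement is unlikely to be chance; if they sit on top of each other, hold off on that upgrade.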

Why User Intent matters the most for Synthetic Data

Introduction

I've generated millions of tokens' worth of synthetic data over the last few weeks, and I've learned something surprising: everyone talks about using different personas or complex question structures when creating synthetic data, but they're missing what really matters.

The most important thing is actually understanding why users are asking their questions in the first place - their intent.

Let's explore this concept using Peek, an AI personal finance bot, as our case study.

By examining how synthetic data generation evolves from basic documentation-based approaches to intent-driven synthesis, we'll see why focusing on user intent produces more valuable training data.
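To make the contrast concrete, here's a deliberately tiny sketch of what intent-conditioned generation can look like. The intents, helper function and document snippet are invented for illustration and aren't taken from Peek.

```python
# Hypothetical intents a personal finance user might have. A real taxonomy would be
# mined from production queries, not written by hand like this.
intents = [
    "check whether I can afford an upcoming purchase",
    "understand why my spending went up this month",
    "decide whether to rebalance my portfolio",
]

def synthetic_question_prompt(intent: str, doc_chunk: str) -> str:
    # Condition generation on *why* the user is asking, not just on the document text.
    return (
        "You are simulating a user of a personal finance assistant.\n"
        f"The user's underlying goal is: {intent}\n"
        "Write one question they might plausibly ask that is answerable from this context:\n"
        f"{doc_chunk}"
    )

print(synthetic_question_prompt(intents[0], "Monthly cash flow report for March ..."))
```

Compared with prompting off the documentation alone, each generated question now carries a reason for existing, which is the signal that makes the resulting dataset more valuable.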

Getting Started with Evals - a speedrun through Braintrust

For software engineers struggling with LLM application performance, simple evaluations are your secret weapon. Forget the complexity — we'll show you how to start testing your LLM in just 5 minutes using Braintrust. By the end of this article, you'll have a working example of a test harness that you can easily customise for your own use cases.

We'll be using a cleaned version of the GSM8k dataset that you can find here.

Here's what we'll cover (with a rough sketch of the harness after the list):

  1. Setting up Braintrust
  2. Writing our first task to evaluate an LLM's responses to GSM8k questions with Instructor
  3. Simple recipes that you'll need
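To give a feel for the shape of that harness, here's a rough sketch, assuming Braintrust's Eval entry point, Instructor's OpenAI patching, and API keys set in the environment. The project name, model choice, inline dataset and exact-match scorer are placeholders rather than the article's actual code.

```python
import instructor
from braintrust import Eval
from openai import OpenAI
from pydantic import BaseModel

# Requires OPENAI_API_KEY and BRAINTRUST_API_KEY to be set in the environment.
client = instructor.from_openai(OpenAI())

class Answer(BaseModel):
    reasoning: str
    answer: int  # GSM8k answers are integers, so force a numeric field

def solve(question: str) -> int:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        response_model=Answer,
        messages=[{"role": "user", "content": question}],
    )
    return resp.answer

def exact_match(input, output, expected):
    # Minimal custom scorer: 1 if the parsed answer matches the reference, else 0.
    return 1.0 if output == expected else 0.0

# A one-row inline dataset stands in for the cleaned GSM8k split used in the article.
dataset = [
    {"input": "Tom has 3 apples and buys 2 more. How many apples does he have?", "expected": 5},
]

Eval(
    "gsm8k-speedrun",  # placeholder project name
    data=lambda: dataset,
    task=solve,
    scores=[exact_match],
)
```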

Reinventing Gandalf

Introduction

The code for the challenge can be found here.

A while ago, a company called Lakera released a challenge called Gandalf on Hacker News which took the LLM community by storm. The premise was simple - get an LLM that they had built to reveal a password. This wasn't an easy task, and many people spent days trying to crack it.

Some time after their challenge had been released, they were kind enough to release both the solution AND a rough overview of how the challenge was developed. You can check it out here. Inspired by this, I figured I'd try to reproduce it to some degree on my own, in a challenge called The Chinese Wall that I built with Peter Mekhaeil for our company's annual coding competition. We will be releasing the code shortly.

Participants were asked to try to extract a password from an LLM that we provided. We also provided a Discord bot, trained on the challenge documentation, which participants could use to ask questions.

Here's a quick snapshot of it in action:

The model uses OpenAI's GPT-3.5 under the hood with the instructor library for function calls.
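The challenge code hasn't been released yet, so the snippet below is only a speculative sketch of how instructor's structured outputs could be used as a guardrail around replies; the GuardVerdict model, check_reply helper and example password are all hypothetical.

```python
import instructor
from openai import OpenAI
from pydantic import BaseModel

client = instructor.from_openai(OpenAI())

class GuardVerdict(BaseModel):
    leaks_password: bool  # does the candidate reply reveal the secret, directly or indirectly?
    reason: str

def check_reply(candidate_reply: str, password: str) -> GuardVerdict:
    return client.chat.completions.create(
        model="gpt-3.5-turbo",
        response_model=GuardVerdict,
        messages=[
            {
                "role": "system",
                "content": f"You judge whether a reply reveals the secret password '{password}', "
                           "directly or indirectly (hints, rhymes, encodings).",
            },
            {"role": "user", "content": candidate_reply},
        ],
    )

print(check_reply("The password rhymes with 'swordfish'.", "SWORDFISH"))
```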