
Evals

Are your eval improvements just pure chance?

A step that's often missed when benchmarking retrieval methods is determining whether any performance difference is due to random chance. Without this step, you might invest in a new system that outperformed your old one by pure luck.

If you're comparing retrieval methods, you'll often want to know whether the improvements you're seeing are real or just noise.

In this article, we'll use a simple case study to demonstrate how to answer this question, introducing a new library called indomee (a playful nod to both "in-domain" evaluation and the beloved instant noodle brand in Southeast Asia) that makes this analysis significantly easier.

We'll do so in three steps, with a rough code sketch after the list:

  1. First we'll simulate some fake data using numpy
  2. Then we'll demonstrate how to do bootstrapping using nothing but numpy before visualising the results with matplotlib
  3. Finally we'll perform a paired t-test to determine if the differences are statistically significant
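As a preview, here's a minimal sketch of those three steps using nothing but numpy, matplotlib and scipy (indomee is meant to make this kind of analysis easier, but the sketch below sticks to the raw tools). The per-query scores, sample sizes and metric are invented purely for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(42)

# 1. Simulate fake per-query recall scores for two retrieval methods.
#    The means and spread are arbitrary - this is stand-in data, not a real benchmark.
n_queries = 200
method_a = np.clip(rng.normal(0.72, 0.10, n_queries), 0, 1)
method_b = np.clip(rng.normal(0.75, 0.10, n_queries), 0, 1)

# 2. Bootstrap the mean score of each method by resampling queries with replacement,
#    then visualise how much the two distributions overlap.
n_boot = 1_000
boot_a = np.array([rng.choice(method_a, n_queries, replace=True).mean() for _ in range(n_boot)])
boot_b = np.array([rng.choice(method_b, n_queries, replace=True).mean() for _ in range(n_boot)])

plt.hist(boot_a, bins=30, alpha=0.6, label="method A")
plt.hist(boot_b, bins=30, alpha=0.6, label="method B")
plt.xlabel("bootstrapped mean recall")
plt.legend()
plt.show()

# 3. Paired t-test on the per-query scores (paired because both methods answer the same queries).
t_stat, p_value = stats.ttest_rel(method_a, method_b)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```

If the bootstrapped distributions barely overlap and the p-value is small, the improvement is unlikely to be chance; if they sit on top of each other, hold off on that upgrade.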

Why User Intent matters the most for Synthetic Data

Introduction

I've generated millions of tokens' worth of synthetic data over the last few weeks, and I've learned something surprising: everyone talks about using different personas or complex question structures when creating synthetic data, but they're missing what really matters.

The most important thing is actually understanding why users are asking their questions in the first place - their intent.

Let's explore this concept using Peek, an AI personal finance bot, as our case study.

By examining how synthetic data generation evolves from basic documentation-based approaches to intent-driven synthesis, we'll see why focusing on user intent produces more valuable training data.
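To make the contrast concrete, here's a deliberately tiny sketch of what intent-conditioned generation can look like. The intents, helper function and document snippet are invented for illustration and aren't taken from Peek.

```python
# Hypothetical intents a personal finance user might have. A real taxonomy would be
# mined from production queries, not written by hand like this.
intents = [
    "check whether I can afford an upcoming purchase",
    "understand why my spending went up this month",
    "decide whether to rebalance my portfolio",
]

def synthetic_question_prompt(intent: str, doc_chunk: str) -> str:
    # Condition generation on *why* the user is asking, not just on the document text.
    return (
        "You are simulating a user of a personal finance assistant.\n"
        f"The user's underlying goal is: {intent}\n"
        "Write one question they might plausibly ask that is answerable from this context:\n"
        f"{doc_chunk}"
    )

print(synthetic_question_prompt(intents[0], "Monthly cash flow report for March ..."))
```

Compared with prompting off the documentation alone, each generated question now carries a reason for existing, which is the signal that makes the resulting dataset more valuable.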

Getting Started with Evals - a speedrun through Braintrust

For software engineers struggling with LLM application performance, simple evaluations are your secret weapon. Forget the complexity — we'll show you how to start testing your LLM in just 5 minutes using Braintrust. By the end of this article, you'll have a working example of a test harness that you can easily customise for your own use cases.

We'll be using a cleaned version of the GSM8k dataset that you can find here.

Here's what we'll cover (with a rough sketch of the harness after the list):

  1. Setting up Braintrust
  2. Writing our first task to evaluate an LLM's responses to GSM8k questions with Instructor
  3. Simple recipes that you'll need
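To give a feel for the shape of that harness, here's a rough sketch, assuming Braintrust's Eval entry point, Instructor's OpenAI patching, and API keys set in the environment. The project name, model choice, inline dataset and exact-match scorer are placeholders rather than the article's actual code.

```python
import instructor
from braintrust import Eval
from openai import OpenAI
from pydantic import BaseModel

# Requires OPENAI_API_KEY and BRAINTRUST_API_KEY to be set in the environment.
client = instructor.from_openai(OpenAI())

class Answer(BaseModel):
    reasoning: str
    answer: int  # GSM8k answers are integers, so force a numeric field

def solve(question: str) -> int:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        response_model=Answer,
        messages=[{"role": "user", "content": question}],
    )
    return resp.answer

def exact_match(input, output, expected):
    # Minimal custom scorer: 1 if the parsed answer matches the reference, else 0.
    return 1.0 if output == expected else 0.0

# A one-row inline dataset stands in for the cleaned GSM8k split used in the article.
dataset = [
    {"input": "Tom has 3 apples and buys 2 more. How many apples does he have?", "expected": 5},
]

Eval(
    "gsm8k-speedrun",  # placeholder project name
    data=lambda: dataset,
    task=solve,
    scores=[exact_match],
)
```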

Reinventing Gandalf

Introduction

The code for the challenge can be found here.

A while ago, a company called Lakera released a challenge called Gandalf on Hacker News which took the LLM community by storm. The premise was simple - get an LLM that they had built to reveal a password. This wasn't an easy task, and many people spent days trying to crack it.

Some time after their challenge had been released, they were kind enough to release both the solution AND a rough overview of how the challenge was developed. You can check it out here. Inspired by this, I figured I'd try to reproduce it to some degree on my own, in a challenge called The Chinese Wall that I built with Peter Mekhaeil for our company's annual coding competition. We will be releasing the code shortly.

Participants were asked to try to extract a password from an LLM that we provided. We also provided a Discord bot, trained on the challenge documentation, which participants could use to ask questions.

Here's a quick snapshot of it in action:

The model uses OpenAI's GPT-3.5 under the hood with the instructor library for function calls.
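The challenge code hasn't been released yet, so the snippet below is only a speculative sketch of how instructor's structured outputs could be used as a guardrail around replies; the GuardVerdict model, check_reply helper and example password are all hypothetical.

```python
import instructor
from openai import OpenAI
from pydantic import BaseModel

client = instructor.from_openai(OpenAI())

class GuardVerdict(BaseModel):
    leaks_password: bool  # does the candidate reply reveal the secret, directly or indirectly?
    reason: str

def check_reply(candidate_reply: str, password: str) -> GuardVerdict:
    return client.chat.completions.create(
        model="gpt-3.5-turbo",
        response_model=GuardVerdict,
        messages=[
            {
                "role": "system",
                "content": f"You judge whether a reply reveals the secret password '{password}', "
                           "directly or indirectly (hints, rhymes, encodings).",
            },
            {"role": "user", "content": candidate_reply},
        ],
    )

print(check_reply("The password rhymes with 'swordfish'.", "SWORDFISH"))
```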