LLMs

Taming Your LLM Application

This article sums up a talk I'm giving in Kaohsiung at the Taiwan Hackerhouse Meetup on Dec 9th. If you're interested in attending, you can sign up here.

When building LLM applications, teams often jump straight to complex evaluations - using tools like RAGAS or another LLM as a judge. While these sophisticated approaches have their place, I've found that starting with simple, measurable metrics leads to more reliable systems that improve steadily over time.

Four levels of LLM Applications

I think there are four levels that teams should progress through as they build more reliable language model applications.

  1. Structured Outputs - Move from raw text to validated data structures
  2. Prioritizing Iteration - Using cheap metrics like recall and MRR to ensure you're nailing down the basics
  3. Fuzzing - Using synthetic data to systematically test for edge cases
  4. LLM Judges - Using LLM as a judge to evaluate subjective aspects
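
To make the "cheap metrics" in level 2 concrete, here is a minimal sketch of recall@k and mean reciprocal rank (MRR) in plain Python. The function names, document ids and toy queries are all made up for illustration; they are not taken from instructor or any other library.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant items that appear in the top-k retrieved results."""
    relevant_set = set(relevant)
    if not relevant_set:
        return 0.0
    return len(set(retrieved[:k]) & relevant_set) / len(relevant_set)


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant item, or 0.0 if none was retrieved."""
    relevant_set = set(relevant)
    for rank, item in enumerate(retrieved, start=1):
        if item in relevant_set:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (retrieved, relevant) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


# Hypothetical evaluation set: (ranked results, ground-truth relevant ids)
queries = [
    (["doc_3", "doc_1", "doc_7"], ["doc_1"]),
    (["doc_2", "doc_9", "doc_4"], ["doc_4", "doc_5"]),
]

recalls = [recall_at_k(retrieved, relevant, k=3) for retrieved, relevant in queries]
mrr = mean_reciprocal_rank(queries)
print(recalls, mrr)
```

Both metrics only need a list of retrieved ids and a list of known-relevant ids, which is why they are cheap enough to run on every change.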

Let's explore each level in more detail and see how they fit into a progression. We'll use instructor in these examples since that's what I'm most familiar with, but the concepts can be applied to other tools as well.

Are your eval improvements just pure chance?

A step that's often missed when benchmarking retrieval methods is checking whether a performance difference is real or simply due to random chance. Without this step, you might invest in a system upgrade that only appeared to outperform your old one.

In this article, we'll use a simple case study to demonstrate how to answer this question, introducing a new library called indomee (a playful nod to both "in-domain" evaluation and the beloved instant noodle brand in Southeast Asia) that makes this analysis significantly easier.

We'll do so in three steps:

  1. First we'll simulate some fake data using numpy
  2. Then we'll demonstrate how to do bootstrapping using nothing but numpy before visualising the results with matplotlib
  3. Finally we'll perform a paired t-test to determine if the differences are statistically significant
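
As a rough preview of those three steps, here is a minimal sketch that uses only Python's standard library rather than numpy or matplotlib, so the plotting step is left out. The simulated scores, the effect size and every function name are assumptions chosen for illustration, and the code does not use indomee's API. It computes the paired t-statistic by hand and leaves out the p-value, which a statistics package such as scipy (`scipy.stats.ttest_rel`) can provide.

```python
import math
import random
import statistics

rng = random.Random(0)  # fixed seed so the simulation is repeatable

# Step 1: simulate per-query recall for two hypothetical retrieval methods.
n_queries = 200
baseline = [min(1.0, max(0.0, rng.gauss(0.60, 0.15))) for _ in range(n_queries)]
candidate = [min(1.0, max(0.0, b + rng.gauss(0.03, 0.10))) for b in baseline]

# Both methods are scored on the same queries, so we work with paired differences.
diffs = [c - b for c, b in zip(candidate, baseline)]


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, rng=rng):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    means = sorted(
        statistics.fmean(rng.choices(values, k=len(values)))
        for _ in range(n_resamples)
    )
    low = means[int((alpha / 2) * n_resamples)]
    high = means[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high


def paired_t_statistic(values):
    """t-statistic for the null hypothesis that the mean paired difference is zero."""
    n = len(values)
    return statistics.fmean(values) / (statistics.stdev(values) / math.sqrt(n))


# Step 2: bootstrap the mean difference. Step 3: paired t-statistic.
ci_low, ci_high = bootstrap_ci(diffs)
t_stat = paired_t_statistic(diffs)
print(f"mean diff = {statistics.fmean(diffs):.4f}")
print(f"95% bootstrap CI = ({ci_low:.4f}, {ci_high:.4f})")
print(f"paired t-statistic = {t_stat:.2f}")
```

If the bootstrap interval for the mean difference excludes zero, that is some evidence the improvement is not just noise. The t-statistic gives a second view of the same question.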

You're probably using LLMs wrongly

In this post, I'll share three simple strategies that have transformed how I work with language models.

  1. Treat working with language models as an iterative process where you improve the quality of your outputs over time.
  2. Collect good examples that you can use as references for future prompts.
  3. Regularly review your prompts and examples to understand what works and what doesn't.

Most complaints I hear about language models are about hallucinations or bad outputs. These aren't issues with the technology itself. It's usually because we're not using these models the right way. Think about the last time you hired someone new. You didn't expect them to nail everything perfectly on day one.

The same principle applies to language models.

Synthetic Data is no Free Lunch

I spent some time playing with a new framework called Dria recently that uses LLMs to generate synthetic data. I couldn't get it to work but I did spend some time digging through their source code, and I thought I'd share some of my thoughts on the topic.

Over the past few weeks, I've generated a few million tokens of synthetic data for some projects. I'm still figuring out the best way to do it, but the experience has taught me that it's no free lunch. You do need to spend some time thinking about how to generate the data that you want.

The Premise

An example

When I first started generating synthetic data for question-answering systems, I thought it would be straightforward - all I had to do was to ask a language model to generate a few thousand questions that a user might ask.

You're probably not doing experiments right

I recently started working as a research engineer and it's been a significant mindset shift in how I approach my work. It's tricky to run experiments with LLMs efficiently and accurately, and after months of trial and error, I've found that three key factors make the biggest difference:

  1. Being clear about what you're varying
  2. Investing time to build out some infrastructure
  3. Doing some simple sensitivity analysis

Let's see how each of these can make a difference in your experimental workflow.

Why Instructor might be a better bet than Langchain

Introduction

If you're building LLM applications, a common question is which framework to use: Langchain, Instructor, or something else entirely. I've found that this decision comes down to a few critical factors. We'll cover them in three parts:

  1. First we'll talk about testing and granular controls and why you should be thinking about it from the start
  2. Then we'll explain why you should be evaluating a framework's ability to experiment quickly with different models and prompts and adopt new features quickly.
  3. Finally, we'll consider why long term maintenance is also an important factor and why Instructor often provides a balanced solution, offering both simplicity and flexibility.

How to create synthetic data that works

Synthetic data can accelerate AI development, but generating high-quality datasets remains challenging. In this article, I'll walk through a few experiments I've done with synthetic data generation and the takeaways I've learnt so that you can do the same.

We'll do so by covering:

  1. Limitations of simple generation methods : Why simple generation methods produce homogeneous data
  2. Entropy and why it matters : Techniques to increase diversity in synthetic datasets
  3. Practical Implementations : Some simple examples of how to increase entropy and diversity to get better synthetic data
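
One way to put a number on "homogeneous" is to measure the Shannon entropy of some surface feature of the generated data. The sketch below uses the first two words of each question as that feature. The feature, the function names and the toy questions are assumptions for illustration, not the exact method used in the article.

```python
import math
from collections import Counter


def shannon_entropy(items):
    """Shannon entropy (in bits) of the empirical distribution over `items`."""
    counts = Counter(items)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def opening_words(questions, n=2):
    """Lower-cased first `n` words of each question, used as a crude style signature."""
    return [" ".join(q.lower().split()[:n]) for q in questions]


# Hypothetical outputs from a naive prompt versus a more varied one.
homogeneous = [
    "What is the refund policy?",
    "What is the shipping time?",
    "What is the warranty period?",
    "What is the return window?",
]
diverse = [
    "What is the refund policy?",
    "How do I track my order?",
    "Can I change my delivery address?",
    "Why was my card declined?",
]

low_entropy = shannon_entropy(opening_words(homogeneous))
high_entropy = shannon_entropy(opening_words(diverse))
print(low_entropy, high_entropy)
```

Every question in the first set opens the same way, so its entropy is zero bits, while four distinct openings give two bits. A crude check like this is enough to notice when a generation prompt keeps producing the same shape of question.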

AI Engineering World Fair

What's new?

Last year, we saw a lot of interest in the use of LLMs for new use cases. This year, with more funding and interest in the space, we've finally started thinking about productionizing these models at scale and making sure that they're reliable, consistent and secure.

Let's start with a few definitions

  • Agent : This is an LLM which is provided with a few tools it can call. The agentic part of this system comes from the ability to make decisions based on some input. This is similar to Harrison Chase's article here

  • Evaluations : A set of metrics that we can look at to understand where our current system falls short. An example could be measuring precision and recall.

  • Synthetic Data Generation : Data generated by an LLM which is meant to mimic real data

Grokking LLMs

I've spent the last year working with LLMs and writing a good amount of technical content on how to use them effectively, mostly with the help of structured parsing using a framework like Instructor. Most of what I know now is self-taught and this is the guide that I wish I had when starting out.

It should take about 10-15 minutes at most to read and I've added some relevant resources along the way. If you're looking for a higher-level overview, I suggest skimming over the first two sections and then focusing more on the application/data side of things!

I hope that after reading this essay, you walk away with an enthusiasm that these models are going to change so many things that we know today. We have models with reasoning abilities and knowledge capacities that dwarf many humans today in tasks such as Mathematical Reasoning, QnA and more.

Writing scripts that scale

Writing good scripts for machine learning is an art. I struggled with writing them for a long time because of how different it was from my experience working with full-stack frameworks such as React or FastAPI.

There were four main issues that I struggled with:

  1. My job has a high probability of failing without any reason
  2. My data might not fit into memory for no reason
  3. Running a single job takes days or more
  4. Optimizing hyper-parameters is genuinely difficult