Synthetic Data

January 27, 2025
in Evals, Instructor, Synthetic Data
8 min read

Why Structured Outputs matter for LLM Applications in 2025

I gave a short talk at NUS in January 2025 about structured outputs and how they enable faster iteration and testing when building with language models. I've written up a more detailed version of the talk here and included the slides below.

LLM applications in 2025 face a unique challenge: while they enable rapid deployment compared to traditional ML systems, they also introduce new risks around reliability and safety.

In this article, I'll explain why structured outputs remain crucial for building robust LLM applications, and how they enable faster iteration and testing.
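To make the testing point concrete, here's a minimal sketch of what a structured output buys you. It uses only the standard library as a stand-in for a schema library, and the `Ticket` type, its fields, and the allowed categories are hypothetical names chosen for illustration, not taken from the talk.

```python
import json
from dataclasses import dataclass

# Hypothetical schema for illustration only.
ALLOWED_CATEGORIES = {"billing", "bug", "other"}


@dataclass
class Ticket:
    category: str
    priority: int


def parse_ticket(raw: str) -> Ticket:
    """Parse a model's raw JSON response into a validated Ticket.

    Raises ValueError on malformed JSON or out-of-range values and
    KeyError on a missing field, so bad responses fail loudly
    instead of flowing downstream as free text.
    """
    data = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    ticket = Ticket(category=str(data["category"]), priority=int(data["priority"]))
    if ticket.category not in ALLOWED_CATEGORIES:
        raise ValueError(f"unknown category: {ticket.category!r}")
    if not 1 <= ticket.priority <= 5:
        raise ValueError(f"priority out of range: {ticket.priority}")
    return ticket


if __name__ == "__main__":
    # A hand-written response standing in for real model output.
    print(parse_ticket('{"category": "bug", "priority": 2}'))
```

Because the response is a typed object rather than a string, a test can assert on individual fields, which is what makes quick iteration on prompts practical.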

November 29, 2024
in Synthetic Data, Evals
7 min read

Why User Intent matters the most for Synthetic Data

Introduction

I've generated millions of tokens worth of synthetic data over the last few weeks, and I've learned something surprising: everyone talks about using different personas or complex question structures when creating synthetic data, but they're missing what really matters.

The most important thing is actually understanding why users are asking their questions in the first place - their intent.

Let's explore this concept using Peek, an AI personal finance bot, as our case study.

By examining how synthetic data generation evolves from basic documentation-based approaches to intent-driven synthesis, we'll see why focusing on user intent produces more valuable training data.

November 23, 2024
in LLMs, Synthetic Data
6 min read

Synthetic Data is no Free Lunch

I spent some time playing with a new framework called Dria recently that uses LLMs to generate synthetic data. I couldn't get it to work but I did spend some time digging through their source code, and I thought I'd share some of my thoughts on the topic.

Over the past few weeks, I've generated a few million tokens of synthetic data for some projects. I'm still figuring out the best way to do it, but the experience has taught me that it's no free lunch. You do need to spend some time thinking about how to generate the data that you want.

The Premise

An example

When I first started generating synthetic data for question-answering systems, I thought it would be straightforward - all I had to do was to ask a language model to generate a few thousand questions that a user might ask.

August 27, 2024
in LLMs, Synthetic Data
5 min read

How to create synthetic data that works

Synthetic data can accelerate AI development, but generating high-quality datasets remains challenging. In this article, I'll walk through a few experiments I've done with synthetic data generation and the takeaways I've learnt so that you can do the same.

We'll do so by covering:

Limitations of simple generation methods: Why simple generation methods produce homogeneous data
Entropy and why it matters: Techniques to increase diversity in synthetic datasets
Practical Implementations: Some simple examples of how to increase entropy and diversity to get better synthetic data
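As a rough illustration of the entropy idea in the list above, here's a minimal sketch that varies the prompt itself by sampling attributes at random, then measures how spread out the result is with Shannon entropy. The attribute pools and the names `build_prompts` and `shannon_entropy` are assumptions made for this example, not the article's own implementation, and no model is called.

```python
import math
import random
from collections import Counter

# Hypothetical attribute pools, chosen only for illustration.
INTENTS = ["compare options", "fix an error", "learn a concept", "plan a budget"]
TONES = ["terse", "casual", "formal"]
LENGTHS = ["one sentence", "a short paragraph"]


def shannon_entropy(labels):
    """Shannon entropy, in bits, of the empirical distribution of labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def build_prompts(n, seed=0):
    """Build n generation prompts, sampling each attribute independently."""
    rng = random.Random(seed)  # seeded so runs are reproducible
    prompts = []
    for _ in range(n):
        intent = rng.choice(INTENTS)
        tone = rng.choice(TONES)
        length = rng.choice(LENGTHS)
        text = (
            f"Write a user question in a {tone} tone, {length} long, "
            f"from someone who wants to {intent}."
        )
        prompts.append({"intent": intent, "tone": tone, "length": length, "prompt": text})
    return prompts


if __name__ == "__main__":
    prompts = build_prompts(200)
    # A single fixed prompt has zero entropy over intents; sampled prompts do not.
    print(shannon_entropy(["fix an error"] * 200))
    print(shannon_entropy([p["intent"] for p in prompts]))
```

Measuring entropy over the prompt attributes only tells you the inputs are varied. Whether the generated questions end up diverse still has to be checked on the outputs themselves.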