What it means to look at your data
People always talk about looking at your data, but what does that actually mean in practice?
In this post, I'll walk you through a short example. After examining failure patterns, we discovered that our query understanding was aggressively filtering out relevant items. By prompting the model to be more flexible with its filters, we improved the recall of a filtering system I was working on from 0.86 to 1.
There are really two things that make debugging these issues much easier:
- A clear, objective metric to optimise for - in this case, I was looking at recall (whether or not the relevant item was present in the top k results)
- An easy way to look at the data - I like using braintrust, but you can use whatever you want.
Ultimately, debugging these systems is all about asking intelligent questions and systematically hunting for failure modes. By the end of this post, you'll have a better idea of how to think about data debugging as an iterative process.
Background
In this specific case, our filtering system takes in a user query and returns a list of matching products. We can think of it as follows.
```mermaid
graph LR
    %% Styling
    classDef primary fill:#f8fafc,stroke:#64748b,stroke-width:2px
    classDef secondary fill:#f8fafc,stroke:#94a3b8,stroke-width:1px
    classDef highlight fill:#f8fafc,stroke:#64748b,stroke-width:2px

    %% Main Flow
    A([User Query]) --> B{Extract Filters}:::highlight
    B --> |Raw Query| C([Search]):::primary
    B --> |Filters| D{Validate}:::highlight
    D --> C
    C --> E{Post-Filter}:::highlight
    E --> F([Results]):::primary

    %% Filter Types
    subgraph Filters[" "]
        direction TB
        G[Category]:::secondary
        H[Price]:::secondary
        I[Product Type]:::secondary
        J[Attributes]:::secondary
    end

    %% Connect filters
    E -.-> G
    E -.-> H
    E -.-> I
    E -.-> J

    %% Layout
    linkStyle default stroke:#94a3b8,stroke-width:1px
```
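Before formalising the filters, here's a rough sketch of what the Post-Filter stage does. This is illustrative only: it assumes candidates and filters are plain dicts with hypothetical keys, not our actual implementation.

```python
# Minimal sketch of the Post-Filter stage, assuming each candidate product is
# a dict with "category", "subcategory", "product_type" and "price" keys, and
# that the extracted filters arrive as a dict. Illustrative only.
def post_filter(candidates: list[dict], filters: dict) -> list[dict]:
    def keep(item: dict) -> bool:
        if filters.get("category") and item["category"] != filters["category"]:
            return False
        if filters.get("subcategory") and item["subcategory"] != filters["subcategory"]:
            return False
        # An empty product_type list means "any product type is acceptable"
        if filters.get("product_type") and item["product_type"] not in filters["product_type"]:
            return False
        if filters.get("min_price") is not None and item["price"] < filters["min_price"]:
            return False
        if filters.get("max_price") is not None and item["price"] > filters["max_price"]:
            return False
        return True

    return [item for item in candidates if keep(item)]
```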
We can represent these filters using a Pydantic schema, as seen below, where we have category/subcategory-specific attributes (e.g. sleeve length, fit, etc.).
```python
from typing import Optional

from pydantic import BaseModel


class Attribute(BaseModel):
    name: str
    values: list[str]


class QueryFilters(BaseModel):
    attributes: list[Attribute]
    material: Optional[list[str]] = None
    min_price: Optional[float] = None
    max_price: Optional[float] = None
    subcategory: str
    category: str
    product_type: list[str]
    occasions: list[str]
```
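For illustration, here's roughly how the extraction step can produce a QueryFilters object from a raw query. This is a sketch that assumes the instructor library wrapping an OpenAI client; the model name and prompt are placeholders, not our production setup.

```python
import instructor
from openai import OpenAI

# Hypothetical extraction call using the QueryFilters schema defined above.
client = instructor.from_openai(OpenAI())

def extract_filters(query: str) -> QueryFilters:
    return client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        response_model=QueryFilters,
        messages=[
            {"role": "system", "content": "Extract search filters from the user's query."},
            {"role": "user", "content": query},
        ],
    )
```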
The goal of the system is to return a list of products that match the user's query. The system is evaluated using the following metrics:
- Recall@k - whether or not the relevant item was present in the top k results
- MRR@k - the mean reciprocal rank of the relevant item in the top k results
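As a concrete reference, here's a minimal sketch of how these two metrics can be computed, assuming each evaluation row is just the relevant item's id plus the ranked list of retrieved ids.

```python
# recall@k: 1 if the relevant item appears in the top k results, else 0
def recall_at_k(relevant_id: str, retrieved_ids: list[str], k: int) -> float:
    return 1.0 if relevant_id in retrieved_ids[:k] else 0.0

# mrr@k: reciprocal rank of the relevant item within the top k, else 0
def mrr_at_k(relevant_id: str, retrieved_ids: list[str], k: int) -> float:
    for rank, item_id in enumerate(retrieved_ids[:k], start=1):
        if item_id == relevant_id:
            return 1.0 / rank
    return 0.0

# Average both metrics over an evaluation set of (relevant_id, retrieved_ids) pairs
def evaluate(rows: list[tuple[str, list[str]]], k: int) -> dict[str, float]:
    n = len(rows)
    return {
        f"recall@{k}": sum(recall_at_k(r, ids, k) for r, ids in rows) / n,
        f"mrr@{k}": sum(mrr_at_k(r, ids, k) for r, ids in rows) / n,
    }
```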
Here's a simplified hierarchy of what our catalog looks like - in reality, we have a much more complex hierarchy with many more categories and subcategories.
```
Category: Women
├── Subcategory: Bottoms
│   └── Product Types:
│       - Jeans
│       - Pants
│       - Shorts
└── Subcategory: Tops
    └── Product Types:
        - Blouses
        - T-Shirts
        - Tank Tops
```
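If it helps to have this toy hierarchy in code form for the examples that follow, it might look something like this (the names come from the simplified catalog above; the helper function is hypothetical).

```python
# Toy version of the simplified catalog hierarchy above.
CATALOG: dict[str, dict[str, list[str]]] = {
    "Women": {
        "Bottoms": ["Jeans", "Pants", "Shorts"],
        "Tops": ["Blouses", "T-Shirts", "Tank Tops"],
    }
}

def product_types(category: str, subcategory: str) -> list[str]:
    """All product types under a given category/subcategory."""
    return CATALOG.get(category, {}).get(subcategory, [])

# product_types("Women", "Bottoms") -> ["Jeans", "Pants", "Shorts"]
```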
Initial Implementation: Finding Failure Patterns
Our initial metrics looked promising but had clear room for improvement:
Metric | Filtered Results | Semantic Search |
---|---|---|
recall@3 | 79.00% | 60.53% |
recall@5 | 79.00% | 71.05% |
recall@10 | 86.00% | 84.21% |
recall@15 | 86.00% | 86.84% |
recall@25 | 86.00% | 94.74% |
If we were to look at the raw numbers, we might be tempted to think that structured extraction is the problem - in other words, perhaps we shouldn't apply the filters at all.
But that's the wrong way to think about it. We need to understand why we were failing: ultimately, using structured metadata filters ensures we serve up reliable, accurate results that conform to what users want.
So I started looking at specific examples where the relevant item failed to appear in the top k results.
Query: "Looking for comfortable, high-rise pants... prefer durable denim..."
Actual Item:
- Category: Women
- Subcategory: Bottoms
- Product Type: Jeans
- Title: "High-Waist Blue Jeans"
Generated Filters:
category: Women
subcategory: Bottoms
product_type: [Pants] # Would you make this mistake as a human?
This failure is interesting because a human would never make this mistake. When someone asks for "pants in denim," they're obviously open to jeans. This insight led to our first iteration.
First Iteration: Think Like a Human
We broadened category matching to mirror human thinking by adding rules like the ones below to our structured extraction prompt. These relax the filters and help catch cases where the user is a bit too vague.
- When users mention "pants", include both pants and jeans
- When users ask for "tops", include both blouses and tank tops
Here's how it looked in practice:
Query: "Looking for a comfy cotton top for summer..."
Actual Item:
- Category: Women
- Subcategory: Tops
- Product Type: Blouses
- Title: "Sleeveless Eyelet Blouse"
Previous Filters: [Tank Tops] # Too literal
Updated Filters: [Tank Tops, Blouses] # More human-like reasoning
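These prompt rules could also be backed up by a small post-processing guardrail in code. The sketch below is hypothetical rather than what we shipped: the synonym map simply mirrors the two rules above.

```python
# Hypothetical broadening map that mirrors the prompt rules above.
BROADER_TYPES: dict[str, list[str]] = {
    "Pants": ["Pants", "Jeans"],
    "Tank Tops": ["Tank Tops", "Blouses"],
}

def broaden_product_types(product_types: list[str]) -> list[str]:
    """Expand overly literal product types to their close neighbours."""
    broadened: list[str] = []
    for product_type in product_types:
        for candidate in BROADER_TYPES.get(product_type, [product_type]):
            if candidate not in broadened:
                broadened.append(candidate)
    return broadened

# broaden_product_types(["Pants"]) -> ["Pants", "Jeans"]
```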
This change improved our metrics:
Metric | Updated Results | Previous Results |
---|---|---|
recall@3 | 81.58% | 79.00% |
recall@5 | 81.58% | 79.00% |
recall@10 | 89.47% | 86.00% |
recall@15 | 89.47% | 86.00% |
recall@25 | 89.47% | 86.00% |
We've managed to increase recall by around 3 points - not a huge improvement, but it's a start. So we looked at a few of the other failing examples to see if we could improve recall further.
Trigger-Happy Model Filters
Our next set of failures revealed an interesting pattern:
Query: "Looking for women's bottoms... in denim"
Actual Item:
- Category: Women
- Subcategory: Bottoms
- Product Type: Shorts
- Title: "Classic Denim Shorts"
Generated Filters:
product_type: [Jeans] # Why assume a specific type?
This led to a new insight - our model was choosing a specific product type every time, even though users often didn't specify one at all. While the response model itself allows for multiple product types (or an empty list to indicate that everything is acceptable), the structured extraction model was being too strict.
Therefore, we added the following rule to the structured extraction prompt so that it would be more flexible with its filters.
```
Only choose a product type if the user explicitly mentions it -
otherwise, stay at the category level
```
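Putting it together, the extraction rules accumulate in the system prompt that drives the structured extraction call. The wording below is a sketch of how this might be assembled, not our production prompt.

```python
# Illustrative assembly of the extraction system prompt from the rules above.
EXTRACTION_RULES = [
    "When users mention 'pants', include both pants and jeans.",
    "When users ask for 'tops', include both blouses and tank tops.",
    "Only choose a product type if the user explicitly mentions it - "
    "otherwise, stay at the category level.",
]

SYSTEM_PROMPT = (
    "Extract search filters from the user's query.\n"
    "Follow these rules:\n"
    + "\n".join(f"- {rule}" for rule in EXTRACTION_RULES)
)
```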
The results were dramatic:
Metric | Final Results | Initial Results |
---|---|---|
recall@3 | 92.11% | 79.00% |
recall@5 | 92.11% | 79.00% |
recall@10 | 100.00% | 86.00% |
recall@15 | 100.00% | 86.00% |
recall@25 | 100.00% | 86.00% |
Conclusion
When looking at your data, it's important to realise that the goal is not to chase a perfect 100% score. Language models are inherently probabilistic systems, and we should be looking to understand where they tend to trip up.
By looking for specific patterns in our evaluation data, we can systematically improve our system and work towards a more robust application. Each iteration should look at specific failure modes and test a hypothesis about why the system is failing.
By tackling these issues one at a time, we can add guardrails to our system so that it becomes more reliable and robust over time.