Skip to content

Clustering

Using Language Models to make sense of Chat Data without compromising user privacy

If you're interested in the code for this article, you can find it here where I've implemented a simplified version of CLIO without the PII classifier and most of the original prompts ( to some degree ).

Analysing chat data at scale is a challenging task for 3 main reasons

  1. Privacy - Users don't want their data to be shared with others and we need to respect that. This makes it challenging to do analysis on user data that's specific
  2. Explainability - Unsupervised clustering methods are sometimes difficult to interpret because we don't have a good way to understand what the clusters mean.
  3. Scale - We need to be able to process large amounts of data efficiently.

An ideal solution allows us to understand broad general trends and patterns in user behaviour while preserving user privacy. In this article, we'll explore an approach that addresses this challenge - Claude Language Insights and Observability ( CLIO ) which was recently discussed in a research paper released by Anthropic.

We'll do so in 3 steps

  1. We'll start by understanding on a high level how CLIO works
  2. We'll then implement a simplified version of CLIO in Python
  3. We'll then discuss some of the clusters that we generated and some of the limitations of such an approach

Let's walk through these concepts in detail.