
Thoughts on Advanced Chunking Strategies for RAG


I’ve been thinking a lot recently about the "chunking problem" in Retrieval-Augmented Generation. If you've played around with the llm CLI tool or built anything with Datasette and vector embeddings, you've probably hit the exact wall I described in my recent two-part series on the topic.

In the first piece, All You Need is a Good Chunking, I described the "meat cleaver" approach: slicing documents arbitrarily by character or token count. It’s the easiest way to get a prototype running, but it’s fundamentally broken: it destroys context. Feed those truncated, out-of-context chunks to an LLM and the result is hallucination and confusion.
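
To be concrete about what I mean by the meat cleaver, this is essentially the whole technique. The chunk size and overlap here are arbitrary placeholder numbers, not recommendations:

```python
def cleaver_chunks(text, chunk_size=1000, overlap=100):
    """Slice text into fixed-size character chunks, ignoring sentence
    and paragraph boundaries entirely."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```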

My follow-up piece dives into how we move past the meat cleaver and pick up the scalpel. Here is how I'm thinking about the three state-of-the-art approaches I outlined:

Semantic Chunking

This is my current go-to for most of my personal projects. The math here is incredibly cheap to run: you calculate cosine similarity between adjacent sentence embeddings and split at the "valleys" where the similarity drops, which usually means the topic has shifted.

If you're using an embedding model like all-MiniLM-L6-v2 locally, you don't even need a GPU; it runs blazingly fast on a Mac M-series chip. I’ve built prototypes using sqlite-vec (the successor to sqlite-vss) where I just store the individual sentences and their embeddings in a SQLite database, and then run a quick Python script to group them based on similarity drops. It's a massive upgrade over fixed-size chunking and costs fractions of a cent.
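
Here's a minimal sketch of that similarity-drop approach. The regex sentence splitter and the 0.35 threshold are placeholder assumptions you'd want to tune per corpus; the model name is the same all-MiniLM-L6-v2 I mentioned above:

```python
import re
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_chunks(text, threshold=0.35):
    # Naive sentence splitter; swap in something sturdier for real documents.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not sentences:
        return []

    # Unit-length embeddings, so the dot product is the cosine similarity.
    embeddings = model.encode(sentences, normalize_embeddings=True)
    sims = np.sum(embeddings[:-1] * embeddings[1:], axis=1)

    chunks, current = [], [sentences[0]]
    for sentence, sim in zip(sentences[1:], sims):
        if sim < threshold:  # a "valley": the topic has shifted
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks
```

A threshold derived from the similarity distribution itself (say, the mean minus one standard deviation) tends to travel better between corpora than a hard-coded value.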

Neural Chunking

This is a fascinating approach that I haven't experimented with as much yet. Using a custom BERT model specifically trained for sequence classification and boundary detection makes a ton of sense for highly structured documents.

I really appreciate that tools like Chonkie are bundling this. The barrier to entry for running specialized NLP models used to be configuring a massive PyTorch pipeline; now it's just a pip install away. The latency overhead is a real trade-off, though, especially if you are trying to ingest documents on the fly.
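
I haven't wired this into anything real yet, but the mechanics look roughly like the sketch below, which is essentially what Chonkie bundles up for you. The checkpoint name is a placeholder, not a real model, and which logit index means "split" depends entirely on how the model was trained:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Placeholder checkpoint: any BERT-style model fine-tuned to predict whether
# a chunk boundary belongs between two adjacent sentences.
MODEL_ID = "your-org/chunk-boundary-bert"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)

def neural_chunks(sentences, split_label=1):
    """Start a new chunk whenever the classifier predicts a boundary
    between sentence i and sentence i+1."""
    if not sentences:
        return []
    chunks, current = [], [sentences[0]]
    for prev, nxt in zip(sentences, sentences[1:]):
        inputs = tokenizer(prev, nxt, return_tensors="pt", truncation=True)
        with torch.no_grad():
            logits = model(**inputs).logits
        if logits.argmax(dim=-1).item() == split_label:
            chunks.append(" ".join(current))
            current = []
        current.append(nxt)
    chunks.append(" ".join(current))
    return chunks
```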

Agentic Chunking

This is the absolute gold standard for quality, and it's where things get really interesting from a prompt engineering perspective.

I noted in the original post that this is "exorbitantly expensive," but the math on that is actually changing rapidly. A year ago, using GPT-4 for this would have completely blown your API budget. Today? Running an agentic "proposition extraction" pipeline using Gemini 1.5 Flash or Claude 3.5 Haiku is shockingly affordable.

I've been running experiments piping messy documents through these fast, cheap models, prompting them to rewrite the text into self-contained propositions before embedding them. It solves the "pronoun problem" (where a chunk starts with "He did it" and the embedding model has no idea who "He" is) beautifully.
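
The heart of the pipeline is one prompt per passage. Here's a minimal sketch using the llm library's Python API; the model ID assumes you have the llm-gemini plugin installed (swap in whatever cheap model you prefer), and the prompt wording is just a starting point, not something I'd call settled:

```python
import llm

# Assumes the llm-gemini plugin is installed; any fast, cheap model works here.
model = llm.get_model("gemini-1.5-flash-latest")

SYSTEM = (
    "Rewrite the passage as a list of self-contained propositions, one per "
    "line, with no numbering or bullets. Each proposition must stand on its "
    "own: resolve every pronoun to the entity it refers to, and keep all "
    "names, dates and numbers."
)

def propositions(passage):
    response = model.prompt(passage, system=SYSTEM)
    # One proposition per line; these become the units you embed and index.
    return [line.strip() for line in response.text().splitlines() if line.strip()]
```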

The Takeaway

I strongly stand by my golden rule: don't use agentic chunking if semantic chunking gets the job done.

Start with the dumbest thing that could possibly work. If fixed-size token chunking falls short, upgrade to semantic chunking. Only pull out the heavy LLM-based agentic chunking if you are working with incredibly messy data and your retrieval metrics prove you actually need it.