Vectorless RAG
If you’ve built anything with LLMs in the past couple of years, you’ve probably wired up a Retrieval-Augmented Generation (RAG) pipeline. The playbook is burned into our brains: take a PDF, smash it into 512-token chunks, compute embeddings, shove them into a vector DB, and run a cosine similarity search when a user asks a question.
It works... until it doesn’t.
I’ve been banging my head against the wall with traditional RAG lately, especially on dense technical documentation. Blindly slicing a document into "chunks" obliterates the author's narrative flow. Worse, semantic similarity is a terrible proxy for factual relevance. Just because a chunk sounds like the query doesn't mean it holds the answer.
Lately, I’ve been experimenting with a totally different approach: Vectorless RAG (sometimes called Reasoning-based RAG). It throws out the vector database entirely. Instead of static math, it uses an LLM to perform agentic, context-aware retrieval.
Here’s a breakdown of how it works, what the trade-offs are, and how this exact same pattern is quietly taking over codebase search tools like Claude Code.
How Vectorless RAG Works
Vectorless RAG treats retrieval as an iterative reasoning task. It’s basically teaching an LLM how to read a book: look at the table of contents, find the right chapter, read it, and see if you have the answer.
Phase 1: The "In-Context" Tree Index
Instead of artificial chunking, we parse the document into a semantic, JSON-based hierarchy, essentially a highly detailed Table of Contents. This tree is small enough to live right in the LLM's context window.
Nodes: Chapters or sections become nodes.
Metadata: Every node gets a node_id, a title, a brief summary, and pointers to the raw data (like page or line numbers).
Hierarchy: Nodes contain sub-nodes, mapping out the whole document recursively.
Because we chunk by meaning (sections/chapters) rather than arbitrary token counts, each retrieved unit is a complete section, which avoids the context fragmentation that fixed-size chunking causes.
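Here is a minimal sketch of what such a tree could look like in Python. The field names beyond node_id, title, and summary (start_line, end_line, children), the helper names, and the sample document are all assumptions made for illustration, not a fixed schema.

```python
import json
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """One section of the document. Field names are illustrative."""
    node_id: str
    title: str
    summary: str
    start_line: int  # pointer into the raw text (inclusive)
    end_line: int    # pointer into the raw text (inclusive)
    children: list["Node"] = field(default_factory=list)


def to_outline(node: Node) -> dict:
    """Return structure and summaries only, never the full section text."""
    return {
        "node_id": node.node_id,
        "title": node.title,
        "summary": node.summary,
        "lines": [node.start_line, node.end_line],
        "children": [to_outline(child) for child in node.children],
    }


def find_node(node: Node, node_id: str) -> Optional[Node]:
    """Depth-first lookup of a node by its id."""
    if node.node_id == node_id:
        return node
    for child in node.children:
        found = find_node(child, node_id)
        if found is not None:
            return found
    return None


# Hypothetical sample document.
tree = Node("0", "API Guide", "Reference for the service API.", 1, 400, [
    Node("1", "Authentication", "Tokens, scopes, and expiry.", 1, 120),
    Node("2", "Errors", "Error codes and retry guidance.", 121, 400, [
        Node("2.1", "Rate limits", "429 responses and backoff.", 300, 400),
    ]),
])

# This JSON string is what would be placed in the prompt.
outline_json = json.dumps(to_outline(tree), indent=2)
```

The outline carries only summaries and line pointers, so the prompt stays small; the full text of a section is fetched later by node_id.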
Phase 2: The Agentic Retrieval Loop
When a query comes in, the agent doesn't embed it. It reads the tree and executes a loop:
Read the ToC: Fetch the tree (just the structure and summaries, not the full text).
Reasoning: Evaluate the user's intent. Which node logically contains the answer?
Extract: Fetch the exact, unfragmented text for that specific node_id.
Evaluate: Ask: "Is this enough to answer the question?" If yes, generate the response. If no, go back to step 1 and pick a different node based on what was just learned.
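The loop above can be sketched in a few lines of Python. Everything named here is hypothetical: in a real system pick_node and is_sufficient would be LLM calls, whereas this sketch swaps in toy word-overlap and digit checks so it runs without any API. The flat outline and section contents are made up.

```python
import re


def tokens(text):
    """Lowercase word set; a crude stand-in for an LLM reading the text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def pick_by_overlap(query, outline, history):
    """Toy picker: choose the untried node whose summary shares the most
    words with the query plus everything read so far."""
    tried = {node_id for node_id, _ in history}
    known = tokens(query)
    for _, text in history:
        known |= tokens(text)
    candidates = [node_id for node_id in outline if node_id not in tried]
    if not candidates:
        return None
    return max(candidates, key=lambda n: len(known & tokens(outline[n])))


def has_number(query, text):
    """Toy sufficiency check: any digit counts as 'contains the answer'."""
    return any(ch.isdigit() for ch in text)


def retrieve(query, outline, sections, pick_node, is_sufficient, max_steps=4):
    """Read the outline, pick a node, fetch its text, evaluate, repeat."""
    history = []  # (node_id, text) pairs already read
    for _ in range(max_steps):
        node_id = pick_node(query, outline, history)  # read ToC + reason
        if node_id is None:
            break
        text = sections[node_id]                      # extract
        if is_sufficient(query, text):                # evaluate
            return node_id, text
        history.append((node_id, text))               # learn, then loop
    return None, ""


outline = {
    "1": "Authentication: tokens, scopes, and expiry",
    "2": "Errors: error codes and retry guidance",
    "3": "Appendix: rate limit tables",
}
sections = {
    "1": "Tokens expire after one hour.",
    "2": "For retry limits, see the Appendix.",
    "3": "The API allows 100 requests per minute.",
}

result = retrieve("What is the retry limit?", outline, sections,
                  pick_by_overlap, has_number)
```

In this toy run the first pick is the Errors section, which only points to the Appendix; because the picker sees what was just read, the second pass lands on the Appendix node. The max_steps cap is a design choice to bound latency and cost when no node satisfies the check.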
The Claude Code Parallel: Vectorless Codebase Search
The shift away from vector DBs isn't just for PDFs. I've noticed the exact same architectural shift happening in developer tools. Look at how Anthropic's Claude Code navigates massive local repositories. It doesn't rely on embedded code snippets; it operates as an agent (Understand → Plan → Act → Verify).
Here is how Claude Code mirrors the Vectorless RAG pattern:
| RAG Concept | Claude Code Implementation |
|---|---|
| Semantic Initialization | Parses package.json/Cargo.toml to build a dependency graph; recursively hunts for CLAUDE.md files to bootstrap architectural rules without loading the whole repo. |
| High-Speed Discovery | Ditches semantic search for fast bash utilities: uses bfs for structural mapping and ugrep for near-zero latency string matching. |
| Code Intelligence | Doesn't just match text; uses LSP-backed intelligence (AST parsing) to "jump to definition" and trace actual execution flows deterministically. |
| Context Management | Aggressively prunes noise. If a search returns hundreds of hits, it auto-compacts the logs down to core function signatures to save context tokens. |
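
To make the discovery and context-management rows concrete, here is a rough Python sketch of the general idea: plain string matching, followed by compaction down to function signatures when there are too many hits. This is not how Claude Code is implemented; the function names, the threshold, and the sample files are assumptions for illustration only.

```python
import re

# Matches lines that look like Python function signatures.
SIGNATURE = re.compile(r"^\s*(?:async\s+)?def\s+\w+\(.*\)\s*(?:->\s*[^:]+)?:")


def search(files, needle):
    """Plain substring search; returns (path, line_number, line) hits."""
    hits = []
    for path, source in files.items():
        for number, line in enumerate(source.splitlines(), start=1):
            if needle in line:
                hits.append((path, number, line.strip()))
    return hits


def compact(hits, max_hits=3):
    """If there are too many hits, keep only the function signatures."""
    if len(hits) <= max_hits:
        return hits
    return [hit for hit in hits if SIGNATURE.match(hit[2])]


# Hypothetical in-memory "repository".
files = {
    "billing.py": (
        "def charge(user, amount):\n"
        "    log('charge', user)\n"
        "    return gateway.charge(user, amount)\n"
    ),
    "refund.py": (
        "def refund_charge(charge_id):\n"
        "    # reverse a charge\n"
        "    return gateway.refund(charge_id)\n"
    ),
}

hits = search(files, "charge")
summary = compact(hits)
```

Here every line mentions "charge", but only the two signature lines survive compaction, which is the kind of trimming that keeps a noisy search from flooding the context window.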
The Trade-Offs: Is it worth it?
Vectorless RAG solves the semantic mismatch problem, but it introduces new constraints. Here is the pragmatic breakdown.
The Good
True Relevance: Queries are about intent. An agent can deduce that "how to handle errors" maps to a specific chapter, even if the semantic overlap is low.
Zero Fragmentation: You get whole, coherent sections of text. That removes one common cause of hallucination: answers stitched together from partial chunks.
Handles Cross-References: Traditional RAG chokes on "see Appendix G" because the text lacks similarity to the target data. An agent just looks up Appendix G in its ToC.
Infrastructure: For a single document or a small collection, you can rip out your vector database entirely.
The Bad
Latency is high: A vector lookup takes milliseconds. An LLM reading a JSON tree and executing a multi-step reasoning loop takes seconds. You have to design your UI around this delay.
It gets expensive: Pumping a massive ToC into the prompt for every query, plus the tokens for the reasoning loop, burns through API credits much faster than a static vector search.
Scale Limits: You can't put the ToC of a million documents into a prompt. For massive corpora, you still need a traditional search pass to pre-filter down to a handful of relevant documents before the agent takes over.
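As a rough illustration of that pre-filter step, the sketch below ranks documents by simple term overlap and keeps the top few before any agent is involved. The scoring is deliberately naive (no stemming, no stop-word removal), and the function names and sample summaries are hypothetical; a production system would more likely use BM25 or a similar keyword ranker here.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def prefilter(query, doc_summaries, k=2):
    """Rank documents by how often query terms appear in their summary
    and return the ids of the top k documents with a non-zero score."""
    terms = set(tokenize(query))
    scored = []
    for doc_id, summary in doc_summaries.items():
        counts = Counter(tokenize(summary))
        score = sum(counts[term] for term in terms)
        if score > 0:
            scored.append((score, doc_id))
    # Highest score first; ties broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:k]]


# Hypothetical corpus: one short summary per document.
doc_summaries = {
    "auth.pdf": "OAuth tokens, scopes, and token refresh",
    "billing.pdf": "Invoices, refunds, and how to retry a payment",
    "limits.pdf": "Rate limits and retry backoff for the API",
}

shortlist = prefilter("How do I retry after hitting the rate limit?", doc_summaries)
```

Only the tree indexes for the shortlisted documents would then be loaded into the prompt, so the agentic loop runs over a handful of ToCs rather than the whole corpus.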
Final Thoughts
Vectorless RAG is a fascinating shift. By treating documents like structured narratives instead of bags of embeddings, we unlock a level of precision that traditional RAG struggles to match.
While I wouldn't use it to filter Wikipedia, for deep, accurate Q&A on complex specs or codebase engineering, agentic retrieval is rapidly becoming the new standard. If you're building local coding agents or high-stakes document tools, it's time to start experimenting with reasoning loops over static vector math.