Skip to main content

Command Palette

Search for a command to run...

Smart LLM Routing

Updated
4 min read

Building LLM apps is easy, but scaling them without setting a pile of money on fire is hard. You really don't need the massive brainpower of GPT-5 for every single user query.

Routing is how we fix this. The concept is simple: dynamically direct incoming prompts to the most appropriate model based on the query itself.

This solves the classic engineering trilemma by saving expensive tokens for hard problems while reducing latency for simple tasks. I've been tinkering with different routing architectures lately, and I've noticed they naturally build on top of each other across four distinct levels of complexity.

Level 1: Just use an if-statement

The absolute simplest approach is using hardcoded logic to route the prompt before it ever touches a neural network. My favorite technique here is context length routing.

You just count the tokens before generation. If it's under 8k, I route it to a local Llama 3 8B instance running on my Mac. If it's over 100k, I hand it off to a model with a massive context window like Gemini 3.5 Pro.

You can also use regex to scan for keywords like "SQL" or "Python" to instantly trigger a specialized coding model. It costs nothing and has zero latency.

The catch? It’s incredibly fragile. A prompt like "Tell me a joke about Python" will falsely trigger your coding route, which means we need a slightly smarter approach to understand intent.

Level 2: Embedding-based semantic routing

To fix the fragility of regex, we have to move beyond exact keyword matches and actually evaluate the meaning of the prompt. This is where semantic routing steps in.

You define routing paths using a handful of exemplar sentences and convert them into vector embeddings. When a new query hits your API, you embed it and calculate the cosine similarity against your predefined paths.

I highly recommend checking out libraries like semantic-router for this. It's surprisingly fast if you run a tiny embedding model like all-MiniLM-L6-v2 locally on your CPU.

But as smart as semantic routing is, it still has a ceiling. Your accuracy depends entirely on maintaining a vector space and curating great exemplar data, which eventually becomes a maintenance bottleneck.

Level 3: Using a model to pick the model

When embeddings aren't enough, you can use an actual machine learning model to act as a traffic cop. The classic approach is fine-tuning a lightweight transformer like BERT to classify queries into specific intent buckets.

If you don't have a dataset to train BERT yet, you can use the "LLM-as-a-router" fallback. Just ask a fast, cheap model like Claude Haiku or Gemini Flash to read the query and output a JSON route.

A prompt like "Categorize this query as MATH, CREATIVE, or CHAT and output only the category name" works wonders. Constrained generation keeps the router from hallucinating.

Here is my favorite trick for this level. Use that LLM router to log 10,000 synthetic routing decisions, then use those logs to fine-tune a tiny BERT model so you can drop your routing latency back to practically zero.

Level 4: Cascading and escalation

Even with a perfect classifier, picking a single model upfront isn't always the right move. Cascading routing fixes this by dynamically escalating to a more capable model mid-flight.

In a single-turn setup, you send the query to the cheapest model first and have it score its own output. If the confidence score is low, you throw away the bad generation and escalate the prompt to a heavier frontier model.

You can also do this statefully across a conversation. Start a chat session with a small, local model and monitor the state to see if it gets stuck.

If the conversation hits five continuous turns without resolving, or the user repeatedly types "No, that's not what I meant," you pull the ripcord. You pass the entire context history to an advanced model like GPT-5 to step in, figure out the mess, and finish the job.