The Cost Problem with Enterprise AI
When you run dozens of AI agents processing thousands of requests daily, API costs add up fast. A single premium-tier model call can cost 10–50x more than a lightweight local model. Yet most enterprises either route everything through an expensive model (burning budget) or use a cheap model for everything (sacrificing quality). Neither approach works at scale.
At ESKOM.AI, we solved this with 8-tier LLM routing — a system that automatically matches each request to the most cost-effective model capable of handling it. The result: 70% cost reduction compared to routing everything through a top-tier model, with no measurable drop in output quality for production tasks.
How 8-Tier Routing Works
Every incoming request is analyzed for complexity, domain requirements, and required output quality before it reaches any LLM. The routing engine considers factors like token count, reasoning depth, tool-use requirements, and the requesting agent's quality threshold. Here's a simplified view of our tiers:
- Tier 1 (Free) — Lightweight open-source models running locally. Handles simple classifications, keyword extraction, and data formatting. Zero API cost.
- Tiers 2–3 (Low cost) — Larger open-source models (8B–70B parameters) on local GPU. Good for summarization, translation, and structured data extraction.
- Tiers 4–5 (Medium) — Mid-tier cloud models. Balanced cost-performance for most business tasks.
- Tiers 6–7 (High) — Advanced cloud models. Complex reasoning, multi-step analysis, code generation.
- Tier 8 (Premium) — Top-tier premium models. Reserved for critical decisions: legal analysis, financial modeling, architectural design, CEO-facing outputs.
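The tier list above can be sketched as a routing table. This is a minimal illustration, not ESKOM.AI's actual configuration: the tier names and per-token costs are invented placeholders, and the only real logic is "pick the cheapest tier that meets the required level."

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Tier:
    level: int                  # 1 (cheapest) .. 8 (premium)
    name: str                   # placeholder name, not a real model
    cost_per_1k_tokens: float   # illustrative USD figures, not real pricing
    local: bool                 # runs on local GPU vs. cloud API

# Hypothetical tier table mirroring the list above
TIERS = [
    Tier(1, "local-small",      0.0,    True),
    Tier(2, "local-8b",         0.0002, True),
    Tier(3, "local-70b",        0.0008, True),
    Tier(4, "cloud-mid-a",      0.002,  False),
    Tier(5, "cloud-mid-b",      0.004,  False),
    Tier(6, "cloud-advanced-a", 0.01,   False),
    Tier(7, "cloud-advanced-b", 0.03,   False),
    Tier(8, "cloud-premium",    0.06,   False),
]

def pick_tier(min_level: int) -> Tier:
    """Return the cheapest tier at or above the required capability level."""
    return min((t for t in TIERS if t.level >= min_level),
               key=lambda t: t.cost_per_1k_tokens)
```

Because costs rise monotonically with capability here, `pick_tier` simply lands on the lowest qualifying tier; a real table with overlapping price points would still resolve correctly.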
The Intelligence Behind Routing
The routing decision isn't a simple keyword lookup. Our classifier evaluates each request across multiple dimensions: reasoning complexity (does it need chain-of-thought?), factual precision (can it hallucinate safely or must it be exact?), output format (free text vs. structured JSON), and business criticality (internal draft vs. client-facing document). The classifier itself runs on a lightweight model, adding negligible latency.
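A multi-dimensional classifier like the one described can be sketched as a weighted score over request features. The feature names and weights below are illustrative assumptions, not the production classifier (which runs on a model, not hand-tuned rules):

```python
def classify(request: dict) -> int:
    """Map request features to a minimum tier (1-8).

    Each dimension from the text contributes to the score:
    reasoning complexity, factual precision, output format,
    and business criticality. Weights are hypothetical.
    """
    score = 0
    if request.get("needs_chain_of_thought"):
        score += 3   # reasoning complexity dominates
    if request.get("must_be_exact"):
        score += 2   # factual precision: no safe hallucination
    if request.get("structured_output"):
        score += 1   # e.g. strict JSON schema
    if request.get("client_facing"):
        score += 2   # business criticality
    # Clamp to the valid tier range: empty score -> Tier 1
    return max(1, min(8, score))
```

A simple keyword-extraction request scores 0 and routes to Tier 1, while a client-facing analysis needing chain-of-thought and exact figures hits the ceiling.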
Critically, agents can override the router. When our CFO agent processes a quarterly financial report, it always escalates to Tier 7–8 regardless of apparent complexity. Domain-specific overrides ensure that business context trumps algorithmic classification.
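One simple way to express such overrides is a per-agent tier floor that caps the classifier from below. The agent names and floors here are hypothetical examples, assuming the override takes the maximum of the classified tier and the agent's floor:

```python
# Hypothetical domain-specific floors; business context trumps the classifier
AGENT_TIER_FLOORS = {
    "cfo-agent": 7,     # quarterly financials always go high-tier
    "legal-agent": 8,   # legal analysis is reserved for premium
}

def route(agent: str, classified_tier: int) -> int:
    """Apply an agent's tier floor on top of the classifier's choice."""
    return max(classified_tier, AGENT_TIER_FLOORS.get(agent, 1))
```

A floor (rather than a fixed assignment) preserves upward routing: if the classifier already demands Tier 8, the CFO agent's floor of 7 does not pull it down.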
Measuring What Matters
We track routing effectiveness through three metrics: cost per resolved task (not per API call), quality score (human-rated sample of outputs), and escalation rate (how often a lower-tier response gets rejected and re-routed upward). After six months in production, our escalation rate sits below 3%, meaning the router correctly identifies the right tier more than 97% of the time. For enterprises considering multi-model strategies, the lesson is clear: intelligent routing isn't optional — it's the difference between sustainable AI operations and runaway costs.
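Two of these metrics can be computed directly from task outcome records. A minimal sketch, assuming each outcome logs its total cost (including any re-routed retries), whether it escalated, and whether it resolved; the record schema is an assumption, not ESKOM.AI's telemetry format:

```python
def escalation_rate(outcomes: list[dict]) -> float:
    """Fraction of tasks whose initial lower-tier response was
    rejected and re-routed upward."""
    escalated = sum(1 for o in outcomes if o["escalated"])
    return escalated / len(outcomes)

def cost_per_resolved_task(outcomes: list[dict]) -> float:
    """Total spend divided by resolved tasks -- this charges retries
    to the task, unlike a naive cost-per-API-call metric."""
    total_cost = sum(o["cost"] for o in outcomes)
    resolved = sum(1 for o in outcomes if o["resolved"])
    return total_cost / resolved
```

Measuring cost per resolved task rather than per call is what keeps the router honest: an escalation that doubles a task's API calls still counts as one task, so failed cheap attempts show up in the metric instead of hiding in call counts.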