Save ~44% on your LLM bill by sending the easy stuff to your laptop.
The problem
Modern apps that call LLMs over-route to frontier models. A short factual question goes to GPT-4o (or Llama 3.3 70B, or whichever frontier model you picked) exactly the same as a deep reasoning task. The bill compounds with usage that never needed that capability, and there's no clean primitive that fixes this without rewriting the app's calling code.
The proposal
cost-aware-router is an OpenAI-compatible HTTP proxy. You change one line, pointing the OpenAI client's baseURL at http://localhost:8080/v1, and the proxy decides per request whether to route to a local model (Ollama, free) or to the cloud (any OpenAI-compatible provider, paid) based on a transparent heuristic. The routing rule is open, tunable, and roughly 50 lines of code: not magic, just a clean abstraction over a real cost lever.
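For concreteness, here is what that one-line change looks like with the official OpenAI Node SDK. A minimal sketch: the port matches the proposal above, while the model name and prompt are placeholders, and the API key only matters when the proxy routes upstream.

```ts
import OpenAI from "openai";

// The only change: point the client at the proxy instead of api.openai.com.
const client = new OpenAI({
  baseURL: "http://localhost:8080/v1",
  apiKey: process.env.OPENAI_API_KEY, // forwarded only when the proxy picks the cloud
});

// Everything else in the app stays the same; the proxy inspects each request
// and either serves it locally or forwards it to the paid provider.
const reply = await client.chat.completions.create({
  model: "gpt-4o", // illustrative; the proxy may substitute a local model
  messages: [{ role: "user", content: "What is the capital of Australia?" }],
});
console.log(reply.choices[0].message.content);
```

And a minimal sketch of what a transparent, tunable, ~50-line routing heuristic might look like. This is not the project's actual rule, just an illustration of the shape of the idea; every signal and threshold below is an assumption you would tune.

```ts
// Hypothetical heuristic: score a request's "difficulty" from cheap surface
// signals and send anything below the threshold to the free local model.
interface ChatMessage {
  role: string;
  content: string;
}

function routeRequest(messages: ChatMessage[], threshold = 3): "local" | "cloud" {
  const text = messages.map((m) => m.content).join("\n");
  let score = 0;

  // Long prompts usually carry more context and harder tasks.
  if (text.length > 2000) score += 2;
  // Code or stack traces tend to need careful, multi-step reasoning.
  if (/\b(function|class|def|Traceback|Exception)\b/.test(text)) score += 2;
  // Explicit reasoning cues in the request.
  if (/\b(step by step|prove|derive|analyze|compare)\b/i.test(text)) score += 2;
  // Long conversations accumulate state a small model may mishandle.
  if (messages.length > 6) score += 1;

  return score >= threshold ? "cloud" : "local";
}
```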
Why now
Two things came together in 2025-2026: small local models (Qwen 2.5, Llama 3.2) hit 'good enough for simple stuff' for the first time, and frontier cloud pricing stayed expensive. The gap between what the cloud charges and what a local model can handle for half your traffic is real money, yet most apps still send everything to the cloud because there's no easy primitive to do otherwise.
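A back-of-envelope way to see how "half your traffic" can become a ~44% headline number. All token counts here are invented for illustration; the real figure comes from the project's benchmark (see below).

```ts
// Toy savings model (assumed numbers, not the project's benchmark).
// If a fraction p of requests routes locally for free, the saved share of
// the bill is the share of token spend those requests used to carry.
function savedShare(p: number, avgTokensLocal: number, avgTokensCloud: number): number {
  const localSpend = p * avgTokensLocal;
  const cloudSpend = (1 - p) * avgTokensCloud;
  return localSpend / (localSpend + cloudSpend);
}

// Half the requests go local, but easy requests are shorter on average,
// so the saved share lands near 44% rather than a full 50%.
console.log(savedShare(0.5, 400, 500).toFixed(2)); // "0.44"
```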
See the benchmark numbers and run it yourself.
Who's behind this
Built by Aiden as part of an exploration into AI-primitive ideas. This is alpha; feedback welcome.