alpha demo

Save ~44% on your LLM bill by sending the easy stuff to your laptop.

The problem

Modern apps that call LLMs over-route to frontier models. A short factual question goes to GPT-4o (or Llama 3.3 70B, or whichever frontier model you picked) the same as a deep reasoning task. Costs compound as usage grows, even though much of that usage never needed frontier capability, and there's no clean primitive that fixes this without rewriting the app's calling code.

The proposal

cost-aware-router is an OpenAI-compatible HTTP proxy. You change one line: point the OpenAI client's baseURL at http://localhost:8080/v1. The proxy then decides per-request whether to route to a local model (Ollama, free) or to the cloud (any OpenAI-compatible provider, paid) based on a transparent heuristic. The routing rule is open, tunable, and ~50 lines of code. It's not magic, just a clean abstraction over a real cost lever.
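A heuristic in that spirit can be sketched in a few lines. Everything below is an illustrative assumption, not the project's actual routing rule: the marker list, the length cutoff, and the function name are all made up for the sketch.

```python
# Hypothetical sketch of a per-request routing heuristic.
# REASONING_MARKERS and MAX_LOCAL_CHARS are assumed values, not
# the real cost-aware-router configuration.

REASONING_MARKERS = {"prove", "derive", "step by step", "refactor", "debug"}
MAX_LOCAL_CHARS = 500  # assumed cutoff: longer prompts go to the cloud

def route(prompt: str) -> str:
    """Return 'local' (Ollama, free) or 'cloud' (paid provider)."""
    text = prompt.lower()
    if len(prompt) > MAX_LOCAL_CHARS:
        return "cloud"  # long context: assume a harder task
    if any(marker in text for marker in REASONING_MARKERS):
        return "cloud"  # looks like a reasoning task
    return "local"      # short and simple: send to the local model

print(route("What is the capital of France?"))  # local
print(route("Prove that the sum of two even numbers is even."))  # cloud
```

On the client side, the official OpenAI SDKs already support the one-line change: the Python SDK takes `OpenAI(base_url="http://localhost:8080/v1")`, and the JS SDK takes the equivalent `baseURL` option.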

Why now

Two things came together in 2025-2026: small local models (Qwen 2.5, Llama 3.2) hit 'good enough for simple stuff' for the first time, and frontier cloud pricing stayed expensive. The gap between what cloud charges and what local can do for half your traffic is real money — but most apps still send everything to the cloud because there's no easy primitive to do otherwise.

Run the benchmark

See the benchmark numbers and run it yourself.


Who's behind this

Built by Aiden as part of an exploration into AI-primitive ideas. This is alpha — feedback welcome.

GitHub  ·  contact