The problem with running LLMs on edge devices isn't the model — it's the runtime. Most inference stacks are designed for data centers: 40GB of HBM, 400W power envelopes, NVLink between GPUs. When you're working with a Jetson Orin or a high-end laptop with 8GB of shared VRAM, those assumptions don't hold. The model fits. The naïve runtime doesn't.
air-runtime is a Python project that packages three techniques — smart routing, speculative decoding, and KV-cache compression — into a single coherent runtime. None of these are new ideas. What's new is combining them in a way that's practical for constrained hardware, without requiring custom CUDA kernels or a PhD to configure.
Technique 1: Smart Routing
Not all queries need the same model. A request like "what's 12 times 7?" doesn't need a 70B parameter model. A request like "refactor this Rust async executor to use a work-stealing scheduler" probably does. Smart routing classifies queries by complexity and dispatches them to the smallest model capable of handling them.
The router is a lightweight classifier — a fine-tuned DistilBERT or a simple logistic regression over TF-IDF features, depending on latency budget. It outputs a complexity score that maps to one of three tiers:
- Tier 0 — small model (1B–3B params). Factual lookups, short completions, simple transformations.
- Tier 1 — medium model (7B–13B). Multi-step reasoning, code generation, summarisation.
- Tier 2 — large model (30B+). Architecture-level decisions, complex debugging, long-horizon synthesis.
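As a rough illustration of the last step, mapping the classifier's score onto a tier can be as small as the sketch below. The thresholds, the `[0, 1]` score range, and the model names are placeholders invented for this example, not values taken from air-runtime; in practice they would be tuned on labelled traffic.

```python
from dataclasses import dataclass

# Hypothetical cut-offs on a complexity score in [0, 1].
# Scores below 0.35 go to Tier 0, below 0.75 to Tier 1, the rest to Tier 2.
TIER_THRESHOLDS = (0.35, 0.75)

# Placeholder model names, one per tier.
TIER_MODELS = ("small-1b", "medium-7b", "large-30b")


@dataclass(frozen=True)
class Route:
    tier: int
    model: str


def route(score: float) -> Route:
    """Map a complexity score to the smallest tier whose range covers it."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be in [0, 1], got {score}")
    # Count how many thresholds the score meets or exceeds: 0, 1 or 2.
    tier = sum(score >= t for t in TIER_THRESHOLDS)
    return Route(tier=tier, model=TIER_MODELS[tier])
```

Keeping the thresholds in one tuple makes it cheap to retune the split between tiers without touching the classifier itself.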
In practice, roughly 60% of queries route to Tier 0, 30% to Tier 1, and only 10% reach Tier 2. Because more than half of all queries land on Tier 0, the median query is served by the smallest model, and smaller models are 5–10× faster on the same hardware. That cuts median latency substantially without meaningfully degrading output quality on the queries that don't need the big model.
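A quick calculation shows why the median moves so much. The traffic shares below are the 60/30/10 split quoted above; the relative latencies are illustrative assumptions (Tier 2 normalised to 1.0, Tier 0 taken as 8× faster, Tier 1 as 2.5× faster), not measurements.

```python
def median_tier(shares):
    """Return the tier that serves the median query, given per-tier traffic shares."""
    cumulative = 0.0
    for tier, share in enumerate(shares):
        cumulative += share
        if cumulative >= 0.5:
            return tier
    raise ValueError("shares must sum to at least 0.5")


def mean_relative_latency(shares, latencies):
    """Traffic-weighted mean latency, relative to always using the largest model."""
    return sum(s * l for s, l in zip(shares, latencies))


SHARES = (0.6, 0.3, 0.1)          # routing mix from the text
REL_LATENCY = (0.125, 0.4, 1.0)   # hypothetical per-tier latency, Tier 2 = 1.0

print("median query served by tier", median_tier(SHARES))
print("mean relative latency:", round(mean_relative_latency(SHARES, REL_LATENCY), 3))
```

Under these assumed latencies the median query runs on Tier 0, while the mean is dominated by the minority of requests that still reach the larger tiers.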
Technique 2: Speculative Decoding
Speculative decoding exploits an asymmetry in transformer inference: generating N tokens autoregressively takes N forward passes of the large model, but verifying N already-proposed tokens takes only one forward pass, because the verifier can process all positions in parallel.
The protocol:
- A small "draft" model generates a sequence of K candidate tokens quickly.
- The large "verifier" model evaluates the entire draft in a single forward pass.
- Tokens are accepted greedily from left to right until the first mismatch. The verifier corrects the first mismatched position and the cycle repeats.
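The three steps above can be sketched with toy "models" that are plain Python functions returning a greedy next token. This is a minimal illustration of the greedy accept-until-mismatch variant, not air-runtime's implementation. The names `speculative_step`, `generate`, `toy_verifier`, and `toy_draft` are made up for the example, and the verifier loop stands in for what a real model would compute in one batched forward pass. The sketch also takes the verifier's extra token when the whole draft is accepted, which is a common convention rather than something stated above.

```python
from typing import Callable, List

# A toy stand-in for a model: given the tokens so far, return the greedy next token.
NextToken = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft: NextToken,
                     verifier: NextToken, k: int) -> List[int]:
    """Run one draft/verify cycle and return the tokens to append to prefix."""
    # 1. The draft model proposes k tokens autoregressively.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    # 2. The verifier gives its own choice at every position. A real verifier
    #    obtains all k + 1 predictions from a single forward pass.
    expected = [verifier(prefix + proposed[:i]) for i in range(k + 1)]

    # 3. Accept left to right until the first mismatch.
    accepted: List[int] = []
    for i in range(k):
        if proposed[i] != expected[i]:
            break
        accepted.append(proposed[i])

    # Append the verifier's token at the first mismatched position, or its
    # extra token if every draft token was accepted.
    accepted.append(expected[len(accepted)])
    return accepted


def generate(prefix: List[int], draft: NextToken, verifier: NextToken,
             k: int, max_new_tokens: int) -> List[int]:
    """Repeat draft/verify cycles until max_new_tokens have been produced."""
    out = list(prefix)
    while len(out) - len(prefix) < max_new_tokens:
        out.extend(speculative_step(out, draft, verifier, k))
    return out[: len(prefix) + max_new_tokens]


def toy_verifier(tokens: List[int]) -> int:
    """Toy large model: counts upward modulo 10."""
    return (tokens[-1] + 1) % 10


def toy_draft(tokens: List[int]) -> int:
    """Toy draft model: agrees with the verifier except at every third position."""
    return (tokens[-1] + 1) % 10 if len(tokens) % 3 else 0


print(generate([1], toy_draft, toy_verifier, k=4, max_new_tokens=12))
```

Because every emitted token is either confirmed or supplied by the verifier, the output matches what plain greedy decoding with the verifier alone would produce; only the number of verifier passes changes.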
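The speedup depends on the draft acceptance rate. If the draft model and verifier agree on most tokens (common for well-aligned model pairs), each verifier pass yields several tokens instead of one. The ceiling is about K×, and it would be reached only if every draft token were accepted and drafting were free; in practice rejected drafts and the draft model's own forward passes both eat into it. For typical English prose or code, acceptance rates of 70–85% are achievable, which can yield up to 3–4× end-to-end throughput improvement under favourable conditions. The Jetson measurement reported below is more modest, at 31 versus 18 tokens/sec.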
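A back-of-the-envelope model makes the dependence on acceptance rate concrete. It assumes each draft token is accepted independently with probability `alpha`, and that the verifier contributes one token per cycle (the correction, or an extra token after a fully accepted draft). Both are simplifying assumptions of this sketch, as is the `draft_cost` parameter, which expresses one draft forward pass as a fraction of one verifier pass.

```python
def expected_tokens_per_cycle(alpha: float, k: int) -> float:
    """Expected tokens emitted per verifier pass for draft length k.

    Equals 1 + alpha + alpha**2 + ... + alpha**k under the independence assumption.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if alpha == 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, k: int, draft_cost: float) -> float:
    """Estimated speedup over plain decoding with the verifier alone.

    One cycle costs k draft passes plus one verifier pass, measured in
    units of a single verifier pass.
    """
    return expected_tokens_per_cycle(alpha, k) / (k * draft_cost + 1.0)


for alpha in (0.70, 0.85):
    tokens = expected_tokens_per_cycle(alpha, k=4)
    speedup = estimated_speedup(alpha, k=4, draft_cost=0.1)
    print(f"alpha={alpha:.2f}: {tokens:.2f} tokens/cycle, ~{speedup:.2f}x speedup")
```

Even a cheap draft model pulls the estimate well below the number of tokens per cycle, which is why acceptance rate alone overstates the end-to-end gain.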
Technique 3: KV-Cache Compression
The KV cache is the transformer's memory. For each token in the context, every attention layer stores a key vector and a value vector. The cache grows linearly with context length: for a 7B model with 32 layers and a 4096-dim hidden state, a 4K context stored in FP16 consumes roughly 2GB of GPU memory. On an 8GB device, that's 25% of your total budget for context alone.
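The 2GB figure can be reproduced with a few lines of arithmetic. The sketch assumes 2 bytes per value (FP16), a batch size of 1, and standard multi-head attention in which the keys and values each span the full hidden dimension; models that use grouped-query attention would cache less.

```python
def kv_cache_bytes(n_layers: int, hidden_dim: int, context_len: int,
                   bytes_per_value: int = 2, batch_size: int = 1) -> int:
    """Size of the KV cache in bytes for a standard multi-head attention model."""
    # The leading 2 accounts for one key vector and one value vector
    # per token, per layer.
    return 2 * n_layers * hidden_dim * context_len * bytes_per_value * batch_size


size = kv_cache_bytes(n_layers=32, hidden_dim=4096, context_len=4096)
print(f"{size / 2**30:.1f} GiB")  # 2.0 GiB
```

Doubling the context length or the batch size doubles the result, which is why the cache, rather than the weights, is often what runs an 8GB device out of memory.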
Head pruning
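Not all attention heads contribute equally. Head-pruning studies have repeatedly found that a substantial fraction of heads, on the order of 20–40% depending on the model and task, can be zeroed out with less than 1% perplexity degradation. I implement a calibration pass on a small representative dataset to compute per-head importance scores, then prune the bottom quartile. A pruned head's keys and values no longer need to be cached, which is where the memory saving comes from.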
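Once the calibration pass has produced a score per head, selecting the bottom quartile is straightforward. The sketch below assumes the scores arrive as a dictionary keyed by `(layer, head)`; how the scores are computed is left out, and the function name and data layout are illustrative rather than air-runtime's actual interface.

```python
from typing import Dict, List, Tuple

Head = Tuple[int, int]  # (layer index, head index)


def heads_to_prune(importance: Dict[Head, float],
                   fraction: float = 0.25) -> List[Head]:
    """Return the lowest-scoring heads, sorted by (layer, head)."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")
    n_prune = int(len(importance) * fraction)
    # Rank heads from least to most important and keep the first n_prune.
    ranked = sorted(importance, key=importance.get)
    return sorted(ranked[:n_prune])


# Made-up scores for a toy model with 2 layers of 4 heads each.
scores = {
    (0, 0): 0.90, (0, 1): 0.10, (0, 2): 0.50, (0, 3): 0.70,
    (1, 0): 0.05, (1, 1): 0.60, (1, 2): 0.80, (1, 3): 0.30,
}
print(heads_to_prune(scores))  # [(0, 1), (1, 0)]
```

Ranking globally rather than per layer means some layers may lose more heads than others; a per-layer quota is an equally reasonable design if uniform cache shapes matter.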
StreamingLLM attention sinks
For long-context inference, I adopt the attention sink approach from StreamingLLM: always retain the first 4 tokens of the sequence (the "sink" tokens that disproportionately accumulate attention weight) plus a sliding window of recent tokens. This bounds cache size to O(window_size) instead of O(sequence_length), so generation can continue over arbitrarily long streams on fixed memory. It does not extend the effective context: tokens that slide out of the window are evicted and the model can no longer attend to them.
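The eviction policy itself is simple enough to sketch with a list and a bounded deque. Here each cache entry is an opaque object standing in for one token's keys and values; the class name `SinkCache` and its methods are invented for this illustration.

```python
from collections import deque
from typing import Any, Deque, List


class SinkCache:
    """Keeps the first n_sink entries plus a sliding window of the most recent ones."""

    def __init__(self, n_sink: int = 4, window: int = 1024) -> None:
        self.n_sink = n_sink
        self.sinks: List[Any] = []
        # A deque with maxlen silently drops its oldest item when full.
        self.recent: Deque[Any] = deque(maxlen=window)

    def append(self, entry: Any) -> None:
        """Add one token's cache entry, evicting the oldest non-sink entry if needed."""
        if len(self.sinks) < self.n_sink:
            self.sinks.append(entry)
        else:
            self.recent.append(entry)

    def entries(self) -> List[Any]:
        """Entries currently visible to attention, in sequence order."""
        return self.sinks + list(self.recent)

    def __len__(self) -> int:
        return len(self.sinks) + len(self.recent)


cache = SinkCache(n_sink=4, window=3)
for token_index in range(10):
    cache.append(token_index)
print(cache.entries())  # [0, 1, 2, 3, 7, 8, 9]
```

The cache never holds more than `n_sink + window` entries regardless of how many tokens are appended. This sketch covers only eviction; StreamingLLM also assigns positions relative to the cache rather than to the original text, which is omitted here.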
Results on a Jetson Orin
Benchmarking on a Jetson AGX Orin (64GB unified memory, 2048-core Ampere GPU) with a 7B quantised model (INT4):
- Baseline: 18 tokens/sec
- + Smart routing (avoiding unnecessary Tier 2 calls): 2.4× median throughput improvement
- + Speculative decoding (1B draft model): 31 tokens/sec on Tier 1 requests
- + KV compression: 40% reduction in peak memory, enabling longer contexts before OOM
Combined, the runtime achieves approximately 2.8× end-to-end throughput improvement on a mixed workload vs. naïve autoregressive generation with the large model only.