Adaptive Inference at the Edge: Speculative Decoding and KV-Cache Compression
The bottleneck for edge LLM workloads is often the runtime rather than the model alone. air-runtime packages smart routing, speculative decoding, and KV-cache compression for constrained hardware.
air-runtime explores how to make constrained devices more capable without hand-waving away compute budgets. The work combines speculative decoding, routing, and KV-cache compression as runtime-level choices.
The goal is practical: push more useful reasoning through small hardware by treating latency, memory, and routing as product constraints instead of afterthoughts.
The runtime is the product surface
Edge LLM work is often framed as a model-selection problem, but the runtime decides whether the chosen model is usable. Memory pressure, cache behavior, routing latency, and fallback strategy all show up directly in the experience.
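To make the memory pressure concrete, here is a rough back-of-the-envelope sketch in Python of how large a KV cache gets. The model shape (32 layers, 8 KV heads, head dimension 128) and the function name are hypothetical examples, not air-runtime's actual targets, and the estimate ignores the small overhead of quantization scales.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, bits_per_value: int, batch: int = 1) -> int:
    """Approximate KV-cache size in bytes for one decoding session.

    The factor of 2 accounts for storing both keys and values.
    Assumption: every layer caches the full sequence at the same precision.
    """
    values = 2 * layers * kv_heads * head_dim * seq_len * batch
    return values * bits_per_value // 8


# Hypothetical model shape at a 4096-token context.
for bits in (16, 8, 4):
    size = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128,
                          seq_len=4096, bits_per_value=bits)
    print(f"{bits}-bit cache: {size / 2**20:.0f} MiB")
# 16-bit cache: 512 MiB
# 8-bit cache: 256 MiB
# 4-bit cache: 128 MiB
```

On a device with a few gigabytes of RAM shared with everything else, the gap between those three numbers can decide whether a long context fits at all.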
air-runtime treats those details as first-class design inputs. The result is a project about orchestration inside inference, not just another wrapper around a model call.
Core mechanisms
Speculative decoding cuts generation latency: a smaller draft path proposes several tokens ahead, and the larger model verifies them, keeping the ones it agrees with. KV-cache compression targets the memory cost of cached attention keys and values, which grows with context length and becomes painful on constrained hardware. Smart routing decides which path should handle a request based on the tradeoff between quality and resources.
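A minimal sketch of the speculative decoding idea, in its greedy form, might look like the following. This is a toy illustration and not air-runtime's implementation: the "models" are plain Python callables that map a token list to the next token, and all names (`speculative_decode`, `greedy_decode`, `k`) are assumptions made for the example.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # hypothetical model interface


def greedy_decode(model: NextToken, prompt: List[int], max_new_tokens: int) -> List[int]:
    """Baseline: one model call per generated token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(model(tokens))
    return tokens


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       max_new_tokens: int, k: int = 4) -> List[int]:
    """Greedy speculative decoding: draft proposes up to k tokens, target verifies."""
    if k < 1:
        raise ValueError("k must be at least 1")
    tokens = list(prompt)
    goal = len(prompt) + max_new_tokens
    while len(tokens) < goal:
        # 1. The cheap draft path runs ahead by up to k tokens.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(tokens))):
            proposal.append(draft(tokens + proposal))

        # 2. The target checks each proposed position in order.
        accepted: List[int] = []
        for guess in proposal:
            expected = target(tokens + accepted)
            if guess == expected:
                accepted.append(guess)
            else:
                accepted.append(expected)  # target's correction replaces the miss
                break
        else:
            # Every guess matched, so the target contributes one extra token if room remains.
            if len(tokens) + len(accepted) < goal:
                accepted.append(target(tokens + accepted))
        tokens.extend(accepted)
    return tokens
```

Because every kept token is one the target would have chosen anyway, the output matches plain greedy decoding with the target. In this toy the target is called once per position; a real runtime would score all proposed positions in a single batched forward pass, which is where the latency win comes from.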
The value is not any one trick in isolation. It is the ability to compose them into a runtime that can adapt under pressure.
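As a rough sketch of what adapting under pressure could mean in code, the router below picks the highest-quality path that fits the current memory and latency budget, and falls back to the lightest path when nothing fits. The `Path` fields, the path names, and every number are hypothetical placeholders for illustration, not measurements or air-runtime's actual routing policy.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Path:
    name: str
    quality: float    # assumed relative quality score, higher is better
    memory_mb: int    # assumed peak memory, including KV cache
    latency_ms: int   # assumed typical response latency


def route(paths: Sequence[Path], free_memory_mb: int, latency_budget_ms: int) -> Path:
    """Pick the best-quality path that fits the budget, else the lightest one."""
    if not paths:
        raise ValueError("at least one path is required")
    feasible = [p for p in paths
                if p.memory_mb <= free_memory_mb and p.latency_ms <= latency_budget_ms]
    if feasible:
        return max(feasible, key=lambda p: p.quality)
    # Fallback: nothing fits, so degrade to the smallest memory footprint.
    return min(paths, key=lambda p: p.memory_mb)


# Hypothetical paths composed from the mechanisms above.
PATHS = [
    Path("small-only", quality=0.60, memory_mb=800, latency_ms=120),
    Path("large+speculative", quality=0.90, memory_mb=3000, latency_ms=400),
    Path("large+compressed-kv", quality=0.85, memory_mb=2000, latency_ms=500),
]

print(route(PATHS, free_memory_mb=4000, latency_budget_ms=1000).name)  # large+speculative
print(route(PATHS, free_memory_mb=2500, latency_budget_ms=1000).name)  # large+compressed-kv
print(route(PATHS, free_memory_mb=500, latency_budget_ms=1000).name)   # small-only
```

The design choice worth noting is the explicit fallback: when the device is under pressure, the router degrades to a smaller path rather than failing the request.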
Why it matters
Useful local AI depends on respecting hardware instead of pretending every user has cloud-scale compute. This project shows the kind of systems thinking I want to apply to practical AI infrastructure.
Related work
air-runtime
An inference runtime for constrained hardware that combines speculative decoding, smart routing, and KV-cache compression to make smaller devices more useful.
Core challenge: Pushing more reasoning capability through limited hardware without pretending compute budgets do not exist.
- Speculative decoding path
- KV-cache compression layer
- Runtime-level routing logic
