Edge AI · 10 min read

Adaptive Inference at the Edge: Speculative Decoding and KV-Cache Compression

The bottleneck for edge LLM workloads is usually the runtime, not just the model. air-runtime packages smart routing, speculative decoding, and KV-cache compression for constrained hardware.

air-runtime explores how to make constrained devices more capable without hand-waving away compute budgets. The work combines speculative decoding, routing, and KV-cache compression as runtime-level choices.

The goal is practical: push more useful reasoning through small hardware by treating latency, memory, and routing as product constraints instead of afterthoughts.

The runtime is the product surface

Edge LLM work is often framed as a model-selection problem, but the runtime decides whether the chosen model is usable. Memory pressure, cache behavior, routing latency, and fallback strategy all show up directly in the experience.
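To make the memory-pressure point concrete, here is a rough back-of-the-envelope sketch of how large a KV cache gets. The model shape below is hypothetical and chosen only for illustration; it is not taken from air-runtime. The estimate also ignores any per-block scale or zero-point overhead that a real quantized cache would carry.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_value=2, batch_size=1):
    """Approximate KV-cache size in bytes for a decoder-only transformer."""
    # The leading 2 accounts for storing both a key and a value
    # vector per layer, per KV head, per cached position.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * bytes_per_value * batch_size)


# Hypothetical model shape (an assumption, not an air-runtime config).
shape = dict(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=4096)

fp16_bytes = kv_cache_bytes(**shape, bytes_per_value=2)  # 16-bit cache
int8_bytes = kv_cache_bytes(**shape, bytes_per_value=1)  # 8-bit cache

print(f"16-bit cache: {fp16_bytes / 2**20:.0f} MiB")
print(f" 8-bit cache: {int8_bytes / 2**20:.0f} MiB")
```

On a device with only a few gigabytes of RAM shared with the model weights, a cache of this size is a large share of the budget, which is why cache precision and context length are runtime decisions rather than model decisions.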

air-runtime treats those details as first-class design inputs. The result is a project about orchestration inside inference, not just another wrapper around a model call.

Core mechanisms

Speculative decoding reduces generation latency by letting a smaller draft path propose tokens ahead of the larger model, which then verifies them instead of generating every token itself. KV-cache compression targets the memory cost of cached keys and values, which grows with context length and becomes painful on constrained hardware. Smart routing decides which path should handle a request based on the tradeoff between output quality and the resources available.
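As a minimal sketch of the draft-then-verify idea, the following implements the greedy variant of speculative decoding with toy stand-in models. Everything here is an assumption for illustration: `speculative_decode`, `target_model`, and `draft_model` are hypothetical names, and this is not air-runtime's implementation. Sampling-based variants use a probabilistic accept/reject step instead of the exact-match check shown here.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]  # greedy next-token function


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[Token], max_new_tokens: int,
                       k: int = 4) -> List[Token]:
    """Greedy speculative decoding: output matches target-only decoding."""
    tokens = list(prompt)
    limit = len(prompt) + max_new_tokens
    while len(tokens) < limit:
        # 1. The cheap draft model proposes up to k tokens.
        proposal: List[Token] = []
        for _ in range(min(k, limit - len(tokens))):
            proposal.append(draft(tokens + proposal))

        # 2. The target model checks each proposed position. A real runtime
        #    would score all positions in one batched forward pass; this toy
        #    version calls the target once per position for clarity.
        for guess in proposal:
            expected = target(tokens)
            tokens.append(expected)  # equals guess whenever it is accepted
            if guess != expected:
                break  # first mismatch: discard the rest of the proposal
        else:
            # Every guess was accepted, so the verification pass also
            # yields one extra target token if there is room for it.
            if len(tokens) < limit:
                tokens.append(target(tokens))
    return tokens[len(prompt):]


# Toy stand-ins for real models (hypothetical, for illustration only).
def target_model(ctx: List[Token]) -> Token:
    return (ctx[-1] * 3 + 1) % 7


def draft_model(ctx: List[Token]) -> Token:
    # Agrees with the target except after token 4, to force a rejection.
    return 0 if ctx[-1] == 4 else (ctx[-1] * 3 + 1) % 7


generated = speculative_decode(target_model, draft_model, [1], max_new_tokens=10)
print(generated)
```

The speed-up in practice comes from the target model verifying several draft tokens per forward pass; when the draft is wrong often, the benefit shrinks, which is one reason to make the draft path a runtime choice rather than always on.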

The value is not any one trick in isolation. It is the ability to compose them into a runtime that can adapt under pressure.
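One way to picture that composition is a small policy function that picks a model path, a draft setting, and a cache precision from the current resource picture. This is only a sketch under stated assumptions: the `Plan` type, the `choose_plan` function, the path names, and every threshold are hypothetical and do not describe air-runtime's actual routing logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    model: str       # which path serves the request: "small" or "large"
    use_draft: bool  # whether speculative decoding is enabled
    kv_bits: int     # precision of the KV cache, in bits per value


def choose_plan(free_mem_mb: float, prompt_tokens: int,
                needs_high_quality: bool) -> Plan:
    """Pick a runtime configuration. All thresholds are illustrative."""
    if free_mem_mb < 512:
        # Severe pressure: smallest path, compressed cache, no draft model.
        return Plan(model="small", use_draft=False, kv_bits=8)
    if not needs_high_quality and prompt_tokens < 256:
        # Short, easy request: the small path alone is good enough.
        return Plan(model="small", use_draft=False, kv_bits=16)
    if free_mem_mb < 2048:
        # Room for the large model, but not for a draft model alongside it,
        # so compress the cache instead of speculating.
        return Plan(model="large", use_draft=False, kv_bits=8)
    # Comfortable headroom: large model with speculative decoding.
    return Plan(model="large", use_draft=True, kv_bits=16)


print(choose_plan(free_mem_mb=300, prompt_tokens=1000, needs_high_quality=True))
print(choose_plan(free_mem_mb=1500, prompt_tokens=1000, needs_high_quality=True))
print(choose_plan(free_mem_mb=4000, prompt_tokens=1000, needs_high_quality=True))
```

The point of the sketch is that the three mechanisms share one decision: the same request can be served in different ways as free memory changes, without the caller needing to know which combination was chosen.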

Why it matters

Useful local AI depends on respecting hardware instead of pretending every user has cloud-scale compute. This project shows the kind of systems thinking I want to apply to practical AI infrastructure.

Related work

Flagship 03

air-runtime

Runtime R&D

An inference runtime for constrained hardware that combines speculative decoding, smart routing, and KV-cache compression to make smaller devices more useful.

Edge deployment focus

Core challenge: Pushing more reasoning capability through limited hardware without pretending compute budgets do not exist.

  • Speculative decoding path
  • KV-cache compression layer
  • Runtime-level routing logic
Edge AI, Inference, KV cache, Runtime systems, Python