Flagship 03 · Inference

air-runtime

Adaptive inference runtime patterns for constrained hardware.

A runtime R&D project combining speculative decoding, KV-cache compression, and routing to make smaller devices more useful for local AI workloads.

Focus

Edge AI

Constraint

Limited VRAM

Layer

Runtime

What shipped

Runtime-level optimization design for constrained hardware

Composable speculative decoding and routing primitives

KV-cache compression as the response to memory pressure

Architecture map

Speculative decoding speeds up generation with a draft-and-verify loop: a cheap draft model proposes tokens and the target model verifies them.

KV-cache compression targets memory pressure during longer interactions.

Routing logic chooses a path based on the request and available compute budget.
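
The draft-and-verify loop can be sketched in a few lines. This is a minimal greedy illustration, not air-runtime's implementation: the `NextToken` interface, `speculative_step`, `generate`, and the two toy models are hypothetical names introduced here, and verification is written as one call per position for readability, where a real runtime would typically score all drafted positions in a single batched forward pass.

```python
from typing import Callable, List

# Hypothetical model interface (an assumption for this sketch):
# given the tokens so far, return the greedy next token.
NextToken = Callable[[List[int]], int]


def speculative_step(target: NextToken, draft: NextToken,
                     tokens: List[int], k: int = 4) -> List[int]:
    """Run one draft-and-verify round and return the newly accepted tokens."""
    # Draft phase: the cheap model proposes k tokens autoregressively.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(tokens + proposed))

    # Verify phase: keep proposals while the target model agrees.
    accepted: List[int] = []
    for tok in proposed:
        expected = target(tokens + accepted)
        if expected != tok:
            # First disagreement: take the target's token and end the round.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # Every proposal matched, so the target contributes one extra token.
    accepted.append(target(tokens + accepted))
    return accepted


def generate(target: NextToken, draft: NextToken,
             prompt: List[int], max_new: int, k: int = 4) -> List[int]:
    """Greedy generation that matches target-only decoding token for token."""
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new:
        tokens.extend(speculative_step(target, draft, tokens, k))
    return tokens[: len(prompt) + max_new]


# Toy stand-ins for real models: the target counts upward modulo 10,
# and the draft agrees except right after token 3.
def toy_target(seq: List[int]) -> int:
    return (seq[-1] + 1) % 10


def toy_draft(seq: List[int]) -> int:
    return 0 if seq[-1] == 3 else (seq[-1] + 1) % 10
```

With the toy models, `generate(toy_target, toy_draft, [0], 8)` returns `[0, 1, 2, 3, 4, 5, 6, 7, 8]`, the same sequence the target would produce alone. The draft only changes how many target-verified tokens land per round, which is why the technique suits a fixed hardware budget: output quality is pinned to the target model.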

Context

Edge AI work becomes real when the hardware budget is fixed. Memory, latency, and model choice stop being abstract benchmarks and start shaping every interaction.

air-runtime asks what the runtime can do to stretch constrained hardware without pretending it is a cloud GPU.

Constraints

Limited memory forces explicit tradeoffs around context length and cache behavior.

Latency improvements have to be felt by the user, not only reported as benchmark numbers.

Runtime complexity has to stay legible enough to debug on real devices.
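
The memory tradeoff between context length and cache behavior can be made concrete with a little arithmetic. The sketch below estimates KV-cache size from model shape and picks a path against a fixed budget. It is illustrative only: `ModelShape`, `kv_cache_bytes`, `choose_path`, and the path names are hypothetical, and 8-bit cache quantization stands in for whatever compression scheme the runtime actually uses (per-block scale overhead is ignored).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    """Hypothetical description of the attention layout that sizes the cache."""
    n_layers: int
    n_kv_heads: int
    head_dim: int


def kv_cache_bytes(shape: ModelShape, seq_len: int,
                   bytes_per_value: int, batch: int = 1) -> int:
    """Size of the KV cache: one K and one V vector per layer, head, and token."""
    return (2 * shape.n_layers * shape.n_kv_heads * shape.head_dim
            * seq_len * bytes_per_value * batch)


def choose_path(shape: ModelShape, seq_len: int, budget_bytes: int) -> str:
    """Pick the least lossy cache strategy that fits the memory budget."""
    if kv_cache_bytes(shape, seq_len, 2) <= budget_bytes:
        return "fp16-cache"        # uncompressed 16-bit cache fits
    if kv_cache_bytes(shape, seq_len, 1) <= budget_bytes:
        return "int8-cache"        # assumed 8-bit compressed cache fits
    return "truncate-context"      # neither fits: shorten the context


def max_context(shape: ModelShape, budget_bytes: int,
                bytes_per_value: int) -> int:
    """Longest context (in tokens) whose cache fits the budget."""
    per_token = kv_cache_bytes(shape, 1, bytes_per_value)
    return budget_bytes // per_token


# Example shape: 32 layers, 8 KV heads, head dimension 128 (illustrative values).
shape = ModelShape(n_layers=32, n_kv_heads=8, head_dim=128)
GIB = 1024 ** 3
MIB = 1024 ** 2
```

For this example shape, an 8192-token context needs exactly 1 GiB of cache at 16 bits per value, so `choose_path(shape, 8192, 768 * MIB)` falls back to `"int8-cache"`, and a 256 MiB budget forces `"truncate-context"`, where `max_context(shape, 256 * MIB, 1)` gives 4096 tokens. Keeping the decision in one small function is also one way to keep routing behavior legible when debugging on a device.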

Tradeoffs

Focused on runtime primitives before building a polished end-user assistant.

Preferred composable mechanisms over a single monolithic optimization.

Kept the project exploratory because hardware behavior should drive the final API.

Outcomes

Built a credible proof point for edge inference under real systems constraints.

Connected model-serving ideas to practical local hardware usage.

Lessons

The runtime is where model capability becomes user experience.

Good AI infrastructure starts by admitting resource limits.