Preserving AI state across a pod death

We killed the process for real, then brought the working context back. A controlled measurement of restoring ~64MB of preserved model state in ~140ms, with a clear account of what that figure does and does not include.

Chan Meng · 2 min read

When an AI workload's process dies, the obvious cost is the cold start that follows. The subtler cost is the working state that vanishes with it — the accumulated, in-memory context a model builds up over a session. Rebuild the process and you still have to rebuild that context from scratch, or accept that it is gone.

We wanted to know whether the part that matters could survive a true process death, and how cheaply it could come back.

Kill it for real, then bring it back

This was not a pause or a graceful hand-off. The workload was scaled to zero — a genuine process death — and then a fresh process on a new runtime was brought up in its place.

Before the process exited, it serialized its working state — the model's in-memory attention cache — and wrote it to a fast in-cluster store. On respawn, the new process read that state back and loaded it, with guards that refuse the restore unless the workload and model identity match, so a stale or mismatched snapshot can never be silently applied. The session continued from where it left off rather than starting over.
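The save-and-guarded-restore flow can be sketched roughly as follows. This is a minimal illustration, not the implementation measured here: the dict-like `store`, the function names, the header layout, and the `SnapshotMismatch` exception are all hypothetical, and the checksum is an extra safeguard that the description above does not mention.

```python
import hashlib
import json
import struct


class SnapshotMismatch(Exception):
    """Raised when a snapshot must not be applied to this process."""


def save_snapshot(store, key, workload_id, model_id, state):
    """Write state (bytes) to the store with an identity header in front."""
    header = json.dumps({
        "workload_id": workload_id,
        "model_id": model_id,
        "sha256": hashlib.sha256(state).hexdigest(),  # assumed extra guard
    }).encode("utf-8")
    # Layout: 4-byte big-endian header length, JSON header, raw payload.
    store[key] = struct.pack(">I", len(header)) + header + state


def restore_snapshot(store, key, workload_id, model_id):
    """Return the saved bytes, or raise SnapshotMismatch if a guard fails."""
    blob = store[key]
    (header_len,) = struct.unpack(">I", blob[:4])
    header = json.loads(blob[4:4 + header_len].decode("utf-8"))
    state = blob[4 + header_len:]
    # Refuse the restore unless both identities match the running process.
    if (header["workload_id"], header["model_id"]) != (workload_id, model_id):
        raise SnapshotMismatch("snapshot belongs to a different workload or model")
    if hashlib.sha256(state).hexdigest() != header["sha256"]:
        raise SnapshotMismatch("payload does not match its recorded checksum")
    return state
```

In this sketch a plain `dict` stands in for the in-cluster store. A missing key raises `KeyError`, which a caller would likely treat as an ordinary cold start, while a mismatch fails loudly instead of being applied.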

In a controlled measurement, restoring roughly 64MB of preserved state took about 140ms at the application layer, and the conversation carried on coherently across the restart.
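For a sense of scale, the two rounded figures imply an effective restore throughput that is easy to work out. The helper below is only back-of-the-envelope arithmetic on those approximations; the function name is hypothetical and the result is not a separately measured number.

```python
def effective_throughput_mb_per_s(size_mb, seconds):
    """Effective throughput implied by a payload size and a restore time."""
    return size_mb / seconds


# Roughly 64MB restored in roughly 140ms.
print(f"{effective_throughput_mb_per_s(64, 0.140):.0f} MB/s")  # prints "457 MB/s"
```

Because both inputs are rounded, the result is best read as "a few hundred MB/s", covering read, deserialize, and reload together.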

What the number is — and is not

These are real, controlled, application-level measurements. They are not a production SLA. The ~140ms is the application-level restore cost only — the read, deserialize, and reload — and deliberately excludes the surrounding infrastructure: pod scheduling, container creation, and scale-to-ready time, which are separate and larger. The measurement covers a single active session; preserving many concurrent sessions is a different problem with its own keys and costs.
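One way to keep that boundary explicit is to time only the three application-level phases and nothing around them. The sketch below assumes the restore can be split into three callables (`read`, `deserialize`, `reload`). Those names and the harness itself are illustrative, not the instrumentation used for the measurement above.

```python
import time


def time_restore(read, deserialize, reload):
    """Run the three restore phases and return (result, timings in ms)."""
    t0 = time.perf_counter()
    raw = read()                 # fetch bytes from the store
    t1 = time.perf_counter()
    state = deserialize(raw)     # turn bytes back into in-memory state
    t2 = time.perf_counter()
    result = reload(state)       # hand the state to the model runtime
    t3 = time.perf_counter()
    timings = {
        "read_ms": (t1 - t0) * 1000,
        "deserialize_ms": (t2 - t1) * 1000,
        "reload_ms": (t3 - t2) * 1000,
        "total_ms": (t3 - t0) * 1000,
    }
    return result, timings


# Stand-in phases: a 1 KiB payload instead of a real attention cache.
payload = bytes(1024)
result, timings = time_restore(
    read=lambda: payload,
    deserialize=lambda raw: memoryview(raw),
    reload=lambda state: len(state),
)
print(result, sorted(timings))
```

Pod scheduling, container creation, and readiness all happen before `read()` is ever called, so none of them show up in `total_ms`.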

In other words: this shows that critical AI session context can survive an infrastructure restart and return in about 140ms at the application layer, for a single session under controlled conditions. It does not claim that the whole pod comes back that fast.

Application-level, on purpose

Kernel-level checkpoint/restore can preserve far more of a process, and do so transparently, but it requires privileged execution and infrastructure changes that carry real operational weight. Our approach is deliberately scoped to the application: the workload itself decides what is worth keeping (the state that changes the answer) and saves only that. It is cheaper, safer to operate, and, because the workload names what it preserves, far easier to reason about and verify.
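"The workload names what it preserves" could look something like the sketch below, where a session object declares its preserved fields explicitly and everything else is rebuilt on respawn. The `Session` class, its field names, and the `PRESERVED` tuple are hypothetical and exist only to illustrate the idea.

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    attention_cache: bytes = b""                          # changes the answer: keep
    scratch_buffers: list = field(default_factory=list)   # rebuildable: drop
    metrics: dict = field(default_factory=dict)           # not needed to answer: drop

    # Unannotated, so this is a plain class attribute, not a dataclass field.
    PRESERVED = ("attention_cache",)

    def preserved_state(self):
        """Return only the fields this workload has chosen to keep."""
        return {name: getattr(self, name) for name in self.PRESERVED}


before = Session(attention_cache=b"kv", scratch_buffers=[1, 2], metrics={"turns": 3})
after = Session(**before.preserved_state())  # fresh process, preserved fields only
print(after.attention_cache, after.scratch_buffers, after.metrics)
```

Because the preserved set is a short, explicit list, checking what a snapshot contains comes down to reading one line of code.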

The honest framing is the whole point. We can show, with receipts, that preserved state comes back quickly at the application layer, and we can draw a clear line around what that does and does not include.

Chan Meng

Founding Principal Engineer — Activation, Execution and AI Systems

Activation architecture, execution systems, AI-assisted orchestration, technical proof development, telemetry, and system hardening.
