Cutting startup work without touching the kernel

A controlled look at how choosing a lighter representation of the same workload cut application-level startup work from 4.45s to 1.00s — and exactly where that optimization stops.

Chan Meng · May 26, 2026 · 2 min read

When a compute workload starts cold, a surprising amount of its startup time goes into work that has nothing to do with producing the answer: importing heavy libraries, deserializing model objects, and rebuilding in-memory structures that were thrown away when the last process exited.

We wanted to know how much of that we could remove at the application level — without privileged containers, custom node pools, or kernel-level checkpointing.
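Before removing anything, it helps to see where that import time actually goes. The sketch below assumes a Python workload. It times a single import in a fresh interpreter, so nothing is already cached in `sys.modules`. `cold_import_seconds` is a hypothetical helper, and the standard-library module names only stand in for the heavy framework imports you would really profile. It illustrates the idea and is not the tooling behind the numbers in this post. CPython's built-in `-X importtime` flag gives a finer per-module breakdown of the same thing.

```python
import subprocess
import sys

# Code run inside a brand-new interpreter: time one import and print seconds.
# A fresh process means the module cache starts empty, as on a cold start.
_PROBE = (
    "import importlib, time\n"
    "t0 = time.perf_counter()\n"
    "importlib.import_module({name!r})\n"
    "print(time.perf_counter() - t0)\n"
)


def cold_import_seconds(module_name: str) -> float:
    """Return wall-clock seconds spent importing module_name in a new process."""
    result = subprocess.run(
        [sys.executable, "-c", _PROBE.format(name=module_name)],
        capture_output=True,
        text=True,
        check=True,
    )
    return float(result.stdout.strip())


if __name__ == "__main__":
    # Stand-ins only: substitute the heavy libraries your workload imports.
    for name in ("json", "decimal", "email.mime.multipart"):
        print(f"{name:>24}: {cold_import_seconds(name) * 1000:.1f} ms")
```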

The same answer, lighter representations

The key observation is that one workload can produce identical output from progressively lighter representations of itself. A model loaded through its full framework is the heaviest form. The same model expressed as plain numerical arrays is lighter. Expressed as pure standard-library data and arithmetic, it is lighter still — and, for the right class of workload, it returns the same result.
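For one plausible "right class of workload", a small linear classifier, the lightest rung can look like the sketch below. It assumes the model's weights were exported once to plain JSON. The weights are made up, and `load_model` and `predict` are hypothetical names. Serving then needs only `json` and `math`, with no framework import and no unpickling of framework objects. This illustrates the idea and is not the workload measured here.

```python
import json
import math

# Hypothetical exported artifact: weights of a small linear classifier,
# written out once as plain data so serving needs no numerical framework.
EXPORTED = json.dumps({"weights": [0.8, -1.2, 0.5], "bias": 0.1})


def load_model(blob: str) -> tuple[list[float], float]:
    """Rebuild the model from plain JSON: no framework import, no unpickling."""
    data = json.loads(blob)
    return [float(w) for w in data["weights"]], float(data["bias"])


def predict(weights: list[float], bias: float, features: list[float]) -> float:
    """Logistic-regression score computed with standard-library arithmetic only."""
    if len(features) != len(weights):
        raise ValueError("feature vector length does not match the model")
    z = math.fsum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))


weights, bias = load_model(EXPORTED)
print(round(predict(weights, bias, [1.0, 0.5, 2.0]), 4))
```

The trade-off is that pure-Python arithmetic is slower per operation than a numerical library. The approach only pays off when inference is small enough that startup cost, not compute, dominates.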

In a controlled measurement, walking down that ladder cut application-level startup from 4.45s to 1.00s, a 78% reduction in application-level work, with the output validated as equivalent at each step.
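The 78% figure is the rounded relative reduction, (4.45 - 1.00) / 4.45 ≈ 77.5%. The sketch below shows that arithmetic. It also shows one way to make "validated as equivalent" concrete: compare each rung's outputs against the heaviest representation's outputs within a floating-point tolerance. The function names and tolerance values are assumptions for illustration. The post does not say which equivalence check was used.

```python
import math


def percent_reduction(before_s: float, after_s: float) -> float:
    """Relative reduction in application-level startup time, in percent."""
    return (before_s - after_s) / before_s * 100.0


def outputs_equivalent(reference: list[float], candidate: list[float],
                       rel_tol: float = 1e-9, abs_tol: float = 1e-12) -> bool:
    """True if every candidate output matches the reference within tolerance."""
    return len(reference) == len(candidate) and all(
        math.isclose(r, c, rel_tol=rel_tol, abs_tol=abs_tol)
        for r, c in zip(reference, candidate)
    )


print(f"{percent_reduction(4.45, 1.00):.1f}%")  # 77.5%, reported as 78%
```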

The boundary, stated plainly

These are real, controlled, application-level measurements — repeatable, not a mockup. They are not a production SLA and not a complete end-to-end activation time. The 78% describes application-level work only; it does not remove the infrastructure cost that surrounds it. Pod scheduling, container creation, and image layers still dominate the full cold path, so the end-to-end improvement is far smaller than the application-level number alone suggests.
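A short worked equation shows why the percentage shrinks. As a simplification, assume the cold path is infrastructure time plus application time, and assume the optimization leaves infrastructure time unchanged. $T_{\text{infra}}$ (scheduling, container creation, image layers, in seconds) is a placeholder here, not a measured value.

```latex
R_{\text{e2e}}
  = \frac{(T_{\text{infra}} + 4.45) - (T_{\text{infra}} + 1.00)}{T_{\text{infra}} + 4.45}
  = \frac{3.45}{T_{\text{infra}} + 4.45}
  < \frac{3.45}{4.45} \approx 0.78 \qquad \text{for any } T_{\text{infra}} > 0
```

The absolute saving stays at 3.45s. The larger the infrastructure share, the smaller that saving is as a fraction of the whole.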

Saying that out loud matters. A measurement describes a setup; strip the setup away and you are no longer reporting a result, you are implying one you never tested.

Finding the floor

The lightest representation — pure standard library, no numerical framework — is also where this technique stops. Below it, the remaining time is the language interpreter booting and the container being created. You cannot shave those at the application layer; going further means kernel-level checkpoint/restore or deeper infrastructure changes, which carry their own cost and operational trade-offs.
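The floor itself can be measured. The rough sketch below assumes a Python entry point. It launches a bare interpreter that does no work, then launches the workload, and treats the difference as what the application layer still controls. `median_launch_seconds` is a hypothetical helper, and the `-c` snippets are placeholders. The measurement covers interpreter boot only. Container creation happens before the process exists, so nothing inside it can time that step.

```python
import statistics
import subprocess
import sys
import time


def median_launch_seconds(args: list[str], runs: int = 5) -> float:
    """Median wall-clock time to start a fresh interpreter with args and wait for exit."""
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()
        subprocess.run([sys.executable, *args], check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        samples.append(time.perf_counter() - t0)
    return statistics.median(samples)


# The floor: an interpreter that starts and exits without doing any work.
floor = median_launch_seconds(["-c", "pass"])
# Placeholder for a stdlib-only workload; substitute your real entry point.
workload = median_launch_seconds(["-c", "import json, math; json.dumps([math.exp(1.3)])"])

print(f"interpreter floor: {floor * 1000:.1f} ms")
print(f"workload total:    {workload * 1000:.1f} ms")
print(f"app-level share:   {max(workload - floor, 0.0) * 1000:.1f} ms")
```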

That is the useful conclusion, not a disappointment. Knowing precisely where application-level optimization ends tells you exactly which problems belong to the app and which belong to the platform — and lets you spend effort where it actually moves the number, with the boundary attached every time you report it.

Chan Meng

Founding Principal Engineer — Activation, Execution and AI Systems

Activation architecture, execution systems, AI-assisted orchestration, technical proof development, telemetry, and system hardening.
