Imagine trying to renovate the foundation of a towering skyscraper without asking its occupants to leave or pause their work. That's exactly what MoonshotAI's Checkpoint Engine does for AI models. It allows large language models to update their brains, the weights, while still running, so there's no downtime. This breakthrough lets developers improve their AI quickly and efficiently, even on models with over a trillion parameters running on thousands of GPUs. It's fast, reliable, and designed to keep AI systems running smoothly while evolving in real time, making it an important tool for cutting-edge AI applications. This article goes over what it is, how it works, and why it matters for the future of large-scale AI systems.
What is Moonshot AI's Checkpoint Engine?
Moonshot AI's Checkpoint Engine is specialized middleware designed to update the weights of large language models (LLMs) in real time during inference without interrupting ongoing operations. This capability is essential in reinforcement learning scenarios where model weights need to be updated frequently. The Checkpoint Engine currently integrates seamlessly with the vLLM inference framework and offers optimized performance through pipelining and memory-management techniques. It also provides features like reusing weights from existing instances to reduce overhead in scaling scenarios.
Architecture
The core of the Checkpoint Engine is the ParameterServer class, which handles the weight-update logic and orchestrates the data flow in three stages:
- H2D (Host to Device): Moves updated weights from CPU memory or storage to GPU memory, using optimized transfer pipelines.
- Broadcast: Distributes the weights across all inference engine instances efficiently, leveraging CUDA IPC buffers for shared-memory communication.
- Reload: Each inference engine then selectively reloads the relevant weight shards from the broadcasted data according to its sharding pattern.
This three-stage pipeline overlaps communication and copying for speed; a conceptual sketch follows below.
When GPU memory is limited, the system can fall back to serial execution to maintain reliability.
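The sketch below illustrates this three-stage flow with plain PyTorch primitives. It is a conceptual sketch only, not the actual ParameterServer implementation (which uses CUDA IPC buffers, a ZeroMQ control plane, and tensor-parallel-aware shard mapping); the function name, bucket size, and reload logic here are illustrative assumptions.
Code:
# Conceptual sketch of the H2D -> Broadcast -> Reload pipeline (not the real ParameterServer API)
import torch
import torch.distributed as dist

def update_weights_in_buckets(model, new_state_dict, bucket_numel=64 * 1024 * 1024, src_rank=0):
    """Stream new weights to every rank in fixed-size buckets.

    Assumes every rank already knows the tensor names/shapes (in practice the
    engine gathers and shares this metadata first) and that the source rank
    holds the new checkpoint in (ideally pinned) host memory.
    """
    if not dist.is_initialized():
        dist.init_process_group("nccl")
    device = torch.device("cuda", torch.cuda.current_device())
    copy_stream = torch.cuda.Stream()  # separate stream so the H2D copy can overlap other work
    params = dict(model.named_parameters())

    names = sorted(new_state_dict.keys())
    bucket_names, numel = [], 0
    for i, name in enumerate(names):
        bucket_names.append(name)
        numel += new_state_dict[name].numel()
        if numel < bucket_numel and i < len(names) - 1:
            continue  # keep filling the current bucket

        if dist.get_rank() == src_rank:
            # Stage 1 - H2D: copy this bucket from host memory to the GPU.
            with torch.cuda.stream(copy_stream):
                flat = torch.cat(
                    [new_state_dict[n].flatten().float() for n in bucket_names]
                ).to(device, non_blocking=True)
            torch.cuda.current_stream().wait_stream(copy_stream)
        else:
            flat = torch.empty(numel, dtype=torch.float32, device=device)

        # Stage 2 - Broadcast: distribute the bucket from the source rank to all ranks.
        dist.broadcast(flat, src=src_rank)

        # Stage 3 - Reload: each rank copies the shards it owns back into the live model.
        offset = 0
        for n in bucket_names:
            shape, n_elems = new_state_dict[n].shape, new_state_dict[n].numel()
            if n in params:  # a real engine maps shards according to its tensor-parallel layout
                params[n].data.copy_(flat[offset: offset + n_elems].view(shape))
            offset += n_elems

        bucket_names, numel = [], 0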
Methods Used
The Checkpoint Engine uses two main methods to update model weights during inference, contrasted in the sketch after this list.
- Broadcast Method: This is the fastest and the default approach, ideal when a large number of inference instances need to be updated simultaneously. It broadcasts the updated weights from CPU memory to all inference GPUs synchronously, ensuring all instances stay perfectly in sync with minimal delay.
- P2P (Peer-to-Peer) Method: This is used when inference instances are added or removed dynamically at runtime. It avoids disrupting existing inference workloads by sending weights directly from the CPUs of existing instances to the GPUs of new instances through a peer-to-peer transfer system, allowing smooth and flexible updates.
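At the collective-communication level, the difference comes down to a one-to-all collective versus targeted point-to-point sends. Below is a simplified sketch of that distinction using torch.distributed; it is only a stand-in for checkpoint-engine's actual implementation, which uses CUDA IPC buffers for broadcast and the mooncake transfer engine (RDMA) for P2P, and the function names here are illustrative.
Code:
# Simplified contrast between the two update paths (illustrative only)
import torch
import torch.distributed as dist

def broadcast_update(weights_gpu: torch.Tensor, src_rank: int = 0):
    """Broadcast method: every inference rank receives the same weight bucket in one collective."""
    dist.broadcast(weights_gpu, src=src_rank)

def p2p_update(weights_gpu: torch.Tensor, new_ranks: list[int], src_rank: int = 0):
    """P2P method: only newly joined ranks are served, so running instances are not disturbed."""
    rank = dist.get_rank()
    if rank == src_rank:
        for dst in new_ranks:
            dist.send(weights_gpu, dst=dst)   # point-to-point send to each new instance
    elif rank in new_ranks:
        dist.recv(weights_gpu, src=src_rank)  # the new instance receives its copy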
How It Works
The Checkpoint Engine orchestrates the entire transfer process. It first gathers the necessary metadata to create a plan, including deciding the right bucket size for data transfer. Then it executes the transfer, controlling the inference engine through a ZeroMQ socket to maximize performance. It organizes the data transfer into pipelines with overlapped communication and copying, enabling fast and efficient weight updates even under heavy workloads.
By combining the methods and architecture described above, the Checkpoint Engine enables live weight updates for LLMs across thousands of GPUs with minimal latency and service disruption.
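The control channel can be pictured as a simple request/reply exchange in which the engine tells each inference worker which weight bucket to expect and when to reload it. The sketch below uses pyzmq to illustrate that pattern; the endpoint and message format are assumptions for illustration, not checkpoint-engine's actual wire protocol.
Code:
# Illustrative ZeroMQ control loop (message schema is assumed, not the real protocol)
import zmq

def send_update_plan(endpoint: str = "tcp://127.0.0.1:5555"):
    """Engine side: tell a worker to receive and reload one weight bucket."""
    sock = zmq.Context.instance().socket(zmq.REQ)
    sock.connect(endpoint)
    sock.send_json({"cmd": "update_weights", "bucket_id": 0, "size_bytes": 1 << 30})
    reply = sock.recv_json()              # wait for the worker to acknowledge the reload
    assert reply.get("status") == "ok"

def worker_control_loop(endpoint: str = "tcp://127.0.0.1:5555"):
    """Inference-worker side: block on control messages and trigger the local reload."""
    sock = zmq.Context.instance().socket(zmq.REP)
    sock.bind(endpoint)
    while True:
        msg = sock.recv_json()
        if msg["cmd"] == "update_weights":
            # ... receive the broadcasted bucket and reload the matching shards here ...
            sock.send_json({"status": "ok"})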
Installation and Usage
Installation
To use the fastest broadcast implementation:
Code:
pip install checkpoint-engine
To use the flexible P2P implementation:
Code:
pip install 'checkpoint-engine[p2p]'
This will install mooncake-transfer-engine to support RDMA transfers between different ranks.
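A quick sanity check that the package is importable (importing the worker extension also requires vLLM to be installed; this snippet is optional):
Code:
# Confirm that checkpoint-engine installed correctly
import checkpoint_engine
print(checkpoint_engine.__file__)   # path of the installed package

# The worker extension class referenced later when launching vLLM
from checkpoint_engine.worker import VllmColocateWorkerExtension
print(VllmColocateWorkerExtension.__name__)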
Example Use Case
Step 1:
Prepare an H800 or H20 machine with 8 GPUs and the latest vLLM. Make sure to include the /collective_rpc API endpoint commit (available in the main branch), since checkpoint-engine uses this endpoint to update weights.
Step 2:
Install checkpoint-engine:
Code:
uv pip install 'checkpoint-engine[p2p]'
Step 3:
For our use case, we will use Qwen/Qwen3-235B-A22B-Instruct-2507 as the test model.
Code:
hf download Qwen/Qwen3-235B-A22B-Instruct-2507 --local-dir /opt/models/Qwen/Qwen3-235B-A22B-Instruct-2507/
Step 4:
Start vLLM in dev mode and set --load-format dummy. Make sure to set --worker-extension-cls=checkpoint_engine.worker.VllmColocateWorkerExtension.
Code:
VLLM_SERVER_DEV_MODE=1 python3 -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 19730 --trust-remote-code \
  --tensor-parallel-size=8 --max-model-len 4096 --load-format dummy \
  --served-model-name checkpoint-engine-demo --model /opt/models/Qwen/Qwen3-235B-A22B-Instruct-2507/ \
  --worker-extension-cls checkpoint_engine.worker.VllmColocateWorkerExtension
To update the weights with checkpoint-engine, there is no need to wait for vLLM to become ready. Use the command below.
Code:
torchrun --nproc-per-node 8 examples/update.py --update-method all --checkpoint-path /opt/models/Qwen/Qwen3-235B-A22B-Instruct-2507/
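Once the update finishes, you can check that the server is now serving the real weights (rather than the dummy-initialized ones) by sending a request to the OpenAI-compatible endpoint started above, for example:
Code:
# Query the vLLM server started on port 19730 above
import requests

resp = requests.post(
    "http://localhost:19730/v1/chat/completions",
    json={
        "model": "checkpoint-engine-demo",  # matches --served-model-name
        "messages": [{"role": "user", "content": "Say hello in one short sentence."}],
        "max_tokens": 32,
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])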
Reuse Weights from Existing Instances
New checkpoint-engine instances can join existing instances and reuse their weights, using the steps below.
Step 1: Start the existing instances with --save-metas-file global_metas.pkl to save the global metas to a file.
Step 2: Use --sleep-time 300 to make sure they stay alive.
Code:
torchrun --nproc-per-node 8 examples/update.py --checkpoint-path $MODEL_PATH \
  --sleep-time 300 --save-metas-file global_metas.pkl
Step 3: After a checkpoint is registered, new instances can obtain a copy of it by setting --load-metas-file global_metas.pkl.
Code:
torchrun --nproc-per-node 8 examples/update.py --load-metas-file global_metas.pkl
FP8 Quantization
Currently, FP8 quantization does not work in vLLM when updating weights out of the box. Checkpoint Engine ships a simple patch in patches/vllm_fp8.patch to handle the weight update correctly. This patch has only been tested with DeepSeek-V3.1 and Kimi-K2, so there is a chance of compatibility issues with other models.
Test
Run a simple correctness test for checkpoint_engine:
Code:
torchrun --nproc-per-node 8 tests/test_update.py
Benchmark
Model | Device Setup | Metadata Gathering | Update (Broadcast) | Update (P2P) |
---|---|---|---|---|
GLM-4.5-Air (BF16) | 8x H800 TP8 | 0.17 seconds | 3.94 seconds (1.42 GiB) | 8.83 seconds (4.77 GiB) |
Qwen3-235B-A22B-Instruct-2507 (BF16) | 8x H800 TP8 | 0.46 seconds | 6.75 seconds (2.69 GiB) | 16.47 seconds (4.05 GiB) |
DeepSeek-V3.1 (FP8) | 16x H20 TP16 | 1.44 seconds | 12.22 seconds (2.38 GiB) | 25.77 seconds (3.61 GiB) |
Kimi-K2-Instruct (FP8) | 16x H20 TP16 | 1.81 seconds | 15.45 seconds (2.93 GiB) | 36.24 seconds (4.46 GiB) |
DeepSeek-V3.1 (FP8) | 256x H20 TP16 | 1.40 seconds | 13.88 seconds (2.54 GiB) | 33.30 seconds (3.86 GiB) |
Kimi-K2-Instruct (FP8) | 256x H20 TP16 | 1.88 seconds | 21.50 seconds (2.99 GiB) | 34.49 seconds (4.57 GiB) |
Insights
Here are a few observations I have made:
- The broadcast method generally gives the fastest update time, since it is optimized for synchronous weight updates across many inference instances.
- The P2P method takes longer but allows dynamic updates when instances join or leave during runtime.
- These benchmarks show the scalability of the Checkpoint Engine, handling trillion-parameter models efficiently on clusters ranging from 8 to 256 GPUs.
Limitations of Checkpoint Engine
While Checkpoint Engine is a powerful solution for live weight updates in LLMs, it currently has some limitations.
- Works Best with vLLM for Now: The engine is mainly tested with the vLLM framework. If you're hoping to use it with other AI frameworks or custom setups, you might need some extra work to get it running smoothly.
- Pipeline Still Improving: The ideal seamless pipeline that perfectly overlaps data movement isn't fully finished yet, which means there's still room to make updates even faster.
- P2P Updates Could Be Smoother: The peer-to-peer method funnels data through one main node before sharing it with the others, which can slow things down when you have a large number of GPUs.
- Needs Extra GPU Memory: The broadcast system uses additional GPU memory to speed things up. On machines with less memory, it falls back to a slower, less efficient process.
- Limited Support for FP8 Models: If you're working with newer FP8-quantized models, you'll need experimental patches, and even then, not all models play nicely beyond the few that have been tested.
Conclusion
Moonshot AI's Checkpoint Engine is a game-changer for updating massive AI models without stopping them. It keeps everything running smoothly even while the model's "brain" is getting smarter in real time. While it still has a few areas to improve, the potential is huge. If you're working with large AI systems, this tool is definitely worth watching. It's helping make the future of AI faster and more efficient, without any downtime.
Frequently Asked Questions
Q. What does Moonshot AI's Checkpoint Engine do?
A. It lets large language models update their weights in real time during inference without downtime, so AI systems stay online while improving.
Q. Which inference frameworks does it support?
A. Right now, it is primarily integrated and tested with the vLLM inference framework.
Q. How do the Broadcast and P2P update methods differ?
A. Broadcast is faster for synchronized updates across many GPUs, while P2P allows flexible updates when instances join or leave.