[RFC] NVIDIA Model Optimizer — Product Roadmap
NVIDIA Model Optimizer provides a unified library of SOTA model optimization techniques — quantization, sparsity, pruning, distillation, NAS, and speculative decoding — that compress deep learning models for downstream deployment frameworks.
This document outlines our investment areas and upcoming work. Details are subject to change and we'll update this roadmap regularly. We welcome questions and feedback in this thread and feature requests in GitHub Issues.
We organize upcoming work into two horizons: Now (Q2 – Q3 2026) and early plans for Next (Q4 2026). Specific items may shift between horizons as work progresses.
1. Model optimization techniques
Model Optimizer collaborates with internal teams and external research labs to continuously develop and integrate state-of-the-art techniques into the library. Active areas include advanced PTQ, QAT with distillation, attention sparsity, token-efficient pruning, NAS, distillation, mixed-precision quantization, and emerging speculative-decoding techniques.
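As a rough illustration of what block-wise low-precision PTQ does to a weight tensor, here is a minimal pure-Python sketch of NVFP4-style fake quantization (quantize, then dequantize). It assumes the FP4 E2M1 magnitude grid and 16-element blocks. For simplicity it keeps the per-block scale in full precision, whereas NVFP4 stores block scales in FP8. The function names are hypothetical and this is not the Model Optimizer implementation.

```python
# Illustrative sketch only; names are hypothetical, not the ModelOpt API.

E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)  # magnitudes representable in FP4 E2M1
BLOCK_SIZE = 16  # one shared scale per 16 consecutive elements


def fake_quantize_block(block):
    """Quantize then dequantize one block using a shared absmax scale."""
    absmax = max(abs(v) for v in block)
    if absmax == 0.0:
        return [0.0] * len(block)
    # Map the largest magnitude in the block onto the largest grid value (6.0).
    # Simplification: the scale stays in full precision here.
    scale = absmax / E2M1_GRID[-1]
    out = []
    for v in block:
        # Snap the scaled magnitude to the nearest grid point.
        magnitude = min(E2M1_GRID, key=lambda g: abs(abs(v) / scale - g))
        out.append(magnitude * scale if v >= 0 else -magnitude * scale)
    return out


def fake_quantize(values, block_size=BLOCK_SIZE):
    """Apply block-wise fake quantization to a flat list of floats."""
    out = []
    for start in range(0, len(values), block_size):
        out.extend(fake_quantize_block(values[start:start + block_size]))
    return out
```

Comparing `fake_quantize(weights)` against `weights` gives a quick feel for the rounding error a calibration or training-based method would then try to reduce.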
Now (Q2 – Q3 2026)
PTQ Recipe Framework — modular pipeline for post-training quantization across LLM, VLM, and diffusion/video models. YAML recipe interface with mixed-precision support; pluggable calibration; unified Hugging Face export.
PTQ Developer Experience
QAT/QAD and Training-based PTQ methods — framework, algorithms, and efficiency infrastructure for optimization techniques that use a training pass.
Speculative decoding — recipes and examples in examples/speculative_decoding/. Coverage list maintained alongside the NVFP4 tracker.
dLLM (diffusion LLM) support — NVFP4 PTQ recipes for diffusion-based language models, including hybrid precision schemes.
NAS and Pruning — two complementary frameworks for compressing models via structural pruning and distillation.
Puzzletron — heterogeneous pruning across any model architecture, with M-Bridge distillation.
Minitron — LLM-focused parameter NAS with short distillation.
Next (Q4 2026)
2. Framework architecture
Model Optimizer's architecture is organized around five components, each on its own publishing track. The goal is a layered framework where contributors can add a model or technique by working in a single component, without reading the rest of the codebase.
Optimization Lib — model-agnostic algorithms (PTQ, QAT, QAD, pruning, distillation, AWQ, numerics).
Modeling Lib — per-model modules organized like Hugging Face Transformers (models/<model>/<opt_method>.py).
Recipes Lib — first-class YAML/JSON recipes per (model × technique), shipped alongside the optimized checkpoint, with a trust_remote_code-style escape hatch for customization.
Export Lib — formally defined HF export format covering quantization, sparsity, and mixed precision, aligned with compressed-tensors.
Unified entrypoint (nv-modelopt) — single CLI replacing example scripts, invoked as nv-modelopt --model X --recipe Y --output Z. Distributed execution via torchrun or sbatch.
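To make the component boundaries above concrete, the sketch below parses the three flags named for the unified entrypoint and resolves a module path following the models/<model>/<opt_method>.py layout. Only the flag names and the path layout come from this roadmap. The helper names, the help strings, and the idea that the CLI is built on argparse are assumptions for illustration.

```python
import argparse
from pathlib import PurePosixPath


def build_parser():
    """Parser for the three flags named in the roadmap (hypothetical implementation)."""
    parser = argparse.ArgumentParser(prog="nv-modelopt")
    parser.add_argument("--model", required=True, help="model name or checkpoint path")
    parser.add_argument("--recipe", required=True, help="path to a YAML/JSON recipe")
    parser.add_argument("--output", required=True, help="directory for the exported checkpoint")
    return parser


def plan_run(argv):
    """Turn command-line arguments into a plain dict describing the run."""
    args = build_parser().parse_args(argv)
    return {"model": args.model, "recipe": args.recipe, "output": args.output}


def modeling_module_path(model, opt_method):
    """Path of a per-model module under the models/<model>/<opt_method>.py layout."""
    return str(PurePosixPath("models") / model / f"{opt_method}.py")
```

For example, `plan_run(["--model", "X", "--recipe", "Y", "--output", "Z"])` returns a dict with those three values, and `modeling_module_path("llama", "ptq")` yields `models/llama/ptq.py`. The model and method names here are placeholders.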
Now (Q2 – Q3 2026)
Recipes Library — YAML recipe configs as first-class artifacts, with composable layers for algorithm, calibration dataset, numerics, and customizations.
Modeling Library — per-model modules organized like Hugging Face Transformers; LLM dynamic modeling code reorganized into per-model folders.
nv-modelopt unified entrypoint — single CLI for ModelOpt workflows, replacing the scattered example scripts.
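One way to picture the composable recipe layers listed above (algorithm, calibration dataset, numerics, customizations) is as dictionaries merged in order, with later layers overriding earlier ones. The sketch below assumes that model. Every key name and value in it is hypothetical and does not reflect the actual recipe schema.

```python
import copy
import json


def _merge_into(target, source):
    """Recursively merge source into target; source wins on conflicts."""
    for key, value in source.items():
        if isinstance(value, dict) and isinstance(target.get(key), dict):
            _merge_into(target[key], value)
        else:
            # Copy so the merged recipe never aliases an input layer.
            target[key] = copy.deepcopy(value)


def merge_layers(*layers):
    """Deep-merge recipe layers from left to right."""
    merged = {}
    for layer in layers:
        _merge_into(merged, layer)
    return merged


# Hypothetical layers; key names are illustrative only.
algorithm = {"algorithm": {"method": "ptq"}}
calibration = {"calibration": {"dataset": "example-dataset", "num_samples": 512}}
numerics = {"numerics": {"weights": "nvfp4", "activations": "nvfp4"}}
customization = {"numerics": {"overrides": {"lm_head": "bf16"}}}

recipe = merge_layers(algorithm, calibration, numerics, customization)
print(json.dumps(recipe, indent=2))
```

Keeping each layer as its own small document means a per-model customization can be reviewed in isolation, which fits the goal of letting contributors work in a single component.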
Next (Q4 2026)
3. Serving and runtime integration
Model Optimizer integrates with the major inference and training stacks so that ModelOpt-optimized checkpoints deploy without glue code.
Now (Q2 – Q3 2026)
vLLM and SGLang upstream — ModelOpt established as a first-class quantization backend in vLLM and SGLang, with canonical schemes aligned with compressed-tensors.
KV cache quantization — recipe definitions and PyTorch accuracy validation for KV cache quantization schemes. Runtime deployment (memory layout, kernels) lives in vLLM, SGLang, and TRT-LLM.
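Accuracy validation for a KV cache quantization scheme can be pictured as comparing attention weights computed from the original keys against weights computed from quantized-then-dequantized keys. The sketch below uses symmetric per-vector int8 as a stand-in scheme and pure-Python lists in place of tensors. The scheme, the metric, and all function names are assumptions, not the schemes or tooling this roadmap will ship.

```python
import math


def quantize_int8(vector):
    """Symmetric per-vector int8 quantization; returns (codes, scale)."""
    absmax = max(abs(v) for v in vector)
    scale = absmax / 127.0 if absmax > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in vector]
    return codes, scale


def dequantize(codes, scale):
    """Map int8 codes back to floats."""
    return [c * scale for c in codes]


def attention_weights(query, keys):
    """Softmax over scaled dot products of one query against cached keys."""
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(len(query)) for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def max_attention_drift(query, keys):
    """Largest absolute change in any attention weight after quantizing the keys."""
    quantized_keys = [dequantize(*quantize_int8(key)) for key in keys]
    reference = attention_weights(query, keys)
    approx = attention_weights(query, quantized_keys)
    return max(abs(r - a) for r, a in zip(reference, approx))
```

A real validation pass would measure task-level accuracy rather than a single query, but the same reference-versus-quantized comparison is the core of the check.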
Next (Q4 2026)
Hugging Face Transformers v5.0 (universal serving path) — modelopt → export → Transformers v5.0 serving, opening a third deployment path alongside vLLM and SGLang.
vLLM / SGLang sparsity and SpecDec loading — 2:4 weight sparsity, sparse attention, and SpecDec checkpoints load natively with full recipe metadata; no partner glue code required.
NVIDIA-NeMo/Megatron-Bridge — ModelOpt embedded in NeMo training and release pipelines.
NeMo-RL integration — quantization-aware distillation (QAD) integrated into the NeMo-RL post-training loop, so RL-trained models retain accuracy when quantized to NVFP4. Removes the post-hoc PTQ accuracy gap that today's RL-fine-tuned models incur on deployment.
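For readers unfamiliar with the 2:4 weight sparsity pattern mentioned in the vLLM / SGLang item above, the sketch below shows the pattern itself: in every group of four consecutive weights, at most two are nonzero. It uses simple magnitude-based selection on a flat list. The function name is hypothetical and the selection rule is one common choice, not necessarily the one ModelOpt uses.

```python
def prune_2_of_4(row):
    """Zero the two smallest-magnitude weights in every group of four."""
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    pruned = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Indices of the two largest magnitudes; ties resolve to the earlier index
        # because sorted() is stable.
        keep = sorted(range(4), key=lambda i: -abs(group[i]))[:2]
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return pruned


row = [0.1, -0.9, 0.4, 0.05, 1.0, 1.0, 1.0, 1.0]
print(prune_2_of_4(row))  # [0.0, -0.9, 0.4, 0.0, 1.0, 1.0, 0.0, 0.0]
```

Because the pattern is fixed per group, a runtime only needs two values plus small position indices per group, which is why native loading with recipe metadata matters for deployment.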
4. Beyond LLM — diffusion, AR+diffusion, video, image, world model, multimodal, Auto-VLA
Now (Q2 – Q3 2026)
Video/Image/World Model optimization — quantization, sparsity, and distillation recipes in ModelOpt for video diffusion models.
VLM PTQ — recipes and workflow supported for VLM families; pluggable image-text calibration.
VLA PTQ — recipes and workflow to quantize Auto-VLA models (Alpamayo).
Next (Q4 2026)
Video/Image/World Model — expanded recipes
VLM — expanded recipes
VLA — expanded recipes
VLA QAT/QAD — recipes for production VLM family deployment.
[Experimental] VLA pruning / distillation
5. Agentic optimization workflows
Reusable agent skills that triage pipeline failures, onboard new models, and run PTQ → deployment → evaluation autonomously.
Now (Q2 – Q3 2026)
Agent PTQ skill — decision-tree routing, remote execution, Slurm / Docker support, unlisted-model handling.
Agent Eval skill — auto-detection of quantization + benchmark recommendations.
Agent Deployment skill — vLLM / SGLang / TRT-LLM OpenAI-compatible endpoint generation.
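As a sketch of what "auto-detection of quantization" in the Eval skill could look like, the snippet below inspects a checkpoint config that has already been loaded into a dict. It relies on the `quantization_config` / `quant_method` keys used in Hugging Face `config.json` files. Whether the skill reads these keys is an assumption, and the recommendation table with its benchmark names is a placeholder.

```python
def detect_quantization(config):
    """Return the quantization method named in a checkpoint config, or None."""
    quant_config = config.get("quantization_config")
    if not isinstance(quant_config, dict):
        return None
    return quant_config.get("quant_method")


def recommend_benchmarks(config, table, default):
    """Look up benchmark suggestions for the detected method (hypothetical routing)."""
    method = detect_quantization(config)
    return table.get(method, default)


# Placeholder table; method and benchmark names are illustrative only.
table = {"example-method": ["benchmark-a", "benchmark-b"]}
config = {"quantization_config": {"quant_method": "example-method"}}
print(recommend_benchmarks(config, table, default=["benchmark-a"]))
```

Keeping detection separate from routing lets the same detection step feed the Deployment skill as well.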
Next (Q4 2026)
Deployment debug loop — automated triage based on observed failures.
End-to-end agentic workflow documentation with CI guide and cross-skill references.
ModelOpt launcher — standalone CLI for multi-step optimization pipelines, decoupled from the full NeMo stack.
Questions, comments, and feature requests welcome in this thread. Specific items are tracked as separate GitHub issues with the roadmap label.