feat(pt): add custom save behaviors by OutisLi · Pull Request #5589 · deepmodeling/deepmd-kit

OutisLi · 2026-06-25T15:37:16Z

Summary by CodeRabbit

New Features
- Added configurable checkpoint output locations for training and best-validation saves.
- Added checkpoint retention by ratio, with automatic rounding and minimum retention safeguards.
Bug Fixes
- Updated checkpoint path handling so periodic and EMA checkpoints are saved consistently, including when using a custom save directory.
- Best-checkpoint files now land in the configured validation directory instead of the default location.
Documentation
- Expanded training docs and example config with the new checkpoint settings.

coderabbitai · 2026-06-25T15:46:58Z

Walkthrough

Checkpoint retention, checkpoint directory selection, and checkpoint save-path handling are updated for PyTorch training and validation. New configuration fields, helper functions, and tests cover save directory, best-checkpoint directory, and ratio-based retention behavior.

Changes

Checkpoint path and retention updates

| Layer | File(s) | Summary |
| --- | --- | --- |
| Utilities and config contract | `deepmd/pt/train/utils.py`, `deepmd/utils/argcheck.py`, `doc/train/training-advanced.md`, `examples/water/dpa4/input.json`, `source/tests/pt/test_train_utils.py` | New checkpoint helpers and training config fields are added for `save_dir` and `ckpt_keep_ratio`, with docs, an example config, and helper tests updated. |
| Trainer retention setup | `deepmd/pt/train/training.py`, `source/tests/pt/test_training.py` | `Trainer` now resolves `save_dir`, derives the keep count from `ckpt_keep_ratio` after `num_steps` is known, and updates regular and EMA retention limits. |
| Best checkpoint directory wiring | `deepmd/pt/train/training.py`, `deepmd/pt/train/validation.py`, `deepmd/utils/argcheck.py`, `examples/water/dpa4/input.json`, `source/tests/pt/test_training.py`, `source/tests/pt/test_validation.py` | `resolve_best_checkpoint_dir` is used for full-validation checkpoint directories, `FullValidator` creates the directory during initialization, and tests cover custom best-checkpoint locations. |
| Checkpoint save paths and symlinks | `deepmd/pt/train/training.py`, `source/tests/pt/test_training.py` | Periodic, final, and zero-step checkpoint writes now use `latest_checkpoint_path(..., save_dir)`, and tests verify the resulting files and symlinks. |

Sequence Diagram(s)

sequenceDiagram
  participant Trainer
  participant latest_checkpoint_path
  participant save_dir
  participant checkpoint_file
  Trainer->>latest_checkpoint_path: resolve prefix, step, and save_dir
  latest_checkpoint_path-->>Trainer: checkpoint path
  Trainer->>save_dir: write periodic checkpoint file
  Trainer->>checkpoint_file: update pointer to the resolved path
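
The path-resolution step in the diagram above can be sketched as follows. This is a minimal illustration, not the PR's implementation: `sketch_checkpoint_path` and `demo_periodic_save` are hypothetical names, and the `{prefix}-{step}.pt` naming is taken from the test assertion quoted later in this thread (`save_dir / f"{prefix}-4.pt"`). The behavior when no `save_dir` is given is an assumption.

```python
import tempfile
from pathlib import Path
from typing import Optional


def sketch_checkpoint_path(prefix: str, step: int, save_dir: Optional[Path] = None) -> Path:
    """Hypothetical stand-in for latest_checkpoint_path (not the PR's code)."""
    filename = f"{prefix}-{step}.pt"
    # Assumption: without save_dir the file stays relative to the working directory.
    return Path(filename) if save_dir is None else Path(save_dir) / filename


def demo_periodic_save() -> str:
    """Write one fake checkpoint into a temporary save_dir and return its file name."""
    with tempfile.TemporaryDirectory() as tmp:
        save_dir = Path(tmp) / "ckpts"
        save_dir.mkdir(parents=True, exist_ok=True)
        path = sketch_checkpoint_path("model.ckpt", 4, save_dir)
        path.write_bytes(b"fake weights")  # placeholder for the real save call
        assert path.is_file()
        return path.name
```

Keeping the path computation in one small helper means the periodic, final, and zero-step writes mentioned in the table can all share the same rule.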

sequenceDiagram
  participant Trainer
  participant resolve_best_checkpoint_dir
  participant FullValidator
  participant checkpoint_dir
  Trainer->>resolve_best_checkpoint_dir: resolve validating.save_best_dir or save_ckpt parent
  resolve_best_checkpoint_dir-->>Trainer: checkpoint_dir
  Trainer->>FullValidator: create validator with checkpoint_dir
  FullValidator->>checkpoint_dir: mkdir(parents=True, exist_ok=True)
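
A rough sketch of the fallback described in this diagram, assuming the directory is either the configured `validating.save_best_dir` or the parent of `save_ckpt`. The function names `sketch_resolve_best_dir` and `demo_create_best_dir` are hypothetical; only the `mkdir(parents=True, exist_ok=True)` call is taken directly from the diagram.

```python
import tempfile
from pathlib import Path
from typing import Optional


def sketch_resolve_best_dir(save_best_dir: Optional[str], save_ckpt: str) -> Path:
    """Hypothetical stand-in for resolve_best_checkpoint_dir (not the PR's code)."""
    if save_best_dir:
        return Path(save_best_dir)
    # Fallback from the diagram: use the parent directory of the save_ckpt prefix.
    return Path(save_ckpt).parent


def demo_create_best_dir() -> bool:
    """Create the resolved directory twice to show that the call is idempotent."""
    with tempfile.TemporaryDirectory() as tmp:
        best_dir = Path(tmp) / sketch_resolve_best_dir("runs/ckpt_best", "model.ckpt")
        best_dir.mkdir(parents=True, exist_ok=True)
        best_dir.mkdir(parents=True, exist_ok=True)  # no error on the second call
        return best_dir.is_dir()
```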

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

deepmodeling/deepmd-kit#5420: Shares the deepmd/pt/train/training.py checkpoint and EMA save/restore path changes in the same area.

Suggested labels

Python

Suggested reviewers

njzjz
iProzd
wanghan-iapcm

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title is clearly related to the main change: customizable checkpoint save and retention behavior in PyTorch training/validation. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


coderabbitai

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@deepmd/pt/train/utils.py`:
- Around line 283-284: The checkpoint retention calculation in the helper that
returns the keep count is undercounting because it ignores the final off-cadence
checkpoint written by Trainer.run() at num_steps. Update the logic around
total_periodic_ckpts/ckpt_keep_ratio so it accounts for the extra terminal
checkpoint (for example, by including the final step in the total when num_steps
is not an exact multiple of save_freq), and keep the existing max(1, ...)
safeguards intact.

In `@examples/water/dpa4/input.json`:
- Around line 123-124: The save_best_dir setting is unused in this example
because the validation path that triggers best-checkpoint saving is never
enabled. Update the input in this example by either turning on the
validating.full_validation flow so ckpt_best can be created, or remove the
save_best_dir field from the example to avoid misleading users; make the change
in the example configuration where tf32_infer and save_best_dir are defined.

In `@source/tests/pt/test_training.py`:
- Around line 967-968: Add the standard training test timeout guard to the new
validation test so it cannot hang CI; decorate
test_full_validation_save_best_dir with `@TRAINING_TEST_TIMEOUT` alongside the
existing `@patch` on FullValidator.evaluate_all_systems, matching the pattern used
by other training tests that call trainer.run().
- Around line 1228-1231: The checkpoint alias test is incorrectly asserting that
the prefix files are symlinks, which breaks on platforms where
symlink_prefix_files() falls back to copying. Update the test in the
checkpoint-saving area to validate the alias by checking that the Path resolves
to the expected target file, without requiring is_symlink(), using the existing
save_ckpt and ema_save_ckpt references so the test remains cross-platform.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 40ff16a5-1016-4465-ba48-a71de2e87b50

📥 Commits

Reviewing files that changed from the base of the PR and between 5733301 and fc780b1.

📒 Files selected for processing (9)

deepmd/pt/train/training.py
deepmd/pt/train/utils.py
deepmd/pt/train/validation.py
deepmd/utils/argcheck.py
doc/train/training-advanced.md
examples/water/dpa4/input.json
source/tests/pt/test_train_utils.py
source/tests/pt/test_training.py
source/tests/pt/test_validation.py

coderabbitai · 2026-06-25T15:47:01Z

+    total_periodic_ckpts = max(1, num_steps // save_freq)
+    return max(1, ceil(ckpt_keep_ratio * total_periodic_ckpts))


🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Count the final off-cadence checkpoint in the keep-ratio calculation.

Trainer.run() still writes a checkpoint at display_step_id == self.num_steps, so a run like num_steps=5, save_freq=2 produces 2, 4, 5. Using num_steps // save_freq counts only 2 and makes ckpt_keep_ratio evict one checkpoint too early.

Proposed fix

-    total_periodic_ckpts = max(1, num_steps // save_freq)
+    total_periodic_ckpts = max(1, (num_steps + save_freq - 1) // save_freq)
     return max(1, ceil(ckpt_keep_ratio * total_periodic_ckpts))
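
To make the arithmetic concrete, here is a small sketch that compares the floor-based count from the PR with the ceiling-based count proposed above. The function names are hypothetical, and `checkpoint_steps` only models the periodic and final writes described in this comment (it ignores any zero-step checkpoint).

```python
from math import ceil


def keep_count_floor(num_steps: int, save_freq: int, ckpt_keep_ratio: float) -> int:
    """Count as currently written in the PR: floor division misses the final save."""
    total = max(1, num_steps // save_freq)
    return max(1, ceil(ckpt_keep_ratio * total))


def keep_count_ceil(num_steps: int, save_freq: int, ckpt_keep_ratio: float) -> int:
    """Count with the proposed fix: integer ceiling division includes the final save."""
    total = max(1, (num_steps + save_freq - 1) // save_freq)
    return max(1, ceil(ckpt_keep_ratio * total))


def checkpoint_steps(num_steps: int, save_freq: int) -> list:
    """Steps at which a checkpoint is written, per the description in this review."""
    steps = list(range(save_freq, num_steps + 1, save_freq))
    if not steps or steps[-1] != num_steps:
        steps.append(num_steps)  # off-cadence final checkpoint
    return steps


print(checkpoint_steps(5, 2))       # [2, 4, 5]
print(keep_count_floor(5, 2, 0.5))  # 1
print(keep_count_ceil(5, 2, 0.5))   # 2
```

When `num_steps` is an exact multiple of `save_freq`, both variants agree, so the change only affects off-cadence runs.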


coderabbitai · 2026-06-25T15:47:01Z

+    "tf32_infer": false,
+    "save_best_dir": "ckpt_best"


📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

save_best_dir is a no-op in this example.

This file never enables validating.full_validation, so copying the example as-is will never create anything under ckpt_best. Either enable the validation path that exercises best-checkpoint saving, or drop this field from the example.
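
One way to catch this kind of no-op setting is a small config check. The sketch below is an assumption-laden illustration: `save_best_dir_is_unused` is a hypothetical helper, and it assumes both keys live under a `validating` section, as the references to `validating.save_best_dir` and `validating.full_validation` in this thread suggest.

```python
def save_best_dir_is_unused(config: dict) -> bool:
    """Return True when save_best_dir is set but full validation is disabled."""
    validating = config.get("validating", {})
    # full_validation is reported in this thread to default to false.
    full_validation = validating.get("full_validation", False)
    return "save_best_dir" in validating and not full_validation


example = {"validating": {"tf32_infer": False, "save_best_dir": "ckpt_best"}}
print(save_best_dir_is_unused(example))  # True
```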


coderabbitai · 2026-06-25T15:47:01Z

+    @patch("deepmd.pt.train.validation.FullValidator.evaluate_all_systems")
+    def test_full_validation_save_best_dir(self, mocked_eval) -> None:


🩺 Stability & Availability | 🟡 Minor | ⚡ Quick win

Add the standard timeout guard to this training test.

This new case calls trainer.run() but isn't wrapped in @TRAINING_TEST_TIMEOUT, so a regression here can hang CI instead of failing fast. As per coding guidelines, **/tests/**/*training*.py: Set training test timeouts to 60 seconds maximum for validation purposes.

Proposed fix

+    @TRAINING_TEST_TIMEOUT
     @patch("deepmd.pt.train.validation.FullValidator.evaluate_all_systems")
     def test_full_validation_save_best_dir(self, mocked_eval) -> None:
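
The definition of `TRAINING_TEST_TIMEOUT` is not shown in this thread. As a rough illustration of what a timeout guard does, the sketch below uses a hypothetical thread-based decorator, `sketch_timeout`, which raises `TimeoutError` when the wrapped call runs too long. It is not the repository's implementation.

```python
import functools
import threading
import time


def sketch_timeout(seconds: float):
    """Hypothetical timeout decorator: fail fast instead of hanging."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            outcome = {}

            def target():
                try:
                    outcome["value"] = func(*args, **kwargs)
                except BaseException as exc:  # forward any failure to the caller
                    outcome["error"] = exc

            worker = threading.Thread(target=target, daemon=True)
            worker.start()
            worker.join(seconds)
            if worker.is_alive():
                raise TimeoutError(f"{func.__name__} exceeded {seconds} s")
            if "error" in outcome:
                raise outcome["error"]
            return outcome["value"]

        return wrapper

    return decorator


@sketch_timeout(1.0)
def quick_add(a: int, b: int) -> int:
    return a + b


@sketch_timeout(0.05)
def slow() -> None:
    time.sleep(0.5)  # stands in for a training run that never finishes
```

Placing the guard as the outermost decorator, as in the proposed fix, means the time limit also covers the patched call.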


Source: Coding guidelines

coderabbitai · 2026-06-25T15:47:01Z

+        for prefix in (save_ckpt, ema_save_ckpt):
+            link = Path(f"{prefix}.pt")
+            self.assertTrue(link.is_symlink())
+            self.assertEqual(link.resolve(), (save_dir / f"{prefix}-4.pt").resolve())


🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Don’t require symlinks in this cross-platform checkpoint test.

symlink_prefix_files() copies files on Windows, so is_symlink() is false there even when the checkpoint alias is correct. That makes this new test fail on a platform the helper already supports.

Proposed fix

         for prefix in (save_ckpt, ema_save_ckpt):
             link = Path(f"{prefix}.pt")
-            self.assertTrue(link.is_symlink())
-            self.assertEqual(link.resolve(), (save_dir / f"{prefix}-4.pt").resolve())
+            target = save_dir / f"{prefix}-4.pt"
+            self.assertTrue(link.exists())
+            if os.name != "nt":
+                self.assertTrue(link.is_symlink())
+                self.assertEqual(link.resolve(), target.resolve())
+            else:
+                self.assertTrue(link.is_file())
+                self.assertEqual(link.read_bytes(), target.read_bytes())
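
A portable variant of this check can also branch on what the alias actually is rather than on the platform name. The sketch below is an illustration under assumptions: `sketch_link_or_copy` only imitates the symlink-then-copy fallback that this comment attributes to `symlink_prefix_files()`, and `alias_matches` is a hypothetical helper, not part of the PR.

```python
import shutil
import tempfile
from pathlib import Path


def sketch_link_or_copy(target: Path, link: Path) -> None:
    """Create an alias for target: a symlink where possible, otherwise a copy."""
    try:
        link.symlink_to(target.resolve())
    except (OSError, NotImplementedError):
        shutil.copyfile(target, link)  # fallback when symlinks are unavailable


def alias_matches(link: Path, target: Path) -> bool:
    """Accept either a symlink to target or a byte-identical copy of it."""
    if link.is_symlink():
        return link.resolve() == target.resolve()
    return link.is_file() and link.read_bytes() == target.read_bytes()


def demo_alias() -> bool:
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        target = root / "model.ckpt-4.pt"
        target.write_bytes(b"fake weights")
        link = root / "model.ckpt.pt"
        sketch_link_or_copy(target, link)
        return alias_matches(link, target)
```

Compared with the `os.name` branch in the proposed fix, this accepts a copy on any platform, which is a looser check; whether that trade-off is acceptable depends on how strictly the test should pin symlink behavior.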


njzjz-bot

Thanks for adding the checkpoint save directory and ratio-based retention knobs. I found a few issues worth fixing before merge:

- ckpt_keep_ratio currently under-counts when the final checkpoint is off-cadence. In resolve_keep_ckpt_count(), total_periodic_ckpts = num_steps // save_freq ignores the final checkpoint that Trainer.run() still writes when num_steps % save_freq != 0. For example, num_steps=5, save_freq=2, ckpt_keep_ratio=0.5 produces checkpoints at steps 2, 4, and 5, but the helper returns ceil(0.5 * (5 // 2)) = 1; the documented formula ceil(ckpt_keep_ratio * numb_steps / save_freq) would keep 2. Please account for the terminal checkpoint, e.g. use ceil(num_steps / save_freq) (with the existing minimum-one guard).
- The new save_best_dir in examples/water/dpa4/input.json is misleading unless full validation is enabled. Since validating.full_validation defaults to false, ckpt_best will not actually be used by this example. Either enable the full-validation flow in the example or omit save_best_dir there.
- The new test_save_dir_redirects_checkpoints_with_local_symlinks assumes Path(...).is_symlink(), but symlink_prefix_files() copies files on Windows. If these tests are expected to be portable, please avoid requiring symlinks in the assertion (or explicitly scope the test/docs to non-Windows behavior).

Reviewed by OpenClaw 2026.6.8 (model: custom-chat-jinzhezeng-group/gpt-5.5).

codecov · 2026-06-25T17:31:05Z

Codecov Report

❌ Patch coverage is 90.47619% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.28%. Comparing base (5733301) to head (fc780b1).

Files with missing lines	Patch %	Lines
deepmd/pt/train/training.py	82.60%	4 Missing ⚠️

Additional details and impacted files

@@           Coverage Diff           @@
##           master    #5589   +/-   ##
=======================================
  Coverage   82.27%   82.28%           
=======================================
  Files         887      887           
  Lines      100331   100361   +30     
  Branches     4060     4058    -2     
=======================================
+ Hits        82550    82581   +31     
+ Misses      16320    16318    -2     
- Partials     1461     1462    +1


OutisLi added 2 commits June 25, 2026 23:36

feat(pt): add save_dir to set specific ckpt saving folder

a7c1635

feat(pt): add ckpt_keep_ratio to set max_ckpt_keep automatically

fc780b1

dosubot Bot added the new feature label Jun 25, 2026

OutisLi requested review from njzjz and wanghan-iapcm June 25, 2026 15:38

github-actions Bot added Python Docs Examples labels Jun 25, 2026

coderabbitai Bot reviewed Jun 25, 2026

njzjz-bot reviewed Jun 25, 2026

feat(pt): add custom save behaviors #5589
OutisLi wants to merge 2 commits into deepmodeling:master from OutisLi:pr/save
