add DPA-ADAPT toolkit for downstream property adaptation by zhaiwenxi · Pull Request #5572 · deepmodeling/deepmd-kit

zhaiwenxi · 2026-06-22T11:27:26Z

Summary

This PR adds DPA-ADAPT, a toolkit for adapting pretrained DPA models to downstream atomistic property prediction tasks.

The new package provides a scikit-learn-style Python API and standalone CLI for fine-tuning, descriptor extraction, prediction, evaluation, cross-validation, and data preparation, without requiring users to manually write DeePMD-kit training input files.

Main changes

- Add the top-level dpa_adapt Python package.
- Add standalone CLI entry points:
  - dpa-adapt
  - dpaad
- Support multiple adaptation strategies:
  - frozen_sklearn: frozen DPA descriptors with scikit-learn regressors
  - frozen_head: train a property head on top of a frozen DPA backbone
  - finetune: end-to-end DPA fine-tuning
  - mft: multi-task fine-tuning with auxiliary energy/force training
- Add data utilities for:
  - DeepMD/npy loading and validation
  - label attachment
  - descriptor caching
  - train/test split and cross-validation
  - SMILES/formula-based conversion workflows
  - optional frame parameters via fparam.npy
- Add prediction and evaluation helpers with MAE, RMSE, and R2 reporting.
- Add documentation under doc/dpa_adapt/.
- Add a runnable QM9 HOMO-LUMO gap example under examples/dpa_adapt/.
- Add dpa-adapt optional dependencies in pyproject.toml.
- Add dedicated lightweight CI for source/tests/dpa_adapt/.
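
For readers unfamiliar with the three metrics named in the list above, here is a minimal standard-library sketch of how MAE, RMSE, and R2 are usually computed for a regression target. The function name `regression_metrics` and the returned dict layout are assumptions for illustration; they are not taken from the dpa_adapt code.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE, and R2 for two equal-length sequences of floats."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal length")
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    # R2 is undefined for a constant target; report NaN instead of dividing by zero.
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    return {"mae": mae, "rmse": math.sqrt(ss_res / n), "r2": r2}


# Hypothetical labels and predictions, purely for demonstration.
metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.0])
```

With these made-up values the sketch gives an MAE of 0.25 and an R2 of 0.9.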

Co-authored-by: zirenjin <zirenjin@umich.edu>

Summary by CodeRabbit

New Features
- Added the DPA-ADAPT toolkit with a new command-line interface for data conversion, validation, training, prediction, evaluation, and descriptor extraction.
- Introduced support for multiple adaptation workflows, including frozen-sklearn, frozen-head, fine-tuning, and multi-task training.
- Added data handling for SMILES, formulas, structures, label attachment, and condition features.
- Included a new example workflow and expanded user documentation for setup and usage.

for more information, see https://pre-commit.ci

feat: add DeePMD property tools

Add property tools

dpa_tools merge

…utput parsing
- DPAFineTuner: extract _FrozenSklearnPipeline helper; keep public API unchanged
- MFTFineTuner: defer _read_fitting_net_from_ckpt to first access
- DPATrainer._parse_test_output: single anchored regex per metric, auto-detect format

…perty metrics
- _load_labels: accept str | list[str], stack columns for multi-property
- build_sklearn_head: n_outputs param, wrap RF/Ridge with MultiOutputRegressor
- evaluate: per-property mae/rmse/r2 dict when target_key is a list
- freeze/DPAPredictor: store and load target_key as-is (str or list)
- CLI: --target-key homo,lumo parsed via _maybe_split_list
- 6 new tests covering fit, evaluate, freeze/load round-trip
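
The str-or-list handling of target_key described in this commit can be sketched roughly as follows. This is a guess at the behavior, not the PR's code: `split_target_key` is a hypothetical stand-in for `_maybe_split_list`, and `mae_report` only illustrates the "single value for one key, per-property dict for a list" convention using MAE.

```python
def split_target_key(value):
    """Hypothetical stand-in for _maybe_split_list: 'homo,lumo' -> ['homo', 'lumo']."""
    parts = [p.strip() for p in value.split(",") if p.strip()]
    if not parts:
        raise ValueError("target key must not be empty")
    # A single key stays a plain string so the single-property path is unchanged.
    return parts[0] if len(parts) == 1 else parts


def mae_report(y_true, y_pred, target_key):
    """Return one float for a str key, or a {name: mae} dict for a list of keys."""
    names = [target_key] if isinstance(target_key, str) else list(target_key)
    report = {}
    for col, name in enumerate(names):
        diffs = [abs(t[col] - p[col]) for t, p in zip(y_true, y_pred)]
        report[name] = sum(diffs) / len(diffs)
    return report[names[0]] if isinstance(target_key, str) else report


# Made-up labels: one row per frame, one column per property.
keys = split_target_key("homo,lumo")
single = split_target_key("homo")
report = mae_report([[1.0, 2.0], [3.0, 4.0]], [[1.5, 2.0], [3.0, 3.0]], keys)
```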

The old _load_descriptor_model, _validate_type_map, _remap_atom_types, _extract_features_cached, and _extract_features method bodies were left in place alongside the new thin delegators, causing CodeQL 'variable defined multiple times' warnings. Removed the old bodies; kept _extract_features_cached on DPAFineTuner directly so that test patches on DPAFineTuner._extract_features are honoured through the cache wrapper.

… method
- Replace try/except ImportError in _unwrap_multioutput with direct import (sklearn is always available when dpa_tools is loaded)
- Remove _FrozenSklearnPipeline.extract_features_cached (dead code; the caching wrapper lives on DPAFineTuner so test patches work)

The workflow still referenced the deleted deepmd_property_tools/ directory. Updated paths trigger to deepmd/dpa_tools/** and test command to source/tests/dpa_tools/. Added torch to lightweight dependencies.

numpy 2.3+ requires Python>=3.11, but the property_tools_tests workflow runs on Python 3.10. Pin numpy>=1.21,<2.2 to keep the lightweight dependency install working on older Python.

refactor: unify dpa_tools CLI/API and merge deepmd_property_tools

The standalone dpa-adapt CLI mixed print() with the existing _LOG logger; ruff's T201 ("print found") flagged 26 print() calls. Route all output through _LOG (info/warning/error) to match the handlers that already use it and the project-wide ban on print().
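
As a rough illustration of the change described above, the sketch below routes CLI messages through a module-level logger rather than print(). The logger name, the `report_split` function, and the message wording are assumptions; only the `_LOG` naming and the info/warning levels come from the commit message.

```python
import io
import logging

_LOG = logging.getLogger("dpa_adapt.cli")  # logger name is an assumption


def configure_cli_logging(stream=None, level=logging.INFO):
    """Attach one stream handler so CLI users still see messages on the console."""
    handler = logging.StreamHandler(stream)  # stream=None means sys.stderr
    handler.setFormatter(logging.Formatter("%(levelname)s: %(message)s"))
    _LOG.addHandler(handler)
    _LOG.setLevel(level)


def report_split(n_train, n_test):
    """Report split sizes through the logger instead of print()."""
    if n_test == 0:
        _LOG.warning("Test split is empty; metrics will not be computed")
    _LOG.info("Split: %d train / %d test frames", n_train, n_test)


# Capture the output in memory for the demonstration.
stream = io.StringIO()
configure_cli_logging(stream=stream)
report_split(8, 0)
output = stream.getvalue()
```

Using lazy `%`-style arguments keeps formatting cost out of disabled log levels, which print() calls cannot do.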

Organize DPA-ADAPT docs navigation

Fix DPA adapt cache and CLI edge cases

style(dpa-adapt): route CLI output through logger to satisfy ruff T201

Fix dpa_adapt pre-commit hook failures

…k accumulator

Add type annotations across the dpa_adapt library (ANN), replace print() with logging in example scripts (T201), annotate mutable class defaults with ClassVar (RUF012), replace legacy np.random.rand with Generator API (NPY002), escape regex metacharacters in pytest match patterns (RUF043), fix implicit Optional annotations (RUF013), ambiguous unicode (RUF002), zip() strict= (B905), docstring formatting (D301/D400), dict() → literal (C408), and TC003 import placement.

Also fix _DescriptorExtraction._resolve_descriptor_hook_model to prefer atomic_model over dp_model. dp_model delegates set_eval_descriptor_hook and eval_descriptor to atomic_model but lacks the eval_descriptor_list attribute, so _clear_accumulator was a no-op. Descriptors from systems with different atom counts accumulated across forward passes, causing torch.concat to fail with "Expected size 5 but got size 4".
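
The accumulator bug described above can be reproduced in miniature without torch. In this toy sketch every class and function name is hypothetical except the `atomic_model` and `eval_descriptor_list` attribute names, which come from the commit message; it only shows why clearing through the wrong object leaves stale blocks of a different atom count behind.

```python
class AtomicModel:
    """Toy model that owns the descriptor accumulator."""

    def __init__(self):
        self.eval_descriptor_list = []

    def forward(self, n_atoms):
        # One descriptor row per atom; the width of 3 is arbitrary.
        self.eval_descriptor_list.append([[0.0] * 3 for _ in range(n_atoms)])


class WrapperModel:
    """Toy wrapper that delegates forward() but exposes no accumulator."""

    def __init__(self, atomic_model):
        self.atomic_model = atomic_model

    def forward(self, n_atoms):
        self.atomic_model.forward(n_atoms)


def resolve_hook_model(model):
    """Prefer the inner atomic_model, which actually holds the accumulator."""
    return getattr(model, "atomic_model", model)


def clear_accumulator(model):
    """Clear the accumulator if present; return False when it is a silent no-op."""
    accumulator = getattr(model, "eval_descriptor_list", None)
    if accumulator is None:
        return False
    accumulator.clear()
    return True


def atom_counts_after_two_systems(pick_clear_target):
    """Run a 5-atom then a 4-atom system, clearing in between via the chosen target."""
    wrapper = WrapperModel(AtomicModel())
    target = pick_clear_target(wrapper)
    wrapper.forward(5)
    clear_accumulator(target)
    wrapper.forward(4)
    return [len(block) for block in wrapper.atomic_model.eval_descriptor_list]


stale = atom_counts_after_two_systems(lambda m: m)         # clears the wrapper: no-op
fixed = atom_counts_after_two_systems(resolve_hook_model)  # clears the real accumulator
```

In the stale case the list still holds a 5-atom block next to the 4-atom one, which is the kind of size mismatch a concatenation would reject.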

fix(dpa-adapt): resolve pre-commit ruff errors and descriptor hook accumulator

The disallow-caps pre-commit hook flags "DeepMD" as improper capitalization; the official project name is DeePMD.

+    - boxes      : (n_frames, 9) or None for non-periodic
+    - atom_types : (n_atoms,) int
+


+        np.save(cache_path, descriptors)
+        _LOG.info("Cached descriptors to %s", cache_path)
+


+    cache: bool = True,
+    type_map: list[str] | tuple[str, ...] | None = None,
+) -> np.ndarray:


fix: address CodeQL scan findings
- File not closed: wrap open() in with statement (test_trainer.py, test_mft_evaluate.py)
- Unused variable: remove dead n_total, n_splits assignments (test_split_cv.py, cv.py)
- Self-import: remove circular import from self (test_validate.py)
- Unused import: replace import rdkit with importlib.util.find_spec (test_auto_convert.py)
- Empty except: add explanatory comments where exceptions are intentionally suppressed (predictor.py, mft.py, finetuner.py, smiles.py)
- Statement has no effect: replace ... with pass (test_backend_contract.py)
- Mixed import styles: use consistent from-import for module (test_finetuner_strategies.py)
- Cyclic import: add comments explaining lazy import pattern (finetuner.py)

Move load_or_extract() and ensure_per_system_cache() from dpa_adapt.data.desc_cache to dpa_adapt.finetuner. Those two functions need DPAFineTuner, while finetuner imports cache helpers from desc_cache, creating a CodeQL-flagged cyclic import. desc_cache.py now contains only pure cache-path and fingerprint helpers; the extraction+backfill functions live next to DPAFineTuner in finetuner.py. Updated imports in cv.py and test_cache.py accordingly.

fix: address CodeQL scan findings

for more information, see https://pre-commit.ci

+
+    def test_deterministic_folds_same_result_twice(self, tmp_path, monkeypatch):
+        formulas = [f"Comp{i}" for i in range(4)]
+        systems = _write_oer_tree(str(tmp_path), formulas, nsets=2, label_key="energy")


test_deterministic_folds_same_result_twice unconditionally skips (needs a real DPA checkpoint), but still ran stub setup code whose `systems` local was flagged by CodeQL as unused. Reduce it to a bare skip, matching the sibling test_manifest_folds.

- input_formats: drop the manual "1./2./3." heading prefixes that doubled with Sphinx auto-numbering (e.g. "9.2.1. 1. SMILES Tables (CSV)").
- conf.py: cap auto-generated CLI reference section numbering at depth 5 via a doctree-resolved hook, so sphinx-argparse's deep subcommand nesting no longer renders numbers like "9.3.3.6.3.1.1.". Scoped to the dpa_adapt/cli page only; other pages and the global TOC are untouched.

Incorporate zhaiwenxi's CodeQL-fix PR (#49). Conflict resolution:
- finetuner.py: keep our structural fix for the desc_cache <-> finetuner import cycle (load_or_extract / ensure_per_system_cache now live in finetuner.py) rather than their lazy `from dpa_adapt.data.desc_cache import load_or_extract`, which would ImportError since that symbol moved. Their swallowed-exception comments on the cache read/write paths are kept (auto-merged).
- test_split_cv.py: keep our bare-skip stub, which fully removes the unused `systems` flagged by CodeQL; their variant deleted only the rng/n_total lines and left `systems` assigned-but-unused.

cv.py merged cleanly: their unused-`n_splits` removal plus our redirect of the ensure_per_system_cache import to dpa_adapt.finetuner.

njzjz-bot

Thanks for the latest round of cleanup. I re-checked the current head (6d1536952eab04be32da0d7c3d43ae5984b6ece3): the packaging metadata, docs navigation, reduced example dataset, descriptor cache identity, centralized dp lookup, type-map remapping, and CodeQL fixes are all in much better shape now. The current CodeQL and main CI checks are green.

I still see a few blockers before this is safe to merge:

The public DPAFineTuner(strategy="mft") path still breaks default MFT type-map auto-detection. DPAFineTuner.__init__ converts an omitted type_map from None to [], then _ensure_mft() passes that empty list into MFTFineTuner. However, MFTFineTuner only auto-detects the checkpoint type map when self.type_map is None; [] is treated as user-provided. As a result, the normal public/CLI path without --mft-type-map can either fail validation as “provided type_map is missing elements”, or emit an empty shared type_map if data loading is deferred/failed. Please preserve None for the MFT delegate, or make MFT treat an empty list like omitted.
MFT training still has inconsistent checkpoint locations. fit() writes mft_input.json under output_dir, but launches dp --pt train without cwd=self.output_dir; the generated MFT config also does not set training.save_ckpt. Later _freeze_ckpt() expects model.ckpt-<max_steps>.pt under self.output_dir. This means training can succeed while freeze/evaluate/predict fails because the checkpoint was written to the caller’s current directory/default location. Please either restore running dp train with cwd=output_dir, or add an explicit save_ckpt under output_dir (ideally both), with a regression test.
There is still a stray root-level test file. tests/test_dpa_tools.py remains tracked at the repository root. The main Python workflow only runs pytest ... source/tests, so this file is not covered by CI; the sdist excludes remove /source/tests, /doc, /examples, etc., but not /tests, so this test may also leak into source distributions. Please move it under source/tests/dpa_adapt/ if it is still needed, or delete it if it is superseded by the existing DPA-ADAPT tests.

One smaller validation gap: DPATrainer._validate_fparam() indexes shape[1] directly, so a malformed 1-D fparam.npy raises a raw IndexError instead of the intended DPADataError; it should check ndim first and preferably also preflight row count against coord.npy.

Authored by OpenClaw 2026.6.8 (844f405) (model: custom-chat-jinzhezeng-group/gpt-5.5)

njzjz-bot

Adding inline versions of the current blockers so they are easier to fix at the relevant lines.

— OpenClaw 2026.6.8 (844f405), model: custom-chat-jinzhezeng-group/gpt-5.5

njzjz-bot · 2026-06-26T10:18:16Z

+        self._mft = None
+
+        # ---- backward-compat state mirrors (delegated to pipeline) ----
+        if self.type_map is None:


This None → [] normalization breaks the MFT default path. For strategy="mft", _ensure_mft() passes self.type_map into MFTFineTuner; [] is treated as user-provided, so _validate_and_resolve_type_map() skips checkpoint auto-detection and validates against an empty map. Please preserve None for MFT, or normalize only inside the frozen-sklearn path after dispatch.

— OpenClaw 2026.6.8 (844f405), model: custom-chat-jinzhezeng-group/gpt-5.5
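
The None versus [] distinction raised in this comment can be shown with a small sketch. Both resolver functions are hypothetical; they only contrast a strict "None means omitted" check with the lenient alternative the review suggests, under the assumption that the checkpoint's type map is available as a plain list.

```python
def resolve_strict(user_type_map, checkpoint_type_map):
    """Only None triggers auto-detection; an empty list counts as user-provided."""
    if user_type_map is None:
        return list(checkpoint_type_map)
    return list(user_type_map)


def resolve_lenient(user_type_map, checkpoint_type_map):
    """Treat an empty list like an omitted argument."""
    if not user_type_map:  # covers both None and []
        return list(checkpoint_type_map)
    return list(user_type_map)


ckpt_type_map = ["H", "C", "N", "O"]  # made-up checkpoint type map
normalized = []  # what an upstream None -> [] normalization would hand over

strict_result = resolve_strict(normalized, ckpt_type_map)
lenient_result = resolve_lenient(normalized, ckpt_type_map)
preserved_none = resolve_strict(None, ckpt_type_map)
```

Either fix named in the review works in this toy model: passing None through unchanged, or making the delegate lenient about empty lists.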

njzjz-bot · 2026-06-26T10:18:16Z

+                },
+            },
+            "numb_steps": t.max_steps,
+            "save_freq": t.save_freq,


MFT should also set an explicit checkpoint prefix, e.g. training["save_ckpt"] = os.path.join(t.output_dir, "model.ckpt") (matching DPATrainer). Right now only save_freq is emitted, so DeePMD writes model.ckpt-* relative to the process cwd while _freeze_ckpt() later looks under self.output_dir, making fit() succeed but evaluate()/predict() fail to find the checkpoint.

— OpenClaw 2026.6.8 (844f405), model: custom-chat-jinzhezeng-group/gpt-5.5
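
A minimal sketch of the suggestion above, assuming the training section is built as a plain dict. The helper names are hypothetical; the `save_ckpt` key, the `model.ckpt` prefix, and the `model.ckpt-<max_steps>.pt` file name are the ones quoted in the review.

```python
import os


def build_training_section(output_dir, max_steps, save_freq):
    """Build a training section whose checkpoint prefix is an absolute path."""
    output_dir = os.path.abspath(output_dir)
    return {
        "numb_steps": max_steps,
        "save_freq": save_freq,
        # An absolute prefix keeps checkpoints independent of the process cwd.
        "save_ckpt": os.path.join(output_dir, "model.ckpt"),
    }


def expected_checkpoint(output_dir, max_steps):
    """Path a later freeze step would look for, per the review's description."""
    return os.path.join(os.path.abspath(output_dir), f"model.ckpt-{max_steps}.pt")


training = build_training_section("runs/mft", max_steps=1000, save_freq=100)
ckpt_path = expected_checkpoint("runs/mft", 1000)
```

Because both helpers derive their paths from the same absolute output_dir, the writer and the reader of the checkpoint cannot drift apart.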

njzjz-bot · 2026-06-26T10:18:16Z

+        _LOG.info("Log: %s", log_path)
+
+        with open(log_path, "w") as log_f:
+            process = subprocess.Popen(


This subprocess still inherits the caller's cwd. Because the MFT code writes mft_input.json/train.log under output_dir and later freezes from output_dir, training should run with a deterministic cwd (or every generated path in the JSON must be absolute, especially training.save_ckpt). Otherwise users calling from another directory get checkpoints in the wrong place.

— OpenClaw 2026.6.8 (844f405), model: custom-chat-jinzhezeng-group/gpt-5.5
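
The effect of a deterministic cwd can be sketched with a stand-in child process. This is not the PR's training launcher: `run_in_output_dir` is a hypothetical helper, and the child command is a short Python one-liner that writes a relative file, standing in for a trainer that writes relative checkpoint paths.

```python
import os
import subprocess
import sys
import tempfile


def run_in_output_dir(cmd, output_dir, log_name="train.log"):
    """Run cmd with cwd=output_dir and send its output to a log file there."""
    os.makedirs(output_dir, exist_ok=True)
    log_path = os.path.join(output_dir, log_name)
    with open(log_path, "w") as log_f:
        process = subprocess.Popen(
            cmd, cwd=output_dir, stdout=log_f, stderr=subprocess.STDOUT
        )
        return_code = process.wait()
    return return_code, log_path


with tempfile.TemporaryDirectory() as tmp:
    out_dir = os.path.join(tmp, "run")
    # Stand-in for the trainer: writes a file using a relative path.
    cmd = [sys.executable, "-c", "open('model.ckpt-1.pt', 'w').close()"]
    code, _ = run_in_output_dir(cmd, out_dir)
    landed_in_output_dir = os.path.exists(os.path.join(out_dir, "model.ckpt-1.pt"))
```

Without `cwd=output_dir`, the relative file would land wherever the caller happened to be, which is the mismatch the comment describes.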

njzjz-bot · 2026-06-26T10:18:16Z

@@ -207,6 +217,7 @@ sdist.exclude = [
 ]


This exclude list still misses the repository-level /tests directory, while tests/test_dpa_tools.py is still tracked at the PR head. Since CI only runs source/tests, please either move/remove that root test or add /tests to sdist.exclude; otherwise the sdist keeps shipping an unrun, stale test tree.

— OpenClaw 2026.6.8 (844f405), model: custom-chat-jinzhezeng-group/gpt-5.5

njzjz-bot · 2026-06-26T10:18:16Z

+                        f"fparam.npy of shape (n_frames, {fparam_dim})."
+                    )
+                shape = np.load(fpath).shape
+                if shape[1] != fparam_dim:


Please guard the dimensionality before indexing shape[1]. A 1-D fparam.npy currently raises a bare IndexError here instead of the advertised DPADataError; this should check len(shape) == 2 and also verify shape[0] matches the frame count in the corresponding set.

— OpenClaw 2026.6.8 (844f405), model: custom-chat-jinzhezeng-group/gpt-5.5
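
The guard requested above could look roughly like this. It is a sketch under assumptions: `DPADataError` here is a local stand-in whose base class is a guess, `check_fparam_shape` is a hypothetical helper, and the shape tuple is assumed to come from something like `np.load(path).shape`.

```python
class DPADataError(ValueError):
    """Stand-in for the toolkit's data error; the ValueError base is an assumption."""


def check_fparam_shape(shape, fparam_dim, n_frames, path="fparam.npy"):
    """Validate an fparam array shape before any index into it is taken."""
    if len(shape) != 2:
        raise DPADataError(
            f"{path}: expected a 2-D array of shape (n_frames, {fparam_dim}), "
            f"got {len(shape)}-D shape {tuple(shape)}"
        )
    if shape[1] != fparam_dim:
        raise DPADataError(
            f"{path}: expected {fparam_dim} columns, got {shape[1]}"
        )
    if shape[0] != n_frames:
        raise DPADataError(
            f"{path}: expected {n_frames} rows to match the frame count, got {shape[0]}"
        )


def describe(shape):
    """Return 'ok' or the error text for a candidate shape (demo helper)."""
    try:
        check_fparam_shape(shape, fparam_dim=2, n_frames=5)
    except DPADataError as exc:
        return f"DPADataError: {exc}"
    return "ok"


results = {shape: describe(shape) for shape in [(5, 2), (5,), (5, 3), (4, 2)]}
```

Checking the number of dimensions first means a malformed 1-D file is reported as a data error rather than surfacing as an IndexError.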

fix(dpa-adapt): resolve pre-commit errors, desc_cache import cycle, descriptor hook accumulator

zhaiwenxi and others added 30 commits May 27, 2026 16:08

feat: add DeePMD property tools

30351e9

[pre-commit.ci] auto fixes from pre-commit.com hooks

e9fe00f

for more information, see https://pre-commit.ci

[pre-commit.ci] auto fixes from pre-commit.com hooks

db05969

for more information, see https://pre-commit.ci

Merge pull request #1 from zhaiwenxi/add-property-tools

311a620

feat: add DeePMD property tools

Add SMILES coordinate generation for property tools

05479d4

[pre-commit.ci] auto fixes from pre-commit.com hooks

4445f1d

for more information, see https://pre-commit.ci

Merge branch 'deepmodeling:master' into master

9be45cd

Merge pull request #2 from zhaiwenxi/add-property-tools

d5df6fa

Add property tools

feat: add dpa_tools as self-contained subpackage (PR 1)

52033d7

feat: add dp dpa CLI subcommand group (Branch A)

3e0c3f9

feat: centralize deepmd API calls into _backend.py chokepoint (Branch B)

ffe609c

Merge branch-b-backend (_backend.py chokepoint)

beb7b42

fix: use yield fixture for contract test hook cleanup (prevents state leak)

ab024dc

docs: add dpa_tools Python and CLI API reference

da3f26f

Merge pull request #3 from zirenjin/master

bb3c971

dpa_tools merge

feat: merge property_tools SMILES pipeline into dpa_tools

57f61bd

feat: auto-detect format in dp dpa data convert, unify SMILES+structure paths

f61f0c2

chore: remove deepmd_property_tools, migrate tests+data to dpa_tools

392a1a5

chore: rename DATA/ → demo/

871d600

docs: update README — add SMILES pipeline, auto_convert, demo data

8a8ec93

refactor: fold mft into fit --strategy mft, batch-convert into convert, unify --target-key

ae78fea

docs: update README for refactored CLI and API (mft→fit, batch-convert→convert)

fbfb5a0

feat: auto-download built-in pretrained models via resolve_pretrained_path

5bb1b53

fix: update property_tools_tests CI after migration to dpa_tools

217868c

fix: pin numpy<2.2 in lightweight CI for Python 3.10 compat

3b1ed2c

Merge pull request #4 from zirenjin/master

93b2c5d

refactor: unify dpa_tools CLI/API and merge deepmd_property_tools

zirenjin and others added 16 commits June 25, 2026 11:20

Merge pull request #44 from zhaiwenxi/fix-mft-aux-prob-validation

564b077

Organize DPA-ADAPT docs navigation

Fix DPA adapt cache and CLI edge cases

f5f0bac

Merge pull request #46 from zhaiwenxi/fix/dpa-adapt-cache-cli

e4d37c6

Fix DPA adapt cache and CLI edge cases

Merge pull request #45 from zirenjin/master

651edda

style(dpa-adapt): route CLI output through logger to satisfy ruff T201

Fix dpa_adapt pre-commit hook failures

d6f3d70

Merge branch 'deepmodeling:master' into master

8e1493b

Merge pull request #47 from zhaiwenxi/fix/pre-commit-hooks

e3b6126

Fix dpa_adapt pre-commit hook failures

Merge branch 'deepmodeling:master' into master

bc1fddd

Merge pull request #48 from zirenjin/master

dd91555

fix(dpa-adapt): resolve pre-commit ruff errors and descriptor hook accumulator

style(dpa-adapt): fix DeepMD → DeePMD capitalization

2c6bcfd

Fix dpa_adapt pre-commit formatting

63a2902

Merge branch 'zhaiwenxi:master' into master

71f8bb2

Merge branch 'deepmodeling:master' into master

659d090

Fix remaining dpa_adapt capitalization checks

ebb2f58

github-advanced-security AI found potential problems Jun 26, 2026

zhaiwenxi mentioned this pull request Jun 26, 2026

fix: address CodeQL scan findings zhaiwenxi/deepmd-kit#49

Merged

zirenjin and others added 4 commits June 26, 2026 11:51

Merge branch 'zhaiwenxi:master' into master

289bb96

Merge pull request #49 from zhaiwenxi/fix/codeql-issues

6b5277f

fix: address CodeQL scan findings

[pre-commit.ci] auto fixes from pre-commit.com hooks

6d15369

for more information, see https://pre-commit.ci

github-advanced-security AI found potential problems Jun 26, 2026

zirenjin added 3 commits June 26, 2026 16:32

njzjz-bot suggested changes Jun 26, 2026

njzjz-bot reviewed Jun 26, 2026

Merge pull request #50 from zirenjin/master

d4cad11

fix(dpa-adapt): resolve pre-commit errors, desc_cache import cycle, descriptor hook accumulator

add DPA-ADAPT toolkit for downstream property adaptation #5572
zhaiwenxi wants to merge 212 commits into deepmodeling:master from zhaiwenxi:master

7 participants
