[Feature] Switch from numpy void() to frombuffer() by jan-janssen · Pull Request #984 · pyiron/executorlib

jan-janssen · 2026-05-08T09:43:58Z

np.void is a fixed-size raw byte scalar. Its size is stored internally using NumPy’s npy_intp (a Py_ssize_t-like type), but in practice some NumPy scalar/array paths still hit a signed 32-bit limit of 2**31 - 1 bytes (~2 GiB) for a single element or scalar buffer. So yes: the limit you are hitting is plausibly on the order of 2 GB, and it is not a cloudpickle limit.
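To make the difference between the two representations concrete, here is a minimal sketch that wraps the same payload both ways. It uses the standard-library pickle as a stand-in for cloudpickle (an assumption, only to keep the sketch dependency-light), and the payload is tiny, so it illustrates the layout rather than reproducing the ~2 GiB failure.

```python
import pickle

import numpy as np

# Stand-in payload; the PR pickles with cloudpickle, pickle is used here
# only so the sketch needs nothing beyond NumPy.
blob = pickle.dumps({"values": list(range(10))})

as_void = np.void(blob)                         # one scalar holding every byte
as_bytes = np.frombuffer(blob, dtype=np.uint8)  # 1-D array, one byte per element

# The void scalar has no shape and a single item as large as the payload,
# while the uint8 array spreads the payload over len(blob) one-byte items.
print(as_void.shape, as_void.dtype.itemsize)
print(as_bytes.shape, as_bytes.dtype.itemsize)

# Both representations hand the original bytes back unchanged.
assert as_void.tobytes() == blob
assert as_bytes.tobytes() == blob
```

With the array form, the per-element size stays at one byte no matter how large the pickle grows; only the array length changes.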

For large pickles, don’t store the whole pickle as one np.void. Store it as a byte array dataset instead:

import cloudpickle
import numpy as np
import h5py

obj = ...
blob = cloudpickle.dumps(obj)

with h5py.File("x.h5", "w") as f:
    f.create_dataset(
        "pickle",
        data=np.frombuffer(blob, dtype=np.uint8),
        compression="gzip",  # optional
    )

Read it back:

with h5py.File("x.h5", "r") as f:
    blob = f["pickle"][()].tobytes()

obj = cloudpickle.loads(blob)

Summary by CodeRabbit

Bug Fixes
- Improved handling of missing optional data groups so operations succeed with sensible defaults and errors propagate correctly when required data is absent.
Chores
- Switched persisted payloads to a more compact, compressed binary storage format for safer, more reliable read/write across all retrieval paths.
Tests
- Added backwards-compatibility tests for persistence, recovery, queue IDs and error propagation to ensure older cache files remain supported.

coderabbitai · 2026-05-08T09:44:11Z

Walkthrough

This PR changes HDF5 persistence to store cloudpickle byte streams as gzip-compressed uint8 arrays and updates all read paths to reconstruct objects via hdf['/key'][()].tobytes() → cloudpickle.loads(); load() also applies explicit defaults for missing optional groups. Tests for backwards compatibility are added.

Changes

HDF5 Serialization Format Upgrade

Layer / File(s)	Summary
HDF5 Dump Serialization `src/executorlib/standalone/hdf.py`	`dump()` now encodes mapped groups as gzip-compressed `uint8` byte arrays from `cloudpickle.dumps` instead of `np.void(...)`.
HDF5 Load Deserialization `src/executorlib/standalone/hdf.py`	`load()` deserializes datasets using `hdf["/key"][()].tobytes()` → `cloudpickle.loads()`, sets defaults for optional fields (`args`, `kwargs`, `resource_dict`, `error_log_file`) when missing, and still raises `TypeError` if `function` is absent.
Accessor Function Deserialization `src/executorlib/standalone/hdf.py`	`get_output()`, `get_runtime()`, `get_queue_id()`, and `_get_content_of_file()` use `[()].tobytes()` + `cloudpickle.loads()` to read individual components; absence semantics preserved (e.g., runtime → `0.0`).
Backwards-compatibility tests `tests/unit/standalone/test_hdf_backwards.py`	New tests write pickled values into HDF5 and verify load/get_output/get_runtime/get_queue_id/get_future_from_cache across multiple cache scenarios and missing-field cases.
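The load behaviour summarized in the table (defaults for missing optional groups, `TypeError` when `function` is absent) could look roughly like the sketch below. This is not the code from `hdf.py`: a plain dict stands in for the open HDF5 file, the standard-library pickle stands in for cloudpickle, and the helper names (`load_sketch`, `_encode`, `_decode`), the dataset names other than `function` and `input_args`, the error message, and the concrete default values are all assumptions made for illustration.

```python
import pickle
from typing import Any, Mapping

import numpy as np


def _encode(value: Any) -> np.ndarray:
    # Hypothetical helper: pickle bytes viewed as a 1-D uint8 array.
    return np.frombuffer(pickle.dumps(value), dtype=np.uint8)


def _decode(stored: Any) -> Any:
    # Hypothetical helper: rebuild the object from the stored bytes.
    return pickle.loads(stored.tobytes())


def load_sketch(hdf: Mapping[str, Any]) -> dict:
    """Stand-in for load(); `hdf` maps dataset names to stored values."""
    if "function" not in hdf:
        # The summary states that a missing function still raises TypeError;
        # the message text here is made up.
        raise TypeError("function not found in file")
    data = {"fn": _decode(hdf["function"])}
    # Assumed dataset names and assumed defaults for the optional fields.
    optional = {
        "args": ("input_args", ()),
        "kwargs": ("input_kwargs", {}),
        "resource_dict": ("resource_dict", {}),
        "error_log_file": ("error_log_file", None),
    }
    for field, (key, default) in optional.items():
        data[field] = _decode(hdf[key]) if key in hdf else default
    return data


# Only the function and positional arguments are present in this fake file.
fake_file = {"function": _encode(sum), "input_args": _encode(([1, 2, 3],))}
loaded = load_sketch(fake_file)
result = loaded["fn"](*loaded["args"], **loaded["kwargs"])
print(result)
```

Because the optional fields fall back to defaults, the caller can invoke `fn` the same way whether or not `kwargs` was ever written.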

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

pyiron/executorlib#740: Tests and changes touching missing-key load semantics and accessor behavior in executorlib.standalone.hdf.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 35.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly and specifically describes the main change: switching from numpy void() to frombuffer() for HDF5 dataset serialization, which is reflected in the code changes in hdf.py and backwards compatibility tests.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

codecov · 2026-05-08T09:48:14Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 94.24%. Comparing base (de44ab0) to head (86cccd0).

Additional details and impacted files

@@           Coverage Diff           @@
##             main     #984   +/-   ##
=======================================
  Coverage   94.24%   94.24%           
=======================================
  Files          39       39           
  Lines        2119     2119           
=======================================
  Hits         1997     1997           
  Misses        122      122

coderabbitai

🧹 Nitpick comments (1)

src/executorlib/standalone/hdf.py (1)

39-43: ⚡ Quick win

Extract shared pickle/HDF encode-decode helpers to prevent format drift.

The serialization/deserialization expression is duplicated across many sites; centralizing it will make future format changes safer.

Proposed refactor

@@
 import cloudpickle
 import h5py
 import numpy as np
@@
+def _serialize_to_uint8(value: Any) -> np.ndarray:
+    return np.frombuffer(cloudpickle.dumps(value), dtype=np.uint8)
+
+
+def _deserialize_from_key(hdf: h5py.File, key: str) -> Any:
+    return cloudpickle.loads(hdf[f"/{key}"][()].tobytes())
+
+
 def dump(file_name: Optional[str], data_dict: dict) -> None:
@@
                     fname.create_dataset(
                         name="/" + group_dict[data_key],
-                        data=np.frombuffer(
-                            cloudpickle.dumps(data_value), dtype=np.uint8
-                        ),
+                        data=_serialize_to_uint8(data_value),
                         compression="gzip",
                     )
@@
-            data_dict["fn"] = cloudpickle.loads(hdf["/function"][()].tobytes())
+            data_dict["fn"] = _deserialize_from_key(hdf, "function")
@@
-            data_dict["args"] = cloudpickle.loads(hdf["/input_args"][()].tobytes())
+            data_dict["args"] = _deserialize_from_key(hdf, "input_args")

Also applies to: 59-79, 97-99, 126-126, 144-144, 230-230

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/executorlib/standalone/hdf.py` around lines 39 - 43, Replace the repeated
serialization/deserialization expression (cloudpickle.dumps(...) wrapped with
np.frombuffer(..., dtype=np.uint8) and compression="gzip") with two shared
helpers—e.g., encode_pickle_for_hdf(obj) that returns the uint8 ndarray ready to
write to HDF and returns any needed metadata, and
decode_pickle_from_hdf(uint8_array) that calls cloudpickle.loads on the buffer
when reading; update every site currently doing cloudpickle.dumps +
np.frombuffer + compression="gzip" (the duplicated expression) to call
encode_pickle_for_hdf when writing and decode_pickle_from_hdf when reading so
all places use a single implementation and a single compression/format contract.
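A minimal sketch of the two helpers named in the prompt above might look as follows. The function names come from the prompt; everything else is an assumption: the standard-library pickle stands in for cloudpickle, compression is left to the `create_dataset` call site, and the decode side relies on the stored value exposing `tobytes()`, which holds for both a `uint8` array and an `np.void` scalar in NumPy. Whether a legacy dataset read through h5py arrives as an `np.void` scalar is assumed here, not verified.

```python
import pickle
from typing import Any

import numpy as np


def encode_pickle_for_hdf(obj: Any) -> np.ndarray:
    """Pickle obj and expose the bytes as a read-only 1-D uint8 array."""
    return np.frombuffer(pickle.dumps(obj), dtype=np.uint8)


def decode_pickle_from_hdf(stored: Any) -> Any:
    """Unpickle a stored value (uint8 array or legacy np.void scalar)."""
    return pickle.loads(stored.tobytes())


payload = {"queue_id": 42, "runtime": 0.5}

new_style = encode_pickle_for_hdf(payload)     # format written after this PR
legacy_style = np.void(pickle.dumps(payload))  # format written before this PR

# One decode path covers both on-disk representations.
assert decode_pickle_from_hdf(new_style) == payload
assert decode_pickle_from_hdf(legacy_style) == payload
```

Keeping the decode helper agnostic to the container type is one way to let older cache files keep working without a separate legacy branch.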

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 4d5ec173-654e-4ee6-90a8-c48f7fcbfa04

📥 Commits

Reviewing files that changed from the base of the PR and between 4a63cb1 and c65a65b.

📒 Files selected for processing (1)

src/executorlib/standalone/hdf.py

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/unit/standalone/test_hdf_backwards.py`:
- Around line 76-86: The test creates a cache file named via file_name =
os.path.join(cache_directory, "test_mixed_i.h5") but later calls
get_future_from_cache(..., cache_key="test_mixed"), causing a mismatch; update
either the dump call or the cache_key so they match (e.g. change the filename
stem to "test_mixed.h5" or change cache_key to "test_mixed_i") so dump(...) and
get_future_from_cache(...) refer to the same cache key/file when using the dump
and get_future_from_cache functions.
- Around line 34-38: The code is suppressing all ValueError around
fname.create_dataset which can hide real serialization/write errors; instead,
check for an existing dataset name before creating to only skip duplicates:
construct the target name as name = "/" + group_dict[data_key], then if name not
in fname call fname.create_dataset(name=name,
data=np.void(cloudpickle.dumps(data_value))); otherwise skip (or optionally log)
so only true duplicate cases are avoided and other ValueErrors still surface.
Reference: contextlib.suppress, fname.create_dataset, np.void,
cloudpickle.dumps, group_dict, data_key.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3d10f469-e13f-4af7-bf87-8a8d3a8b113e

📥 Commits

Reviewing files that changed from the base of the PR and between c65a65b and 86cccd0.

📒 Files selected for processing (2)

src/executorlib/standalone/hdf.py
tests/unit/standalone/test_hdf_backwards.py

🚧 Files skipped from review as they are similar to previous changes (1)

src/executorlib/standalone/hdf.py

coderabbitai · 2026-06-12T15:56:58Z

+                        with contextlib.suppress(ValueError):
+                            fname.create_dataset(
+                                name="/" + group_dict[data_key],
+                                data=np.void(cloudpickle.dumps(data_value)),
+                            )


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Avoid suppressing all ValueError in legacy dump helper; it can hide real write/serialization failures.

Line 34 currently suppresses any ValueError, not just duplicate-dataset cases. In these compatibility tests, that can mask broken fixture creation and produce misleading downstream failures.

Suggested fix

-                        with contextlib.suppress(ValueError):
-                            fname.create_dataset(
-                                name="/" + group_dict[data_key],
-                                data=np.void(cloudpickle.dumps(data_value)),
-                            )
+                        dataset_name = "/" + group_dict[data_key]
+                        if dataset_name in fname:
+                            continue
+                        fname.create_dataset(
+                            name=dataset_name,
+                            data=np.void(cloudpickle.dumps(data_value)),
+                        )
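The difference between blanket suppression and an explicit existence check can be illustrated without touching HDF5 at all. In the sketch below, `FakeFile` is a hypothetical stand-in for an open file object, not the h5py API; it raises `ValueError` both for duplicate names and for an unrelated bad-data case, which is the situation the review comment is worried about. `create_if_missing` is likewise an illustrative name.

```python
import contextlib


class FakeFile:
    """Hypothetical stand-in for an open HDF5 file; not the real h5py API."""

    def __init__(self):
        self._datasets = {}

    def __contains__(self, name):
        return name in self._datasets

    def create_dataset(self, name, data):
        if name in self._datasets:
            raise ValueError(f"name already exists: {name}")
        if data is None:
            # Models an unrelated write/serialization failure.
            raise ValueError("cannot store None")
        self._datasets[name] = data


fname = FakeFile()

# Blanket suppression: the unrelated failure disappears without a trace.
with contextlib.suppress(ValueError):
    fname.create_dataset(name="/output", data=None)
silently_skipped = "/output" not in fname


def create_if_missing(f, name, data):
    # Explicit check: only true duplicates are skipped.
    if name in f:
        return False
    f.create_dataset(name=name, data=data)
    return True


created = create_if_missing(fname, "/function", b"payload")
duplicate = create_if_missing(fname, "/function", b"payload")

try:
    create_if_missing(fname, "/output", None)
    surfaced = False
except ValueError:
    surfaced = True  # the real problem is now visible to the test
```

In a fixture helper, letting the second kind of error surface means a broken fixture fails at creation time instead of producing a confusing failure in a later assertion.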

coderabbitai · 2026-06-12T15:56:59Z

+        file_name = os.path.join(cache_directory, "test_mixed_i.h5")
+        a = 1
+        b = 2
+        dump(
+            file_name=file_name,
+            data_dict={"fn": my_funct, "args": [a], "kwargs": {"b": b}},
+        )
+        future = get_future_from_cache(
+            cache_directory=cache_directory,
+            cache_key="test_mixed",
+        )


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

test_get_future_from_file uses inconsistent cache filename/key and will fail deterministically.

Line 76 writes test_mixed_i.h5, but Line 85 queries cache_key="test_mixed" (which resolves to a different cache file stem). This makes the test assert the happy path while setting up a missing-file path.

Suggested fix

-        file_name = os.path.join(cache_directory, "test_mixed_i.h5")
+        file_name = os.path.join(cache_directory, "test_mixed.h5")

jan-janssen added 2 commits May 8, 2026 11:40

Switch from numpy void() to frombuffer()

072779d

fix reader

bcac159

[pre-commit.ci] auto fixes from pre-commit.com hooks

c65a65b

for more information, see https://pre-commit.ci

jan-janssen mentioned this pull request May 8, 2026

[bug] backend_write_file should catch serialization errors and write them to the output file #982

Open

jan-janssen marked this pull request as draft May 8, 2026 09:47

jan-janssen changed the title ~~Switch from numpy void() to frombuffer()~~ [Feature] Switch from numpy void() to frombuffer() - requires major release May 8, 2026

coderabbitai Bot reviewed May 8, 2026

jan-janssen and others added 2 commits June 12, 2026 15:54

Merge branch 'main' into npfrombuffer

55297c9

Add backwards compatibility tests

86cccd0

jan-janssen marked this pull request as ready for review June 12, 2026 15:52

coderabbitai Bot reviewed Jun 12, 2026

jan-janssen merged commit 1ec85bf into main Jun 12, 2026
36 checks passed

jan-janssen deleted the npfrombuffer branch June 12, 2026 16:12

jan-janssen changed the title ~~[Feature] Switch from numpy void() to frombuffer() - requires major release~~ [Feature] Switch from numpy void() to frombuffer() Jun 12, 2026

jan-janssen merged 5 commits into main from npfrombuffer
