
[Feature] Switch from numpy void() to frombuffer()#984

Merged
jan-janssen merged 5 commits into main from npfrombuffer on Jun 12, 2026

@jan-janssen jan-janssen commented May 8, 2026

Member

np.void is a fixed-size raw byte scalar. Its size is stored internally using NumPy's npy_intp (a Py_ssize_t-like type), but in practice some NumPy scalar/array paths still hit a signed 32-bit limit of 2**31 - 1 bytes (~2 GiB) for a single element / scalar buffer. So yes: the limit you are hitting is plausibly on the order of 2 GiB, and it is not a cloudpickle limit.
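To make the difference concrete, here is a minimal sketch comparing the two in-memory representations of the same payload. It uses the standard-library pickle as a stand-in for cloudpickle (an assumption made only to keep the example dependency-light); the payload itself is made up.

```python
import pickle

import numpy as np

# Stand-in payload; the PR serializes with cloudpickle.dumps(obj) instead.
blob = pickle.dumps({"answer": 42, "values": list(range(10))})

# np.void wraps the whole payload as ONE fixed-size element,
# so the element size grows with the pickle.
as_void = np.void(blob)
print(as_void.dtype.itemsize == len(blob))

# np.frombuffer exposes the same bytes as MANY one-byte elements,
# so only the array length grows with the pickle.
as_uint8 = np.frombuffer(blob, dtype=np.uint8)
print(as_uint8.shape == (len(blob),), as_uint8.dtype.itemsize)
```

With the uint8 layout, no single element ever exceeds one byte, so a per-element size limit should no longer come into play.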

For large pickles, don’t store the whole pickle as one np.void. Store it as a byte array dataset instead:

import cloudpickle
import numpy as np
import h5py

obj = ...
blob = cloudpickle.dumps(obj)

with h5py.File("x.h5", "w") as f:
    f.create_dataset(
        "pickle",
        data=np.frombuffer(blob, dtype=np.uint8),
        compression="gzip",  # optional
    )

Read it back:

with h5py.File("x.h5", "r") as f:
    blob = f["pickle"][()].tobytes()

obj = cloudpickle.loads(blob)

Summary by CodeRabbit

  • Bug Fixes

    • Improved handling of missing optional data groups so operations succeed with sensible defaults and errors propagate correctly when required data is absent.
  • Chores

    • Switched persisted payloads to a more compact, compressed binary storage format for safer, more reliable read/write across all retrieval paths.
  • Tests

    • Added backwards-compatibility tests for persistence, recovery, queue IDs and error propagation to ensure older cache files remain supported.
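The "sensible defaults" behaviour for missing optional groups can be sketched roughly as follows. This is not the code from hdf.py: a plain dict of pickled bytes stands in for the open HDF5 file, the standard-library pickle replaces cloudpickle, and the dataset names other than "function" and "input_args" as well as every default value are assumptions made for illustration.

```python
import pickle
from typing import Any, Mapping

# (dataset name, result key, default factory). Only "function" and "input_args"
# appear in this PR; the other names and all defaults are assumed.
OPTIONAL_FIELDS = (
    ("input_args", "args", tuple),
    ("input_kwargs", "kwargs", dict),
    ("resource_dict", "resource_dict", dict),
    ("error_log_file", "error_log_file", lambda: None),
)


def load_with_defaults(hdf: Mapping[str, bytes]) -> dict[str, Any]:
    """Return the stored task, filling in defaults for optional entries."""
    if "function" not in hdf:
        # The function is required, so its absence is still an error.
        raise TypeError("Function not found in the stored data.")
    data: dict[str, Any] = {"fn": pickle.loads(hdf["function"])}
    for dataset, key, default_factory in OPTIONAL_FIELDS:
        if dataset in hdf:
            data[key] = pickle.loads(hdf[dataset])
        else:
            # A fresh default per call avoids sharing mutable objects.
            data[key] = default_factory()
    return data


# Only the function and positional arguments are stored here.
stored = {"function": pickle.dumps(len), "input_args": pickle.dumps(([1, 2, 3],))}
task = load_with_defaults(stored)
print(task["fn"](*task["args"], **task["kwargs"]))
```

Using factories rather than literal defaults means two loaded tasks never share the same empty dict.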

@coderabbitai

coderabbitai Bot commented May 8, 2026

Contributor


Walkthrough

This PR changes HDF5 persistence to store cloudpickle byte streams as gzip-compressed uint8 arrays and updates all read paths to reconstruct objects via hdf['/key'][()].tobytes() → cloudpickle.loads(); load() also applies explicit defaults for missing optional groups. Tests for backwards compatibility are added.

Changes

HDF5 Serialization Format Upgrade

| Layer / File(s) | Summary |
|---|---|
| HDF5 Dump Serialization (src/executorlib/standalone/hdf.py) | dump() now encodes mapped groups as gzip-compressed uint8 byte arrays from cloudpickle.dumps instead of np.void(...). |
| HDF5 Load Deserialization (src/executorlib/standalone/hdf.py) | load() deserializes datasets using hdf["/key"][()].tobytes() → cloudpickle.loads(), sets defaults for optional fields (args, kwargs, resource_dict, error_log_file) when missing, and still raises TypeError if function is absent. |
| Accessor Function Deserialization (src/executorlib/standalone/hdf.py) | get_output(), get_runtime(), get_queue_id(), and _get_content_of_file() use [()].tobytes() + cloudpickle.loads() to read individual components; absence semantics preserved (e.g., runtime → 0.0). |
| Backwards-compatibility tests (tests/unit/standalone/test_hdf_backwards.py) | New tests write pickled values into HDF5 and verify load/get_output/get_runtime/get_queue_id/get_future_from_cache across multiple cache scenarios and missing-field cases. |
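One reason a single read path can plausibly serve both old and new cache files is that np.void scalars and uint8 arrays both expose tobytes(). The sketch below checks this on in-memory values only; they stand in for what reading a dataset with [()] would return, and the standard-library pickle replaces cloudpickle. Both substitutions are assumptions, not the PR's test code.

```python
import pickle

import numpy as np

expected = {"result": [1, 2, 3]}
payload = pickle.dumps(expected)  # stand-in for cloudpickle.dumps(...)

# Legacy layout: the whole pickle wrapped in one opaque scalar.
legacy_value = np.void(payload)
# Current layout: the pickle as a one-dimensional uint8 array.
current_value = np.frombuffer(payload, dtype=np.uint8)

# The same decode expression handles both representations.
for label, value in (("legacy", legacy_value), ("current", current_value)):
    restored = pickle.loads(value.tobytes())
    print(label, restored == expected)
```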

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • pyiron/executorlib#740: Tests and changes touching missing-key load semantics and accessor behavior in executorlib.standalone.hdf.

Poem

🐰 I nibble bytes from HDF trees,
Cloudpickle-wrapped in gzip breeze,
Arrays of uint8 snug and tight,
Old pickles wake to morning light,
A hopping test ensures delight.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 35.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and specifically describes the main change: switching from numpy void() to frombuffer() for HDF5 dataset serialization, which is reflected in the code changes in hdf.py and backwards compatibility tests. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@jan-janssen jan-janssen marked this pull request as draft May 8, 2026 09:47
@jan-janssen jan-janssen changed the title from "Switch from numpy void() to frombuffer()" to "[Feature] Switch from numpy void() to frombuffer() - requires major release" May 8, 2026
@codecov

codecov Bot commented May 8, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 94.24%. Comparing base (de44ab0) to head (86cccd0).

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #984   +/-   ##
=======================================
  Coverage   94.24%   94.24%           
=======================================
  Files          39       39           
  Lines        2119     2119           
=======================================
  Hits         1997     1997           
  Misses        122      122           


@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
src/executorlib/standalone/hdf.py (1)

39-43: ⚡ Quick win

Extract shared pickle/HDF encode-decode helpers to prevent format drift.

The serialization/deserialization expression is duplicated across many sites; centralizing it will make future format changes safer.

Proposed refactor
@@
 import cloudpickle
 import h5py
 import numpy as np
@@
+def _serialize_to_uint8(value: Any) -> np.ndarray:
+    return np.frombuffer(cloudpickle.dumps(value), dtype=np.uint8)
+
+
+def _deserialize_from_key(hdf: h5py.File, key: str) -> Any:
+    return cloudpickle.loads(hdf[f"/{key}"][()].tobytes())
+
+
 def dump(file_name: Optional[str], data_dict: dict) -> None:
@@
                     fname.create_dataset(
                         name="/" + group_dict[data_key],
-                        data=np.frombuffer(
-                            cloudpickle.dumps(data_value), dtype=np.uint8
-                        ),
+                        data=_serialize_to_uint8(data_value),
                         compression="gzip",
                     )
@@
-            data_dict["fn"] = cloudpickle.loads(hdf["/function"][()].tobytes())
+            data_dict["fn"] = _deserialize_from_key(hdf, "function")
@@
-            data_dict["args"] = cloudpickle.loads(hdf["/input_args"][()].tobytes())
+            data_dict["args"] = _deserialize_from_key(hdf, "input_args")

Also applies to: 59-79, 97-99, 126-126, 144-144, 230-230
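A runnable variant of the proposed helper pair might look like the sketch below. The names write_pickled and read_pickled are hypothetical (the review suggests _serialize_to_uint8 and _deserialize_from_key), the standard-library pickle is swapped in for cloudpickle, and an in-memory HDF5 file is used so that nothing is written to disk.

```python
import pickle
from typing import Any

import h5py
import numpy as np


def write_pickled(hdf: h5py.File, key: str, value: Any) -> None:
    """Store `value` under `key` as a gzip-compressed uint8 dataset."""
    hdf.create_dataset(
        name="/" + key,
        data=np.frombuffer(pickle.dumps(value), dtype=np.uint8),
        compression="gzip",
    )


def read_pickled(hdf: h5py.File, key: str) -> Any:
    """Load the object stored under `key`."""
    return pickle.loads(hdf["/" + key][()].tobytes())


# The core driver with backing_store=False keeps the file purely in memory.
with h5py.File("sketch.h5", "w", driver="core", backing_store=False) as hdf:
    write_pickled(hdf, "input_args", (1, 2))
    restored = read_pickled(hdf, "input_args")

print(restored)
```

Keeping the dtype and the compression setting inside one writer means a later format change touches a single function rather than every call site.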

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/executorlib/standalone/hdf.py` around lines 39 - 43, Replace the repeated
serialization/deserialization expression (cloudpickle.dumps(...) wrapped with
np.frombuffer(..., dtype=np.uint8) and compression="gzip") with two shared
helpers—e.g., encode_pickle_for_hdf(obj) that returns the uint8 ndarray ready to
write to HDF and returns any needed metadata, and
decode_pickle_from_hdf(uint8_array) that calls cloudpickle.loads on the buffer
when reading; update every site currently doing cloudpickle.dumps +
np.frombuffer + compression="gzip" (the duplicated expression) to call
encode_pickle_for_hdf when writing and decode_pickle_from_hdf when reading so
all places use a single implementation and a single compression/format contract.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 4d5ec173-654e-4ee6-90a8-c48f7fcbfa04

📥 Commits

Reviewing files that changed from the base of the PR and between 4a63cb1 and c65a65b.

📒 Files selected for processing (1)
  • src/executorlib/standalone/hdf.py

@jan-janssen jan-janssen marked this pull request as ready for review June 12, 2026 15:52

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/unit/standalone/test_hdf_backwards.py`:
- Around line 76-86: The test creates a cache file named via file_name =
os.path.join(cache_directory, "test_mixed_i.h5") but later calls
get_future_from_cache(..., cache_key="test_mixed"), causing a mismatch; update
either the dump call or the cache_key so they match (e.g. change the filename
stem to "test_mixed.h5" or change cache_key to "test_mixed_i") so dump(...) and
get_future_from_cache(...) refer to the same cache key/file when using the dump
and get_future_from_cache functions.
- Around line 34-38: The code is suppressing all ValueError around
fname.create_dataset which can hide real serialization/write errors; instead,
check for an existing dataset name before creating to only skip duplicates:
construct the target name as name = "/" + group_dict[data_key], then if name not
in fname call fname.create_dataset(name=name,
data=np.void(cloudpickle.dumps(data_value))); otherwise skip (or optionally log)
so only true duplicate cases are avoided and other ValueErrors still surface.
Reference: contextlib.suppress, fname.create_dataset, np.void,
cloudpickle.dumps, group_dict, data_key.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3d10f469-e13f-4af7-bf87-8a8d3a8b113e

📥 Commits

Reviewing files that changed from the base of the PR and between c65a65b and 86cccd0.

📒 Files selected for processing (2)
  • src/executorlib/standalone/hdf.py
  • tests/unit/standalone/test_hdf_backwards.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/executorlib/standalone/hdf.py

Comment on lines +34 to +38
with contextlib.suppress(ValueError):
    fname.create_dataset(
        name="/" + group_dict[data_key],
        data=np.void(cloudpickle.dumps(data_value)),
    )


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Avoid suppressing all ValueError in legacy dump helper; it can hide real write/serialization failures.

Line 34 currently suppresses any ValueError, not just duplicate-dataset cases. In these compatibility tests, that can mask broken fixture creation and produce misleading downstream failures.

Suggested fix
-                        with contextlib.suppress(ValueError):
-                            fname.create_dataset(
-                                name="/" + group_dict[data_key],
-                                data=np.void(cloudpickle.dumps(data_value)),
-                            )
+                        dataset_name = "/" + group_dict[data_key]
+                        if dataset_name in fname:
+                            continue
+                        fname.create_dataset(
+                            name=dataset_name,
+                            data=np.void(cloudpickle.dumps(data_value)),
+                        )
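The concern can be illustrated without HDF5 at all. In the sketch below, FakeFile is a hypothetical stand-in whose create_dataset raises ValueError both for a duplicate name and for an empty payload; the second failure mode is invented purely to represent "some other real error" and says nothing about how h5py behaves.

```python
import contextlib


class FakeFile:
    """Hypothetical stand-in for a file object; not h5py's behaviour."""

    def __init__(self) -> None:
        self._data: dict[str, bytes] = {}

    def __contains__(self, name: str) -> bool:
        return name in self._data

    def create_dataset(self, name: str, data: bytes) -> None:
        if name in self._data:
            raise ValueError(f"dataset {name!r} already exists")
        if not data:
            raise ValueError("refusing to store an empty payload")
        self._data[name] = data


def write_suppressing(store: FakeFile, name: str, data: bytes) -> None:
    # Swallows duplicates AND every other ValueError.
    with contextlib.suppress(ValueError):
        store.create_dataset(name, data)


def write_checked(store: FakeFile, name: str, data: bytes) -> None:
    # Skips duplicates explicitly; any other ValueError still surfaces.
    if name in store:
        return
    store.create_dataset(name, data)


store = FakeFile()
write_suppressing(store, "/output", b"")  # fails silently, nothing is stored
print("/output" in store)

try:
    write_checked(store, "/output", b"")
except ValueError as error:
    print("surfaced:", error)
```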

Comment on lines +76 to +86
file_name = os.path.join(cache_directory, "test_mixed_i.h5")
a = 1
b = 2
dump(
    file_name=file_name,
    data_dict={"fn": my_funct, "args": [a], "kwargs": {"b": b}},
)
future = get_future_from_cache(
    cache_directory=cache_directory,
    cache_key="test_mixed",
)


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

test_get_future_from_file uses inconsistent cache filename/key and will fail deterministically.

Line 76 writes test_mixed_i.h5, but Line 85 queries cache_key="test_mixed" (which resolves to a different cache file stem). This makes the test assert the happy path while setting up a missing-file path.

Suggested fix
-        file_name = os.path.join(cache_directory, "test_mixed_i.h5")
+        file_name = os.path.join(cache_directory, "test_mixed.h5")

@jan-janssen jan-janssen merged commit 1ec85bf into main Jun 12, 2026
36 checks passed
@jan-janssen jan-janssen deleted the npfrombuffer branch June 12, 2026 16:12
@jan-janssen jan-janssen changed the title from "[Feature] Switch from numpy void() to frombuffer() - requires major release" to "[Feature] Switch from numpy void() to frombuffer()" Jun 12, 2026