Skip to content

docs: user guide + runnable examples for distributing expressions#1547

Open
timsaucer wants to merge 4 commits into
apache:mainfrom
timsaucer:pr4-docs-examples
Open

docs: user guide + runnable examples for distributing expressions#1547
timsaucer wants to merge 4 commits into
apache:mainfrom
timsaucer:pr4-docs-examples

Conversation

@timsaucer
Copy link
Copy Markdown
Member

@timsaucer timsaucer commented May 15, 2026

Which issue does this PR close?

Closes #1520

Rationale for this change

PRs 1-3 close the round-trip for Python UDFs and add the toggle that controls it. None of that is discoverable without user-facing documentation. This PR ships the user guide page that explains the multiprocessing / Ray / datafusion-distributed patterns, the runnable examples, and the centralized Security section that nails down what the toggle does and does not protect against.

What changes are included in this PR?

User guide.

  • The shipped-expression model (what travels inline vs by name).
  • Worker setup (datafusion.ipc.set_worker_ctx).
  • Sender-side configuration (datafusion.ipc.set_sender_ctx and SessionContext.with_python_udf_inlining).
  • A Security section that is the single source of truth for the cloudpickle / pickle.loads threat model.
  • Pointers to the runnable examples.

Runnable examples.

  • examples/multiprocessing_pickle_expr.pyPool.map of a closure-capturing UDF across processes, with the worker initializer wiring the worker context. The closure carries non-trivial state to demonstrate that captured state survives the round-trip.
  • examples/ray_pickle_expr.py — Ray actor analogue.
  • `examples/datafusion-ffi-example/python/tests/_test_pickle_strict_ffi.py`` — strict-mode refusal exercised end-to-end against an FFI capsule scalar UDF. Kept under the FFI example crate because it needs that crate's compiled artifacts.

Are there any user-facing changes?

Docs and examples only. No code behavior changes, no new public APIs.

Wraps up the Expr-pickle work with the user-facing material:

* docs/source/user-guide/io/distributing_work.rst — new user guide
  page covering the multiprocessing, Ray, and datafusion-distributed
  patterns. Includes the Security section that is the canonical home
  for the cloudpickle / pickle.loads threat model.
* docs/source/user-guide/io/index.rst — toctree entry.
* examples/multiprocessing_pickle_expr.py — runnable example: a
  Pool.map of a closure-capturing UDF across processes, with worker
  context registration in the initializer.
* examples/ray_pickle_expr.py — Ray actor analogue.
* examples/datafusion-ffi-example/python/tests/_test_pickle_strict_ffi.py
  — exercises the strict-mode refusal end to end against an FFI
  capsule scalar UDF (kept under the FFI example crate because the
  test needs that crate's compiled artifacts).
* examples/README.md — index entries for the new files.

Also tightens three docstrings that previously duplicated the
security warning so they point at the canonical Security section
instead:

* PythonLogicalCodec::with_python_udf_inlining (rustdoc): one-line
  summary plus a relative pointer to distributing_work.rst and the
  upstream Python pickle module security warning.
* SessionContext.with_python_udf_inlining: one-sentence summary plus
  :doc: link to the user guide.
* datafusion.ipc module docstring: cross-reference to the user guide
  for the full pattern.

The crate-level codec.rs module rustdoc also updates "pure-Python
scalar UDFs" to "scalar / aggregate / window UDFs" now that all three
are covered.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@timsaucer timsaucer force-pushed the pr4-docs-examples branch from 20b3338 to f7550ec Compare May 26, 2026 19:19
…ne UDFs

Reviewer feedback on the Expr-pickle PRs (apache#1544) asked that the
cloudpickle portability caveats be discoverable on the user-facing
page, not only in docstrings. The distributing_work.rst page is the
designated canonical home for the distribution story, so add them here:

* New 'Portability requirements for inline Python UDFs' subsection
  covering the matching-Python-minor-version requirement and the
  by-value vs by-reference import-capture rule (imported modules must
  be importable on the worker).
* Qualify the 'fully portable' Python-UDF bullet to point at the new
  requirements.
* Cross-reference the new subsection from the closure-capture note.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@timsaucer timsaucer changed the title docs: user guide + runnable examples for distributing expressions (4/4) docs: user guide + runnable examples for distributing expressions May 26, 2026
timsaucer added 2 commits May 26, 2026 15:35
Two codec.rs docstrings were reworded in PR4 in ways that dropped
information:

* try_encode_python_scalar_udf: restore the `DFPYUDF` family prefix +
  version byte description of the payload framing (PR4 had collapsed it
  to `DFPYUDF1` prefix, dropping the version-byte mention).
* cloudpickle cached-handle comment: restore "The encode/decode helpers
  above" wording.
The 'Worker layout' docstring described tasks as `(expr, label)` but
the code builds and unpacks them as `(label, expr)`. Correct the doc
to match.
@timsaucer timsaucer marked this pull request as ready for review May 26, 2026 19:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Allow pickling PyExpr

1 participant