
feat(java): expose ArrowArrayStream export on LanceScanner #7259

Open: sezruby wants to merge 2 commits into lance-format:main from sezruby:feat-java-export-arrow-stream

sezruby commented Jun 12, 2026

Contributor

Summary

Add LanceScanner#exportArrowStream(ArrowArrayStream) — a public wrapper around the existing private native openStream(long) JNI call. Lets callers populate a stream they allocated themselves instead of going through scanBatches(), which immediately imports the result into a Java ArrowReader backed by Lance's BufferAllocator.

Why

Consumers loaded under a different classloader and/or pinned to a different Apache Arrow version cannot safely share org.apache.arrow.vector.* classes with Lance — the JVM treats them as distinct types even when the bytecode is identical. The C Data Interface struct is stable across Arrow versions, so handing the C struct's memory address across the boundary is the only correct integration shape.
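
The type-identity claim above can be checked with the JDK alone. The sketch below is a toy, not Arrow or Lance code: it hand-assembles a minimal class file for a hypothetical class named `SharedVector` (standing in for a class under org.apache.arrow.vector), defines those same bytes through two separate class loaders, and compares the results. Every name in it (`ClassLoaderIdentityDemo`, `DefiningLoader`, `SharedVector`) is made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

final class ClassLoaderIdentityDemo {
    // Hypothetical class name; stands in for a shared library type.
    static final String NAME = "SharedVector";

    /** Builds the bytes of an empty public class that extends java.lang.Object. */
    static byte[] minimalClassFile(String internalName) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeInt(0xCAFEBABE);                            // magic
            out.writeShort(0);                                   // minor_version
            out.writeShort(52);                                  // major_version (Java 8)
            out.writeShort(5);                                   // constant_pool_count (4 entries + 1)
            out.writeByte(7); out.writeShort(2);                 // #1 CONSTANT_Class -> #2
            out.writeByte(1); out.writeUTF(internalName);        // #2 CONSTANT_Utf8: this class
            out.writeByte(7); out.writeShort(4);                 // #3 CONSTANT_Class -> #4
            out.writeByte(1); out.writeUTF("java/lang/Object");  // #4 CONSTANT_Utf8: superclass
            out.writeShort(0x0021);                              // ACC_PUBLIC | ACC_SUPER
            out.writeShort(1);                                   // this_class
            out.writeShort(3);                                   // super_class
            out.writeShort(0);                                   // interfaces_count
            out.writeShort(0);                                   // fields_count
            out.writeShort(0);                                   // methods_count
            out.writeShort(0);                                   // attributes_count
            out.flush();
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Defines the same bytes in two independent loaders and returns both classes. */
    static Class<?>[] loadTwice() {
        byte[] bytes = minimalClassFile(NAME);
        Class<?> first = new DefiningLoader().define(NAME, bytes);
        Class<?> second = new DefiningLoader().define(NAME, bytes);
        return new Class<?>[] {first, second};
    }

    public static void main(String[] args) {
        Class<?>[] pair = loadTwice();
        System.out.println("same name: " + pair[0].getName().equals(pair[1].getName())); // true
        System.out.println("same Class object: " + (pair[0] == pair[1]));                // false
        System.out.println("assignable: " + pair[0].isAssignableFrom(pair[1]));          // false
    }
}

/** Loader that defines a class directly from bytes instead of delegating to its parent. */
final class DefiningLoader extends ClassLoader {
    Class<?> define(String name, byte[] bytes) {
        return defineClass(name, bytes, 0, bytes.length);
    }
}
```

Because the two `Class` objects are not assignable to each other, a reference typed against one cannot be cast to the other, which is the failure mode a plain Java object handoff would hit across class loaders.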

A concrete consumer is the gluten-spark / Velox integration tracked at apache/gluten#12263. gluten-spark builds against Arrow 15 (matching what Spark 3.5 ships and Velox uses); Lance Java SDK is on Arrow 18. With this method, gluten can:

try (ArrowArrayStream stream = ArrowArrayStream.allocateNew(glutenAllocator)) {
  scanner.exportArrowStream(stream);
  try (ArrowReader reader = Data.importArrayStream(glutenAllocator, stream)) {
    // import each batch into Velox via gluten's own Arrow 15 stack
  }
}

…where glutenAllocator is a Spark-task-managed BufferAllocator (ArrowReservationListener plumbing for memory accounting). Lance never sees Java Arrow on this side; ownership stays with the caller via the C Data Interface release callback.

What changed

  • LanceScanner#exportArrowStream(ArrowArrayStream) — new public method, ~7 lines + Javadoc with usage example. Mirrors the body of scanBatches() minus the local stream allocation and the Data.importArrayStream step.
  • No native code touched. The underlying JNI hook already existed; it was just not reachable from outside the class.
  • Test testDatasetScannerExportArrowStream exercises the full path: caller allocates the C stream from its own RootAllocator, scanner fills the C struct, caller imports into an ArrowReader and validates batch contents (40 rows over 2 batches of 20).
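
The shape of the change and of the test can be modeled without Arrow or JNI. The following is a toy sketch only, not Lance's implementation: `StreamSlot` is a hypothetical stand-in for a caller-allocated ArrowArrayStream, `ToyScanner` for the scanner, and batches are plain `int[]` arrays rather than Arrow data. It assumes nothing about the real method bodies beyond what the description states: the export entry point fills a slot the caller owns, and the scanBatches-style entry point performs the same fill after allocating the slot itself and then consumes it eagerly.

```java
import java.util.ArrayList;
import java.util.List;

final class ExportStreamDemo {
    public static void main(String[] args) {
        ToyScanner scanner = new ToyScanner(40, 20);
        int rows = 0;
        int batches = 0;
        // The caller allocates the slot and decides when it is released.
        try (StreamSlot slot = new StreamSlot()) {
            scanner.exportStream(slot);
            for (int[] b = slot.nextBatch(); b != null; b = slot.nextBatch()) {
                rows += b.length;
                batches++;
            }
        }
        // Prints: 40 rows in 2 batches, open streams: 0
        System.out.println(rows + " rows in " + batches + " batches, open streams: " + scanner.openStreams);
    }
}

/** Hypothetical stand-in for a caller-allocated stream struct: empty until a producer fills it. */
final class StreamSlot implements AutoCloseable {
    /** What the producer installs: a batch source plus a release hook. */
    interface Producer {
        int[] nextBatch(); // returns null at end of stream
        void release();
    }

    private Producer producer;

    void fill(Producer p) {
        if (producer != null) throw new IllegalStateException("slot already filled");
        producer = p;
    }

    int[] nextBatch() {
        if (producer == null) throw new IllegalStateException("slot is empty or released");
        return producer.nextBatch();
    }

    /** The consumer triggers release; the producer only supplies the hook. */
    @Override
    public void close() {
        if (producer != null) {
            producer.release();
            producer = null;
        }
    }
}

/** Toy scanner that produces row ids 0..totalRows-1 in fixed-size batches. */
final class ToyScanner {
    private final int totalRows;
    private final int batchSize;
    int openStreams = 0; // lets callers observe that release ran

    ToyScanner(int totalRows, int batchSize) {
        this.totalRows = totalRows;
        this.batchSize = batchSize;
    }

    /** Export-style entry point: fill a slot the caller allocated. */
    void exportStream(StreamSlot slot) {
        openStreams++;
        slot.fill(new StreamSlot.Producer() {
            private int nextRow = 0;

            @Override
            public int[] nextBatch() {
                if (nextRow >= totalRows) return null;
                int n = Math.min(batchSize, totalRows - nextRow);
                int[] batch = new int[n];
                for (int i = 0; i < n; i++) batch[i] = nextRow + i;
                nextRow += n;
                return batch;
            }

            @Override
            public void release() {
                openStreams--;
            }
        });
    }

    /** scanBatches-style entry point: same fill step, plus local allocation and eager import. */
    List<int[]> scanBatches() {
        List<int[]> batches = new ArrayList<>();
        try (StreamSlot slot = new StreamSlot()) {
            exportStream(slot);
            for (int[] b = slot.nextBatch(); b != null; b = slot.nextBatch()) batches.add(b);
        }
        return batches;
    }
}
```

In this model the only difference between the two entry points is who allocates the slot and who drains it, which is the property the description attributes to the new method.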

Backwards compatibility

Pure addition. scanBatches(), schema(), countRows(), getStats(), close() all unchanged. No native ABI change.

Test plan

  • ./mvnw test -Dtest=ScannerTest#testDatasetScannerExportArrowStream: Java compilation and spotless are clean locally. The full test run depends on a working lance-jni Rust build, which hit an unrelated aws-smithy-types registry issue on my machine, so I'm relying on CI for the JNI-linked verification.
  • Existing testDatasetScannerColumns covers the scanBatches() path so any regression in the shared openStream JNI call would surface there.

Add public LanceScanner#exportArrowStream(ArrowArrayStream) wrapping the
existing private native openStream(long) call. Lets callers populate a
stream they allocated themselves (typically from their own BufferAllocator)
instead of going through scanBatches(), which immediately imports into a
Java ArrowReader backed by Lance's allocator.

The motivation is consumers loaded under a different classloader / pinned
to a different Apache Arrow version. Sharing org.apache.arrow.vector.*
classes across classloader boundaries is not safe, but the C Data Interface
struct is stable across Arrow versions — so handing the C struct's memory
address through is the only correct integration boundary.

A concrete consumer is the ongoing gluten-spark/Velox integration tracked
at apache/gluten#12263, which needs to import Lance scan output into its
own Arrow 15 + Velox runtime; gluten-spark is built against Arrow 15 while
Lance is on Arrow 18.

Test exercises the full path end-to-end: caller allocates a stream from
its own RootAllocator, scanner fills the C struct, caller imports into an
ArrowReader and validates batch contents.

github-actions bot added the A-java (Java bindings + JNI) and enhancement (New feature or request) labels on Jun 12, 2026

sezruby commented Jun 14, 2026

Contributor Author

@hamersaw @jackye1995 Could you review this PR? It is needed to support the Lance reader in Gluten/Spark. I'll open a lance-spark PR once this one is merged. Thanks!
