perf(mem_wal): parallelize fresh-tier source planning and execution by hamersaw · Pull Request #7257 · lance-format/lance

hamersaw · 2026-06-12T20:01:46Z

Summary

The LSM FTS and vector search planners (`LsmFtsSearchPlanner`, `LsmVectorSearchPlanner`) built each source's plan in a sequential `for` loop and unioned the arms under a single `SortPreservingMergeExec`. The merge polls every union arm from one task, so per-arm CPU work (posting/index decode, BM25 and distance scoring) was serialized even though the underlying IO awaits interleave. As a result, wall time grew linearly with the flushed-generation count.

This:

- builds the per-source plans concurrently with `try_join_all` (FTS + vector),
- runs the cross-source block-list PK hashing concurrently (`block_list.rs`),
- wraps the union in a round-robin `RepartitionExec` via a new `spawn_union_arms` helper so each arm gets its own driver task.

Rows stay disjoint across partitions, so the per-partition TopK + sort-preserving merge semantics are unchanged.
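
The "semantics are unchanged" claim rests on a simple property: when partitions hold disjoint rows, taking each partition's top k and then merging and keeping the first k gives the same rows as a global top k. Below is a minimal sketch of that property, not Lance code. `Row`, `top_k`, and `merge_top_k` are hypothetical names, and a lower score is assumed to rank first (as with a distance).

```rust
/// A hypothetical scored row: (row id, distance). Lower distance ranks first.
type Row = (u64, u32);

/// Per-partition TopK: sort one partition by distance and keep its best `k` rows.
fn top_k(mut rows: Vec<Row>, k: usize) -> Vec<Row> {
    // Tie-break on row id so the ordering is deterministic.
    rows.sort_by_key(|&(id, dist)| (dist, id));
    rows.truncate(k);
    rows
}

/// Combine per-partition outputs and keep the global best `k`.
/// A real sort-preserving merge streams with a heap; a sort is enough for a sketch.
fn merge_top_k(partitions: Vec<Vec<Row>>, k: usize) -> Vec<Row> {
    top_k(partitions.into_iter().flatten().collect(), k)
}

fn main() {
    // Three partitions with disjoint row ids.
    let partitions: Vec<Vec<Row>> = vec![
        vec![(1, 40), (2, 5), (3, 90)],
        vec![(4, 7), (5, 60)],
        vec![(6, 1), (7, 8), (8, 3)],
    ];
    let k = 3;

    // Path A: per-partition TopK, then merge.
    let per_partition: Vec<Vec<Row>> =
        partitions.iter().cloned().map(|p| top_k(p, k)).collect();
    let merged = merge_top_k(per_partition, k);

    // Path B: global TopK over all rows at once.
    let global = top_k(partitions.into_iter().flatten().collect(), k);

    assert_eq!(merged, global);
    println!("{:?}", merged);
}
```

The equality holds because no row outside a partition's own top k can appear in the global top k, so truncating each partition early loses nothing.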

Changes

- `scanner/exec.rs`: `spawn_union_arms` helper (round-robin repartition over the union).
- `scanner/fts_search.rs`, `scanner/vector_search.rs`: concurrent per-source plan builds + `spawn_union_arms` over the union.
- `scanner/block_list.rs`: concurrent flushed-generation PK-hash loads.
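
The concurrent per-source plan builds can be sketched without external crates. The PR uses the async `try_join_all`; the version below swaps in scoped threads from the standard library so it compiles on its own. `Source`, `Plan`, `build_plan`, and `build_plans_concurrently` are hypothetical stand-ins, not Lance types.

```rust
use std::thread;

/// Hypothetical stand-in for one fresh-tier source (for example a flushed generation).
#[derive(Debug, Clone)]
struct Source {
    generation: u64,
    healthy: bool,
}

/// Hypothetical stand-in for a per-source physical plan.
#[derive(Debug, PartialEq)]
struct Plan {
    generation: u64,
}

/// Hypothetical planner step; a real one would open indexes and perform IO.
fn build_plan(source: &Source) -> Result<Plan, String> {
    if source.healthy {
        Ok(Plan { generation: source.generation })
    } else {
        Err(format!("generation {} failed to plan", source.generation))
    }
}

/// Build every per-source plan concurrently, keeping input order.
/// Returns the first error in input order, similar in spirit to `try_join_all`.
fn build_plans_concurrently(sources: &[Source]) -> Result<Vec<Plan>, String> {
    thread::scope(|scope| {
        // Spawn one scoped thread per source so all builds are in flight at once.
        let handles: Vec<_> = sources
            .iter()
            .map(|source| scope.spawn(move || build_plan(source)))
            .collect();
        // Join in input order so the output lines up with `sources`.
        handles
            .into_iter()
            .map(|handle| handle.join().expect("planner thread panicked"))
            .collect()
    })
}

fn main() {
    let sources: Vec<Source> = (0..4)
        .map(|generation| Source { generation, healthy: true })
        .collect();
    let plans = build_plans_concurrently(&sources).expect("all sources plan");
    println!("built {} plans", plans.len());
}
```

With a sequential loop the total planning time is roughly the sum of the per-source times; with concurrent builds it approaches the slowest single source, which is the effect the PR is after.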

Validation

Validated end-to-end against a WAL FTS benchmark on minikube with object storage behind a 10ms/GET latency proxy. Read latency over a fresh tier as a function of flushed-generation count, p50:

| generations | before | after |
| --- | --- | --- |
| 2 | 1,164ms | 565ms |
| 5 | 1,983ms | 610ms |
| 10 | 3,585ms | 660ms |
| 18 | 6,071ms | 759ms |

Per-generation slope dropped from ~290ms/gen to ~12ms/gen.
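
As a rough arithmetic check on the table, the sketch below computes an endpoint-to-endpoint slope (first row to last row) and the per-row speedup. The endpoint estimate comes out near 307 ms/gen before and 12 ms/gen after, the same order as the quoted figures; the exact ~290 value presumably comes from a different fit. The constant and function names are illustrative only.

```rust
/// (flushed generations, p50 before in ms, p50 after in ms), copied from the table.
const P50_MS: [(u32, f64, f64); 4] = [
    (2, 1164.0, 565.0),
    (5, 1983.0, 610.0),
    (10, 3585.0, 660.0),
    (18, 6071.0, 759.0),
];

/// Slope in ms per generation between the first and last points.
fn endpoint_slope(points: &[(u32, f64)]) -> f64 {
    let (x0, y0) = points[0];
    let (x1, y1) = points[points.len() - 1];
    (y1 - y0) / f64::from(x1 - x0)
}

fn main() {
    let before: Vec<(u32, f64)> = P50_MS.iter().map(|&(g, b, _)| (g, b)).collect();
    let after: Vec<(u32, f64)> = P50_MS.iter().map(|&(g, _, a)| (g, a)).collect();
    println!("before: {:.1} ms/gen", endpoint_slope(&before));
    println!("after:  {:.1} ms/gen", endpoint_slope(&after));
    // Speedup at each generation count.
    for &(g, b, a) in P50_MS.iter() {
        println!("{:>2} generations: {:.1}x faster", g, b / a);
    }
}
```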

🤖 Generated with Claude Code

The LSM FTS and vector search planners built each source's plan in a sequential `for` loop and unioned the arms under a single `SortPreservingMergeExec`. The merge polls every union arm from one task, so per-arm CPU (posting/index decode, BM25 and distance scoring) serialized even though the underlying IO awaits interleave; wall time grew linearly with the flushed-generation count.

Build the per-source plans concurrently with `try_join_all`, run the cross-source block-list PK hashing concurrently, and wrap the union in a round-robin `RepartitionExec` (`spawn_union_arms`) so each arm gets its own driver task. Rows stay disjoint across partitions, so the per-partition TopK + sort-preserving merge semantics are unchanged.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

codecov · 2026-06-15T15:59:34Z

Codecov Report

❌ Patch coverage is 91.07143% with 5 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...st/lance/src/dataset/mem_wal/scanner/block_list.rs | 80.00% | 2 Missing and 3 partials ⚠️ |


LuQQiu · 2026-06-15T22:46:31Z

```diff
+) -> lance_core::Result<std::sync::Arc<dyn datafusion::physical_plan::ExecutionPlan>> {
+    use datafusion::physical_plan::repartition::RepartitionExec;
+    let n = union.properties().partitioning.partition_count();
+    let repart = RepartitionExec::try_new(
```


`RepartitionExec` may be a little heavy for solving a concurrency issue, since it is a shuffle. Does it improve the result a lot?
If we already do a per-source TopK (it seems so, could you confirm?), there may not be many final result candidates, and we could use `SortExec` directly?

You're right, the `RepartitionExec` is a bit heavy for what we actually want to achieve here. I removed it and am now relying on the underlying constructs to give us execution parallelism.

The fresh-tier FTS and vector planners wrapped the per-source union in a round-robin `RepartitionExec` to give each arm its own driver task. That fan-out is redundant: the downstream `SortPreservingMergeExec` already spawns one task per input partition (one per union arm) via `spawn_buffered` on the multi-thread runtime, and the heavy per-arm CPU (IVF_HNSW partition search, BM25/WAND scoring) already runs on the CPU pool via `spawn_cpu` at the leaf. The extra exchange only added a channel hop and a round-robin reshuffle for no measurable gain.

Remove `spawn_union_arms` and return the `UnionExec` directly. The concurrent per-source plan building (`try_join_all`) is unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
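
The reasoning in this commit message, that a merge which already drives each input on its own task needs no extra exchange in front of it, can be illustrated with standard-library threads and bounded channels. This is an analogy under stated assumptions, not DataFusion's implementation: `spawn_input` and `merge_sorted` are hypothetical names, OS threads stand in for async tasks, and plain `u32` values stand in for record batches.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;
use std::sync::mpsc::{sync_channel, Receiver};
use std::thread;

/// Drive one sorted input on its own thread, buffering up to `buffer` items.
/// Mirrors the idea of one task per merge input.
fn spawn_input(sorted: Vec<u32>, buffer: usize) -> Receiver<u32> {
    let (tx, rx) = sync_channel(buffer);
    thread::spawn(move || {
        for item in sorted {
            // Stop early if the merger hung up (for example after reaching its limit).
            if tx.send(item).is_err() {
                break;
            }
        }
    });
    rx
}

/// K-way merge of ascending inputs, stopping after `limit` items.
fn merge_sorted(inputs: Vec<Receiver<u32>>, limit: usize) -> Vec<u32> {
    // Min-heap of (next value, input index), seeded with the head of each input.
    let mut heap = BinaryHeap::new();
    for (idx, rx) in inputs.iter().enumerate() {
        if let Ok(value) = rx.recv() {
            heap.push(Reverse((value, idx)));
        }
    }
    let mut out = Vec::new();
    while out.len() < limit {
        match heap.pop() {
            Some(Reverse((value, idx))) => {
                out.push(value);
                // Refill from the input that just yielded the smallest value.
                if let Ok(next) = inputs[idx].recv() {
                    heap.push(Reverse((next, idx)));
                }
            }
            None => break,
        }
    }
    out
}

fn main() {
    let arms: Vec<Vec<u32>> = vec![vec![1, 4, 9], vec![2, 3, 10], vec![0, 8]];
    let inputs: Vec<Receiver<u32>> =
        arms.into_iter().map(|arm| spawn_input(arm, 2)).collect();
    println!("{:?}", merge_sorted(inputs, 5));
}
```

Each input already makes progress independently of the merger, so adding another fan-out stage ahead of the merge would only add one more channel hop per item.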

github-actions Bot added the performance label Jun 12, 2026

hamersaw force-pushed the perf/wal-parallelize-sources branch from 026dcde to b4b6bbf Compare June 15, 2026 15:21

LuQQiu reviewed Jun 15, 2026

hamersaw requested a review from LuQQiu June 16, 2026 12:27

LuQQiu approved these changes Jun 16, 2026

hamersaw merged commit 27570a3 into lance-format:main Jun 16, 2026
31 checks passed

hamersaw deleted the perf/wal-parallelize-sources branch June 16, 2026 18:30

hamersaw merged 2 commits into lance-format:main from hamersaw:perf/wal-parallelize-sources
