perf(mem_wal): parallelize fresh-tier source planning and execution #7257

Merged
hamersaw merged 2 commits into lance-format:main from hamersaw:perf/wal-parallelize-sources
Jun 16, 2026

@hamersaw

Contributor

Summary

The LSM FTS and vector search planners (LsmFtsSearchPlanner, LsmVectorSearchPlanner) built each source's plan in a sequential for loop and unioned the arms under a single SortPreservingMergeExec. The merge polls every union arm from one task, so per-arm CPU (posting/index decode, BM25 and distance scoring) serialized even though the underlying IO awaits interleave — wall time grew linearly with the flushed-generation count.

This PR:

  • builds the per-source plans concurrently with try_join_all (FTS + vector),
  • runs the cross-source block-list PK hashing concurrently (block_list.rs),
  • wraps the union in a round-robin RepartitionExec via a new spawn_union_arms helper so each arm gets its own driver task.

Rows stay disjoint across partitions, so the per-partition TopK + sort-preserving merge semantics are unchanged.
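
The move from a sequential `for` loop to concurrent plan building can be sketched roughly as follows. This is a minimal, std-only illustration, not the PR's code: it swaps in scoped OS threads for the async `try_join_all` the PR uses, and `Source`, `Plan`, `build_plan`, and `build_plans_concurrently` are hypothetical names invented for the example.

```rust
use std::thread;
use std::time::Duration;

/// Hypothetical stand-in for one fresh-tier source (a flushed generation).
#[derive(Debug, Clone)]
pub struct Source {
    pub generation: u64,
}

/// Hypothetical stand-in for a built per-source plan (one union arm).
#[derive(Debug, PartialEq)]
pub struct Plan {
    pub generation: u64,
}

/// Pretend planning cost: a short sleep stands in for index-open latency.
/// Generation 0 is treated as invalid purely to exercise the error path.
pub fn build_plan(source: &Source) -> Result<Plan, String> {
    thread::sleep(Duration::from_millis(20));
    if source.generation == 0 {
        return Err("generation 0 is not a valid source".to_string());
    }
    Ok(Plan { generation: source.generation })
}

/// Build every plan concurrently, keep the input order, and return the
/// first error (in input order) if any build fails.
pub fn build_plans_concurrently(sources: &[Source]) -> Result<Vec<Plan>, String> {
    thread::scope(|scope| {
        // Start one build per source before waiting on any of them.
        let handles: Vec<_> = sources
            .iter()
            .map(|source| scope.spawn(move || build_plan(source)))
            .collect();

        // Join in input order; collecting into Result stops at the first Err.
        handles
            .into_iter()
            .map(|handle| match handle.join() {
                Ok(result) => result,
                Err(_) => Err("planner thread panicked".to_string()),
            })
            .collect()
    })
}
```

With N sources, the sequential loop pays roughly N times the per-source planning latency, while the concurrent version pays roughly the latency of the slowest single build. The resulting plans keep their input order, so the union arms line up with the sources as before.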

Changes

  • scanner/exec.rs: spawn_union_arms helper (round-robin repartition over the union).
  • scanner/fts_search.rs, scanner/vector_search.rs: concurrent per-source plan builds + spawn_union_arms over the union.
  • scanner/block_list.rs: concurrent flushed-generation PK-hash loads.

Validation

Validated end-to-end against a WAL FTS benchmark on minikube with object storage behind a 10ms/GET latency proxy. Read latency over a fresh tier as a function of flushed-generation count, p50:

| Generations | Before | After |
|---|---|---|
| 2 | 1,164ms | 565ms |
| 5 | 1,983ms | 610ms |
| 10 | 3,585ms | 660ms |
| 18 | 6,071ms | 759ms |

Per-generation slope dropped from ~290ms/gen to ~12ms/gen.
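
As a rough cross-check of the slope figures, the sketch below computes an endpoint slope (last row minus first row, divided by the generation delta) from the table rows. The method is an assumption; the PR does not state how the quoted slopes were derived. On these rows it gives about 307 ms/gen before and about 12 ms/gen after.

```rust
/// (generations, p50 latency in ms) rows copied from the benchmark table.
pub const BEFORE: [(f64, f64); 4] =
    [(2.0, 1164.0), (5.0, 1983.0), (10.0, 3585.0), (18.0, 6071.0)];
pub const AFTER: [(f64, f64); 4] =
    [(2.0, 565.0), (5.0, 610.0), (10.0, 660.0), (18.0, 759.0)];

/// Average per-generation slope between the first and last row, in ms/gen.
/// Assumes at least two rows with distinct generation counts.
pub fn endpoint_slope(points: &[(f64, f64)]) -> f64 {
    let (x0, y0) = points[0];
    let (x1, y1) = points[points.len() - 1];
    (y1 - y0) / (x1 - x0)
}
```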

🤖 Generated with Claude Code

The LSM FTS and vector search planners built each source's plan in a
sequential `for` loop and unioned the arms under a single
`SortPreservingMergeExec`. The merge polls every union arm from one task,
so per-arm CPU (posting/index decode, BM25 and distance scoring) serialized
even though the underlying IO awaits interleave — wall time grew linearly
with the flushed-generation count.

Build the per-source plans concurrently with `try_join_all`, run the
cross-source block-list PK hashing concurrently, and wrap the union in a
round-robin `RepartitionExec` (`spawn_union_arms`) so each arm gets its own
driver task. Rows stay disjoint across partitions, so the per-partition
TopK + sort-preserving merge semantics are unchanged.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
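
The claim that disjoint rows leave the per-partition TopK + merge semantics unchanged can be illustrated with a small sketch. It is not DataFusion code: `Row`, `top_k`, and `merge_top_k` are hypothetical names, and the hand-written k-way merge only stands in for what a sort-preserving merge with a limit does over already-sorted arms.

```rust
use std::cmp::Ordering;

/// Hypothetical scored row: (primary key, score). Higher score is better.
pub type Row = (u64, f32);

/// Order rows by descending score, breaking ties by ascending primary key.
fn by_score_desc(a: &Row, b: &Row) -> Ordering {
    b.1.total_cmp(&a.1).then(a.0.cmp(&b.0))
}

/// Per-arm TopK: sort one arm's rows by score and keep the best `k`.
pub fn top_k(mut rows: Vec<Row>, k: usize) -> Vec<Row> {
    rows.sort_by(by_score_desc);
    rows.truncate(k);
    rows
}

/// Stand-in for a sort-preserving merge with a limit: repeatedly take the
/// best head among the already-sorted arms until `k` rows are emitted.
pub fn merge_top_k(arms: Vec<Vec<Row>>, k: usize) -> Vec<Row> {
    let mut cursors = vec![0usize; arms.len()];
    let mut out = Vec::with_capacity(k);
    while out.len() < k {
        let mut best: Option<usize> = None;
        for (i, arm) in arms.iter().enumerate() {
            if cursors[i] >= arm.len() {
                continue; // this arm is exhausted
            }
            best = match best {
                // Keep the current best arm unless arm `i` has a better head.
                Some(j)
                    if by_score_desc(&arms[j][cursors[j]], &arm[cursors[i]])
                        != Ordering::Greater =>
                {
                    Some(j)
                }
                _ => Some(i),
            };
        }
        match best {
            Some(i) => {
                out.push(arms[i][cursors[i]]);
                cursors[i] += 1;
            }
            None => break, // every arm is exhausted
        }
    }
    out
}
```

Because no row appears in more than one arm, any row in the global top `k` is also in its own arm's top `k`, so merging the per-arm results yields the same rows as a TopK over all rows at once.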
@hamersaw hamersaw force-pushed the perf/wal-parallelize-sources branch from 026dcde to b4b6bbf June 15, 2026 15:21
@codecov

codecov Bot commented Jun 15, 2026

Codecov Report

❌ Patch coverage is 91.07143% with 5 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...st/lance/src/dataset/mem_wal/scanner/block_list.rs | 80.00% | 2 Missing and 3 partials ⚠️ |

) -> lance_core::Result<std::sync::Arc<dyn datafusion::physical_plan::ExecutionPlan>> {
use datafusion::physical_plan::repartition::RepartitionExec;
let n = union.properties().partitioning.partition_count();
let repart = RepartitionExec::try_new(

Contributor

RepartitionExec may be a little heavy for solving a concurrency issue; it's a shuffle.
Does it improve the result a lot?
If we already do per-source TopK (it seems so, could you confirm?), the final result candidates may not be that many, so could we use SortExec directly?

Contributor Author

You're right, the RepartitionExec is a bit heavy for what we actually want to achieve here. I removed it and am now relying on the underlying constructs to give us execution parallelism.

The fresh-tier FTS and vector planners wrapped the per-source union in a
round-robin RepartitionExec to give each arm its own driver task. That
fan-out is redundant: the downstream SortPreservingMergeExec already
spawns one task per input partition (one per union arm) via
spawn_buffered on the multi-thread runtime, and the heavy per-arm CPU
(IVF_HNSW partition search, BM25/WAND scoring) already runs on the CPU
pool via spawn_cpu at the leaf. The extra exchange only added a channel
hop and a round-robin reshuffle for no measurable gain.

Remove spawn_union_arms and return the UnionExec directly. The
concurrent per-source plan building (try_join_all) is unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@hamersaw hamersaw requested a review from LuQQiu June 16, 2026 12:27
@hamersaw hamersaw merged commit 27570a3 into lance-format:main Jun 16, 2026
31 checks passed
@hamersaw hamersaw deleted the perf/wal-parallelize-sources branch June 16, 2026 18:30