fix: deduplicate BTree flat page row addresses by majin1102 · Pull Request #7235 · lance-format/lance

majin1102 · 2026-06-11T09:13:16Z

Summary

BTree segment merge preserves value ordering with UnionExec -> SortPreservingMergeExec(value ASC) -> train_btree_index. However, when an updated row contributes the same indexed value again, the old segment row and the new delta row can both survive as duplicate (value, row address) entries. Later, loading the flat page sorts the row addresses and builds a RowAddrTreeMap, which rejects the duplicate row addresses with a misleading error: "from_sorted_iter called with non-sorted input".

This PR fixes the write path during BTree segment merge. After the existing value-ordered merge, it scans each equal-value group and drops duplicate row addresses for that value before training the new BTree pages. The read path stays unchanged, and the merge still avoids an extra global sort.
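
The per-group scan described above can be sketched roughly as follows. This is a minimal in-memory illustration, not the code from this PR: the function name dedup_within_value_groups is hypothetical, values are modeled as i64 and row addresses as u64, and the real merge operates on a record batch stream rather than a slice.

```rust
use std::collections::HashSet;

/// Drops repeated row addresses within each run of equal values.
/// Assumption: `rows` is already ordered by value; row addresses inside
/// a run may appear in any order.
fn dedup_within_value_groups(rows: &[(i64, u64)]) -> Vec<(i64, u64)> {
    let mut out = Vec::with_capacity(rows.len());
    let mut current: Option<i64> = None;
    let mut seen: HashSet<u64> = HashSet::new();
    for &(value, row_addr) in rows {
        // A new value starts a new group, so the seen-set is reset.
        if current != Some(value) {
            current = Some(value);
            seen.clear();
        }
        // `insert` returns false when the address was already seen in this group.
        if seen.insert(row_addr) {
            out.push((value, row_addr));
        }
    }
    out
}
```

Note that the same row address under a different value is kept, since the seen-set is cleared at every group boundary.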

Why this approach

The bug is produced while writing merged BTree segments, so the fix belongs in BTreeIndex::merge_segments().
Deduplicating only exact (value, row address) duplicates is the minimal behavior needed for this repro.
A simple previous-row check is not sufficient because the merge is ordered by value only; duplicate row addresses can be separated by other rows with the same value.
Adding row_id as a secondary sort key would be heavier than a linear scan over the already value-ordered stream.
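
To make the third point concrete, here is a small sketch of a previous-row check applied to a value-ordered run. The helper name dedup_adjacent and the (i64, u64) row model are assumptions for illustration only.

```rust
/// The "previous-row check": removes a row only when it equals the row
/// immediately before it.
fn dedup_adjacent(rows: &[(i64, u64)]) -> Vec<(i64, u64)> {
    let mut out = rows.to_vec();
    // Vec::dedup removes consecutive repeated elements only.
    out.dedup();
    out
}
```

For the input [(7, 3), (7, 5), (7, 3)], which is ordered by value, the two copies of (7, 3) are separated by (7, 5), so dedup_adjacent returns all three rows and the duplicate survives.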

Tests

cargo fmt --all
cargo test -p lance-index scalar::btree -- --nocapture
cargo test -p lance test_btree_merge_deduplicates_row_addrs -- --nocapture
cargo clippy -p lance-index --tests -- -D warnings
git diff --check

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b6c2fbe6f3

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-11T09:18:43Z

+        if let Some(archive) = Self::load_from_files(&base, object_store.clone(), config).await? {
+            return Ok(archive);


Merge retained archives before writing the next archive

When max_entries is reached, load_or_new seeds the next write from only the newest archive file, even though max_archive_files keeps older archive files around. In repeated cleanups, entries that were drained from the newest file but still exist in an older retained archive are not carried forward; once cleanup_old_archives deletes that older file, metadata for cleaned versions is lost despite the retention setting. Use the existing all-file scan/merge path (or equivalent) when preparing the archive to flush.
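
A rough sketch of the "merge all retained archives" idea, under loud assumptions: each archive is modeled as a map from version number to a metadata string, and merge_archives is a hypothetical helper, not a function from the code under review.

```rust
use std::collections::BTreeMap;

/// Merges archives given oldest-first. When the same version appears in
/// more than one archive, the entry from the newer archive wins.
fn merge_archives(archives_oldest_first: &[BTreeMap<u64, String>]) -> BTreeMap<u64, String> {
    let mut merged = BTreeMap::new();
    for archive in archives_oldest_first {
        for (version, meta) in archive {
            // Later (newer) archives overwrite earlier entries.
            merged.insert(*version, meta.clone());
        }
    }
    merged
}
```

Seeding the next write from the merged map, rather than from the newest file alone, would carry forward entries that only exist in older retained files.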

codecov · 2026-06-11T12:28:40Z

Codecov Report

❌ Patch coverage is 83.78378% with 6 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
rust/lance-index/src/scalar/btree.rs	83.78%	1 Missing and 5 partials ⚠️

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5981bc162c

chatgpt-codex-connector · 2026-06-13T08:39:38Z

+) -> SendableRecordBatchStream {
+    let schema = stream.schema();
+    let deduplicated = stream::try_unfold(
+        (stream, None::<ScalarValue>, HashSet::<u64>::new()),


Bound dedup state for low-cardinality BTree merges

When optimizing/merging BTree segments for a low-cardinality indexed column (for example boolean, all-null, or a hot key), this HashSet is kept for the entire equal-value group and only cleared when the value changes. Since the merged stream is ordered only by value, a single group can span the whole dataset, so optimize_indices now holds O(group size) row ids in memory instead of streaming page-sized batches; large indexes with one dominant value can OOM during routine segment merge. Use a bounded strategy such as ordering by (value, row_id) and adjacent duplicate removal, or otherwise avoid retaining every row id for the value group.
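
The bounded alternative suggested here can be sketched as an iterator adapter. This assumes the input is already sorted by (value, row_id), so duplicates are adjacent. The name dedup_sorted and the (i64, u64) item type are illustrative; the real code would work on a SendableRecordBatchStream.

```rust
/// Yields each (value, row_id) pair once from an input sorted by
/// (value, row_id). Only the previously yielded pair is retained, so the
/// extra state is constant regardless of group size.
fn dedup_sorted<I>(sorted: I) -> impl Iterator<Item = (i64, u64)>
where
    I: Iterator<Item = (i64, u64)>,
{
    let mut prev: Option<(i64, u64)> = None;
    sorted.filter(move |item| {
        if prev == Some(*item) {
            // Same pair as the previous one: drop it.
            false
        } else {
            prev = Some(*item);
            true
        }
    })
}
```

The trade-off is the one the PR description mentions: the merge must order by row_id as a secondary key, in exchange for not holding every row id of a large group in memory.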

wjones127

This doesn't feel like the right solution. Two problems I see:

1. The real bug, it seems to me, is that training receives any duplicates in its input at all.
2. This only handles the case where the old and new scalar values for that row id are the same. But we could also have bugs where the value has been updated, and thus the row id is duplicated across different unique values. Again, I think this should be properly fixed by (1). As you point out, deduplicating across the whole input would be expensive.
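
The second problem can be illustrated with a small diagnostic. It is a sketch for a test or debugging session only, since it holds state for the whole input, and the name row_ids_with_multiple_values is hypothetical.

```rust
use std::collections::HashMap;

/// Returns, in ascending order, the row ids that appear under more than
/// one distinct value. Per-value deduplication cannot remove these.
fn row_ids_with_multiple_values(rows: &[(i64, u64)]) -> Vec<u64> {
    let mut values_by_row: HashMap<u64, Vec<i64>> = HashMap::new();
    for &(value, row_id) in rows {
        let values = values_by_row.entry(row_id).or_default();
        if !values.contains(&value) {
            values.push(value);
        }
    }
    let mut out: Vec<u64> = values_by_row
        .into_iter()
        .filter(|(_, values)| values.len() > 1)
        .map(|(row_id, _)| row_id)
        .collect();
    out.sort_unstable();
    out
}
```

For the rows [(1, 10), (2, 10), (2, 11)], row id 10 is reported: it sits in two different value groups, so a per-group scan sees it once in each and keeps both.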

majin1102 · 2026-06-22T10:18:26Z

Yes, sorry, I didn't understand this clearly at first.
This has been fixed by #7320.

github-actions Bot added the A-python (Python bindings), A-index (Vector index, linalg, tokenizer), A-ci (CI / build workflows), and bug (Something isn't working) labels Jun 11, 2026

majin1102 force-pushed the codex/fix-btree-duplicate-row-addresses branch from b6c2fbe to c7051a2 Compare June 11, 2026 09:14

chatgpt-codex-connector Bot reviewed Jun 11, 2026

majin1102 force-pushed the codex/fix-btree-duplicate-row-addresses branch 2 times, most recently from 2e90802 to 5150dcc Compare June 11, 2026 11:46

majin1102 marked this pull request as draft June 11, 2026 11:57

majin1102 force-pushed the codex/fix-btree-duplicate-row-addresses branch 4 times, most recently from 853bb57 to d4ffd8a Compare June 12, 2026 03:29

majin.nathan added 3 commits June 12, 2026 21:41

fix: deduplicate BTree flat page row addresses

4d67dae

fix: keep BTree read path unchanged

01c5e33

fix: deduplicate BTree merge rows by value

39578ca

majin1102 force-pushed the codex/fix-btree-duplicate-row-addresses branch from a335792 to 39578ca Compare June 12, 2026 13:59

Merge branch 'main' into codex/fix-btree-duplicate-row-addresses

5981bc1

majin1102 marked this pull request as ready for review June 13, 2026 08:35

chatgpt-codex-connector Bot reviewed Jun 13, 2026

wjones127 reviewed Jun 16, 2026

majin1102 closed this Jun 22, 2026

majin1102 wants to merge 4 commits into lance-format:main from majin1102:codex/fix-btree-duplicate-row-addresses
