
fix: deduplicate BTree flat page row addresses#7235

Closed
majin1102 wants to merge 4 commits into lance-format:main from majin1102:codex/fix-btree-duplicate-row-addresses

@majin1102 majin1102 commented Jun 11, 2026

Contributor

Summary

Fixes #7230.

BTree segment merge preserves value ordering with UnionExec -> SortPreservingMergeExec(value ASC) -> train_btree_index. However, when an updated row contributes the same indexed value again, the old segment row and the new delta row can both survive as duplicate (value, row address) entries. Later, loading the flat page sorts the row addresses and builds a RowAddrTreeMap, which rejects the duplicates with a misleading error: "from_sorted_iter called with non-sorted input".
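
To make the failure mode concrete, here is a minimal sketch, not Lance's actual code. It assumes the builder requires strictly increasing input, which would explain why a duplicate is reported as "non-sorted". The names `from_sorted_strict` and `load_flat_page` are hypothetical stand-ins.

```rust
/// Hypothetical stand-in for a builder that requires strictly increasing
/// input. This is NOT the real `RowAddrTreeMap` implementation.
fn from_sorted_strict(addrs: &[u64]) -> Result<Vec<u64>, String> {
    // `windows(2)` yields each adjacent pair. A repeated address fails the
    // `a < b` check exactly like a genuinely out-of-order pair would.
    if addrs.windows(2).all(|w| w[0] < w[1]) {
        Ok(addrs.to_vec())
    } else {
        Err("from_sorted_iter called with non-sorted input".to_string())
    }
}

/// Mimics the read path described above: sort the page's row addresses,
/// then hand them to the strict builder.
fn load_flat_page(mut row_addrs: Vec<u64>) -> Result<Vec<u64>, String> {
    row_addrs.sort_unstable();
    from_sorted_strict(&row_addrs)
}
```

Under this assumption, `load_flat_page(vec![3, 1, 2])` succeeds because sorting fixes the order, while `load_flat_page(vec![7, 3, 7])` fails even after sorting, because the two `7`s are adjacent and equal.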

This PR fixes the write path during BTree segment merge. After the existing value-ordered merge, it scans each equal-value group and drops duplicate row addresses for that value before training the new BTree pages. The read path stays unchanged, and the merge still avoids an extra global sort.

Why this approach

  • The bug is produced while writing merged BTree segments, so the fix belongs in BTreeIndex::merge_segments().
  • Deduplicating only exact (value, row address) duplicates is the minimal behavior needed for this repro.
  • A simple previous-row check is not sufficient because the merge is ordered by value only; duplicate row addresses can be separated by other rows with the same value.
  • Adding row_id as a secondary sort key would be heavier than a linear scan over the already value-ordered stream.
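
The difference between a previous-row check and a per-group scan can be sketched on plain tuples. This is an illustration only: the real change operates on record batch streams, and the function names and the `(i64, u64)` row representation below are assumptions made for the example.

```rust
use std::collections::HashSet;

/// Drops repeated (value, row address) pairs from input that is already
/// ordered by value. The set of seen addresses is reset at each new value.
fn dedup_within_value_groups(rows: &[(i64, u64)]) -> Vec<(i64, u64)> {
    let mut out = Vec::with_capacity(rows.len());
    let mut current: Option<i64> = None;
    let mut seen: HashSet<u64> = HashSet::new();
    for &(value, addr) in rows {
        if current != Some(value) {
            // A new equal-value group starts here.
            current = Some(value);
            seen.clear();
        }
        // `insert` returns false when the address was already present.
        if seen.insert(addr) {
            out.push((value, addr));
        }
    }
    out
}

/// Previous-row check only: removes a pair only if it equals its immediate
/// predecessor.
fn dedup_adjacent_only(rows: &[(i64, u64)]) -> Vec<(i64, u64)> {
    let mut out = rows.to_vec();
    out.dedup();
    out
}
```

For the input `[(1, 10), (1, 11), (1, 10), (2, 12)]`, the adjacent-only check keeps all four rows because the two `(1, 10)` entries are separated by `(1, 11)`, while the per-group scan drops the second `(1, 10)`.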

Tests

  • cargo fmt --all
  • cargo test -p lance-index scalar::btree -- --nocapture
  • cargo test -p lance test_btree_merge_deduplicates_row_addrs -- --nocapture
  • cargo clippy -p lance-index --tests -- -D warnings
  • git diff --check

@github-actions github-actions Bot added the labels A-python (Python bindings), A-index (Vector index, linalg, tokenizer), A-ci (CI / build workflows), and bug (Something isn't working) on Jun 11, 2026
@majin1102 majin1102 force-pushed the codex/fix-btree-duplicate-row-addresses branch from b6c2fbe to c7051a2 Compare June 11, 2026 09:14

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b6c2fbe6f3

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread rust/lance/src/dataset/archive.rs Outdated
Comment on lines +537 to +538
if let Some(archive) = Self::load_from_files(&base, object_store.clone(), config).await? {
return Ok(archive);


P2 Badge Merge retained archives before writing the next archive

When max_entries is reached, load_or_new seeds the next write from only the newest archive file, even though max_archive_files keeps older archive files around. In repeated cleanups, entries that were drained from the newest file but still exist in an older retained archive are not carried forward; once cleanup_old_archives deletes that older file, metadata for cleaned versions is lost despite the retention setting. Use the existing all-file scan/merge path (or equivalent) when preparing the archive to flush.

Useful? React with 👍 / 👎.

@majin1102 majin1102 force-pushed the codex/fix-btree-duplicate-row-addresses branch 2 times, most recently from 2e90802 to 5150dcc Compare June 11, 2026 11:46
@majin1102 majin1102 marked this pull request as draft June 11, 2026 11:57
@codecov

codecov Bot commented Jun 11, 2026


Codecov Report

❌ Patch coverage is 83.78378% with 6 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| rust/lance-index/src/scalar/btree.rs | 83.78% | 1 Missing and 5 partials ⚠️ |

📢 Thoughts on this report? Let us know!

@majin1102 majin1102 force-pushed the codex/fix-btree-duplicate-row-addresses branch 4 times, most recently from 853bb57 to d4ffd8a Compare June 12, 2026 03:29
@majin1102 majin1102 force-pushed the codex/fix-btree-duplicate-row-addresses branch from a335792 to 39578ca Compare June 12, 2026 13:59
@majin1102 majin1102 marked this pull request as ready for review June 13, 2026 08:35

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5981bc162c


) -> SendableRecordBatchStream {
let schema = stream.schema();
let deduplicated = stream::try_unfold(
(stream, None::<ScalarValue>, HashSet::<u64>::new()),


P1 Badge Bound dedup state for low-cardinality BTree merges

When optimizing/merging BTree segments for a low-cardinality indexed column (for example boolean, all-null, or a hot key), this HashSet is kept for the entire equal-value group and only cleared when the value changes. Since the merged stream is ordered only by value, a single group can span the whole dataset, so optimize_indices now holds O(group size) row ids in memory instead of streaming page-sized batches; large indexes with one dominant value can OOM during routine segment merge. Use a bounded strategy such as ordering by (value, row_id) and adjacent duplicate removal, or otherwise avoid retaining every row id for the value group.
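
The bounded alternative suggested here can be sketched as a two-way merge over inputs sorted by (value, row id), with adjacent duplicate removal. This is a simplified illustration on in-memory slices under assumed types, not the DataFusion plan the PR uses; `merge_dedup` is a hypothetical name.

```rust
/// Merges two runs, each already sorted by (value, row id), and drops exact
/// duplicates. Because equal pairs become adjacent in the merged order, only
/// the last emitted pair is consulted, so the dedup state does not grow with
/// the size of an equal-value group.
fn merge_dedup(old: &[(i64, u64)], new: &[(i64, u64)]) -> Vec<(i64, u64)> {
    let (mut i, mut j) = (0, 0);
    let mut out: Vec<(i64, u64)> = Vec::with_capacity(old.len() + new.len());
    while i < old.len() || j < new.len() {
        // Take from `old` when `new` is exhausted, or when old's head is not
        // greater than new's head (tuples compare lexicographically).
        let take_old = j == new.len() || (i < old.len() && old[i] <= new[j]);
        let next = if take_old {
            i += 1;
            old[i - 1]
        } else {
            j += 1;
            new[j - 1]
        };
        // Adjacent duplicate removal: compare only with the last emitted pair.
        if out.last() != Some(&next) {
            out.push(next);
        }
    }
    out
}
```

In a streaming setting the output vector would be replaced by a single remembered pair. The trade-off is that the inputs must be ordered by the composite key rather than by value alone.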

Useful? React with 👍 / 👎.

@wjones127 wjones127 left a comment

Contributor


This doesn't feel like the right solution. Two problems I see:

  1. The real bug seems to me that training would have any duplicates in the input.
  2. This only handles the case where the old and new scalar value for that row id are the same. But we could also have bugs where the value has been updated, and thus the row id is duplicated but across different unique values. Again, I think this should be properly fixed by (1). As you point out, deduplicating across the whole input would be expensive.
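
The second point can be illustrated with a small check over the training input. This is a hypothetical test-time helper on assumed `(value, row id)` tuples, not part of Lance, and it holds every row id in memory, so it is not a candidate fix for the merge path.

```rust
use std::collections::HashMap;

/// Returns, in ascending order, every row id that appears more than once in
/// the training input, regardless of which value it is attached to.
fn duplicated_row_ids(rows: &[(i64, u64)]) -> Vec<u64> {
    let mut counts: HashMap<u64, usize> = HashMap::new();
    for &(_, row_id) in rows {
        *counts.entry(row_id).or_insert(0) += 1;
    }
    let mut dups: Vec<u64> = counts
        .into_iter()
        .filter(|&(_, n)| n > 1)
        .map(|(row_id, _)| row_id)
        .collect();
    dups.sort_unstable();
    dups
}
```

For `[(5, 10), (5, 11), (9, 10)]`, where row 10 was updated from value 5 to value 9, the helper reports row 10. A dedup that only looks inside each equal-value group would leave both entries for row 10 in place, since they sit in different groups.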

@majin1102

Contributor Author

> This doesn't feel like the right solution. Two problems I see:
>
>   1. The real bug seems to me that training would have any duplicates in the input.
>   2. This only handles the case where the old and new scalar value for that row id are the same. But we could also have bugs where the value has been updated, and thus the row id is duplicated but across different unique values. Again, I think this should be properly fixed by (1). As you point out, deduplicating across the whole input would be expensive.

Yes, sorry, I didn't understand this clearly.
This has been fixed by #7320.

@majin1102 majin1102 closed this Jun 22, 2026

Labels

A-ci (CI / build workflows), A-index (Vector index, linalg, tokenizer), A-python (Python bindings), bug (Something isn't working)

Development

Successfully merging this pull request may close these issues.

BTree scalar index can contain duplicate row addresses

2 participants