fix: deduplicate BTree flat page row addresses#7235
Conversation
b6c2fbe to
c7051a2
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b6c2fbe6f3
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if let Some(archive) = Self::load_from_files(&base, object_store.clone(), config).await? { | ||
| return Ok(archive); |
There was a problem hiding this comment.
Merge retained archives before writing the next archive
When max_entries is reached, load_or_new seeds the next write from only the newest archive file, even though max_archive_files keeps older archive files around. In repeated cleanups, entries that were drained from the newest file but still exist in an older retained archive are not carried forward; once cleanup_old_archives deletes that older file, metadata for cleaned versions is lost despite the retention setting. Use the existing all-file scan/merge path (or equivalent) when preparing the archive to flush.
Useful? React with 👍 / 👎.
2e90802 to
5150dcc
Compare
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
853bb57 to
d4ffd8a
Compare
a335792 to
39578ca
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5981bc162c
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| ) -> SendableRecordBatchStream { | ||
| let schema = stream.schema(); | ||
| let deduplicated = stream::try_unfold( | ||
| (stream, None::<ScalarValue>, HashSet::<u64>::new()), |
There was a problem hiding this comment.
Bound dedup state for low-cardinality BTree merges
When optimizing/merging BTree segments for a low-cardinality indexed column (for example boolean, all-null, or a hot key), this HashSet is kept for the entire equal-value group and only cleared when the value changes. Since the merged stream is ordered only by value, a single group can span the whole dataset, so optimize_indices now holds O(group size) row ids in memory instead of streaming page-sized batches; large indexes with one dominant value can OOM during routine segment merge. Use a bounded strategy such as ordering by (value, row_id) and adjacent duplicate removal, or otherwise avoid retaining every row id for the value group.
Useful? React with 👍 / 👎.
wjones127
left a comment
There was a problem hiding this comment.
This doesn't feel like the right solution. Two problems I see:
- The real bug seems to me that training would have any duplicates in the input.
- This only handles the case where the old and new scalar value for that row id are the same. But we could also have bugs where the value has been updated, and thus the row id is duplicated but across different unique values. Again, I think this should be properly fixed by (1). As you point out, deduplicating across the whole input would be expensive.
Ye, sorry I did't get this clearly. |
Summary
Fixes #7230.
BTree segment merge preserves value ordering with
UnionExec -> SortPreservingMergeExec(value ASC) -> train_btree_index. However, when an updated row contributes the same indexed value again, the old segment row and the new delta row can both survive as duplicate(value, row address)entries. Later, loading the flat page sorts row addresses and buildsRowAddrTreeMap, which rejects duplicate row addresses with the misleadingfrom_sorted_iter called with non-sorted inputerror.This PR fixes the write path during BTree segment merge. After the existing value-ordered merge, it scans each equal-value group and drops duplicate row addresses for that value before training the new BTree pages. The read path stays unchanged, and the merge still avoids an extra global sort.
Why this approach
BTreeIndex::merge_segments().(value, row address)duplicates is the minimal behavior needed for this repro.row_idas a secondary sort key would be heavier than a linear scan over the already value-ordered stream.Tests
cargo fmt --allcargo test -p lance-index scalar::btree -- --nocapturecargo test -p lance test_btree_merge_deduplicates_row_addrs -- --nocapturecargo clippy -p lance-index --tests -- -D warningsgit diff --check