fix: merge_insert silently drops matches when a leading payload column is all-null #7251
Open
Ar-maan05 wants to merge 2 commits into
## Problem

A partial-schema `merge_insert` (`when_matched_update_all`) against a table that has a scalar index on the join key can silently update **0 rows** (no error, no warning) when the first column of the source is all-null. Dropping the index makes it work again.

Reported as lancedb/lancedb#3515 (and the related lancedb/lancedb#3177).

Minimal repro (from the lancedb issue):

```python
schema = pa.schema([
    pa.field("vector", pa.list_(pa.float32(), 4), nullable=True),  # all None
    pa.field("path", pa.string(), nullable=False),                 # join key
    pa.field("status", pa.utf8()),
    pa.field("file_size", pa.int64()),
])
tbl = db.create_table("test", schema=schema)
tbl.add(...)  # 1000 rows, vector = None
tbl.create_scalar_index("path", index_type="BTREE")
tbl.merge_insert("path").when_matched_update_all().execute(updates)  # 128 rows
# -> num_updated_rows == 0 (expected 128)
```

## Root cause

A scalar index on the join key routes the merge through the legacy `Merger` (see `can_use_create_plan`: `would_use_scalar_index` disables the v2 fast path). The `Merger` reads a full-outer-join stream and, for each row, decides whether the row came from the source side, the target side, or both, by checking whether the join **keys** are NULL-padded.

But `extract_selections` checked the columns at positions `[0, num_keys)` instead of the actual key columns:

```rust
let in_left = Self::not_all_null(combined_batch, 0, num_keys)?;
let in_right = Self::not_all_null(combined_batch, right_offset, num_keys)?;
```

This assumes the key columns are physically first. They are not: a partial-schema source preserves the user's column order, so here column 0 is `vector`. On the target side that column is all-null (the original rows were inserted with `vector = None`), so `in_right` was `false` for **every matched row** → `in_both` empty → 0 updates, silently.

The existing full-schema indexed test only passed by luck: its column 0 happened to be non-null on both sides.
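To make the failure mode concrete, here is a toy model in plain Python. It is not Lance code: it assumes a joined row can be treated as a flat list holding the source columns followed by the target columns in the same order, and the names `classify_by_position`, `matched_row`, and the sample values are invented for illustration.

```python
# Simplified model (not Lance code): a joined row is a flat list holding the
# source columns followed by the target columns in the same order.
COLUMNS = ["vector", "path", "status", "file_size"]  # user's column order
NUM_KEYS = 1                                         # the join key is "path"
RIGHT_OFFSET = len(COLUMNS)                          # where the target half starts


def not_all_null(row, offset, length):
    """Positional check: True if any value in row[offset:offset+length] is non-null."""
    return any(value is not None for value in row[offset:offset + length])


def classify_by_position(row):
    """Mimic the old logic, which inspects positions [0, NUM_KEYS) of each half."""
    in_left = not_all_null(row, 0, NUM_KEYS)
    in_right = not_all_null(row, RIGHT_OFFSET, NUM_KEYS)
    return in_left, in_right


# A row that matched on path == "a.txt". The vector is None on both sides
# here; the target-side null alone is enough to make in_both false.
matched_row = [
    None, "a.txt", "done", 10,     # source half
    None, "a.txt", "pending", 5,   # target half
]

in_left, in_right = classify_by_position(matched_row)
in_both = in_left and in_right
print(in_right, in_both)  # False False: the check read "vector", never "path"
```

In this model the row clearly matched on `path`, yet the positional check never looks at `path`, so the row is not counted as present on both sides.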
## Fix

Locate the join-key columns by name and test those (the target half carries the same columns in the same order, offset by `right_offset`):

```rust
let source_key_cols = self.params.on.iter()
    .map(|key| combined_batch.schema().index_of(key))...;
let target_key_cols = source_key_cols.iter().map(|c| c + right_offset)...;
let in_left = Self::not_all_null(combined_batch, &source_key_cols)?;
let in_right = Self::not_all_null(combined_batch, &target_key_cols)?;
```

`not_all_null` now takes an explicit column-index slice instead of a contiguous `(offset, len)` range.

## Tests

Added `test_repro_3515_partial_schema_fully_indexed`, parameterized over storage versions V2_0 / V2_1 / V2_2, mirroring the issue (all-null leading vector column, scalar index covering every fragment, partial-schema update). It fails on `main` (0 updates) and passes with the fix.

All 143 tests in the `merge_insert` module pass; `cargo fmt --all --check` and `cargo clippy -p lance` are clean.
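The by-name lookup described under Fix can be sketched with a toy model in plain Python. This is an illustration under assumptions, not the actual Rust change: a joined row is taken to be a flat list with the source columns first and the target columns after them in the same order, and `key_column_indices`, `classify_by_name`, and the sample rows are hypothetical names and data.

```python
# Simplified model (not Lance code) of locating join keys by name.
COLUMNS = ["vector", "path", "status", "file_size"]  # user's column order
KEYS = ["path"]                                      # join-key column names
RIGHT_OFFSET = len(COLUMNS)                          # where the target half starts


def key_column_indices(columns, keys, right_offset):
    """Resolve each key name to its index in the source half and the target half."""
    source_cols = [columns.index(key) for key in keys]
    target_cols = [col + right_offset for col in source_cols]
    return source_cols, target_cols


def not_all_null(row, indices):
    """True if any of the columns at the given indices is non-null."""
    return any(row[i] is not None for i in indices)


def classify_by_name(row):
    """Return (in_left, in_right), decided from the key columns only."""
    source_cols, target_cols = key_column_indices(COLUMNS, KEYS, RIGHT_OFFSET)
    return not_all_null(row, source_cols), not_all_null(row, target_cols)


# Three rows as a full outer join might emit them, all with a null vector.
matched_row = [None, "a.txt", "done", 10, None, "a.txt", "pending", 5]
source_only = [None, "b.txt", "done", 7, None, None, None, None]
target_only = [None, None, None, None, None, "c.txt", "pending", 3]

for row in (matched_row, source_only, target_only):
    print(classify_by_name(row))
```

Because the check indexes the key columns explicitly, an all-null payload column in front of the key no longer affects which side a row is attributed to. Passing an index list rather than an `(offset, len)` range also covers keys that are not adjacent to each other.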