fix: merge_insert silently drops matches when a leading payload column is all-null #7251

Open
Ar-maan05 wants to merge 2 commits into lance-format:main from Ar-maan05:fix/merge-insert-partial-schema-3515


Ar-maan05 commented Jun 12, 2026


## Problem

A partial-schema `merge_insert` (`when_matched_update_all`) against a table that
has a scalar index on the join key can silently update **0 rows** — no error, no
warning — when the first column of the source is all-null. Dropping the index
makes it work again.

Reported as lancedb/lancedb#3515 (and the related lancedb/lancedb#3177).

Minimal repro (from the lancedb issue):

```python
schema = pa.schema([
    pa.field("vector", pa.list_(pa.float32(), 4), nullable=True),  # all None
    pa.field("path", pa.string(), nullable=False),                 # join key
    pa.field("status", pa.utf8()),
    pa.field("file_size", pa.int64()),
])
tbl = db.create_table("test", schema=schema)
tbl.add(...)                                  # 1000 rows, vector = None
tbl.create_scalar_index("path", index_type="BTREE")
tbl.merge_insert("path").when_matched_update_all().execute(updates)  # 128 rows
# -> num_updated_rows == 0   (expected 128)
```

## Root cause

A scalar index on the join key routes the merge through the legacy `Merger`
(see `can_use_create_plan`: `would_use_scalar_index` disables the v2 fast path).
The `Merger` reads a full-outer-join stream and, for each row, decides whether
the row came from the source side, the target side, or both, by checking whether
the join **keys** are NULL-padded.

But `extract_selections` checked the columns at positions `[0, num_keys)` instead
of the actual key columns:

```rust
let in_left  = Self::not_all_null(combined_batch, 0, num_keys)?;
let in_right = Self::not_all_null(combined_batch, right_offset, num_keys)?;
```

This assumes the key columns are physically first. They are not: a partial-schema
source preserves the user's column order, so here column 0 is `vector`. On the
target side that column is all-null (the original rows were inserted with
`vector = None`), so `in_right` was `false` for **every matched row** →
`in_both` empty → 0 updates, silently.

The existing full-schema indexed test only passed by luck: its column 0 happened
to be non-null on both sides.
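
To make the failure mode concrete, here is a minimal pure-Python model of the positional check. It is only a sketch under stated assumptions: the list-of-`(name, values)` batch layout, the helper `in_both_by_position`, and the sample rows are all hypothetical and are not Lance's actual data structures. The combined batch holds the source columns first, then the same columns from the target starting at `right_offset`.

```python
def not_all_null(columns, start, length):
    """Per-row mask: True where any column in [start, start + length) is non-null."""
    num_rows = len(columns[0][1])
    return [
        any(columns[i][1][row] is not None for i in range(start, start + length))
        for row in range(num_rows)
    ]


def in_both_by_position(columns, right_offset, num_keys):
    """Model of the old check: assume the keys are the first num_keys columns."""
    in_left = not_all_null(columns, 0, num_keys)
    in_right = not_all_null(columns, right_offset, num_keys)
    return [left and right for left, right in zip(in_left, in_right)]


# Hypothetical full-outer-join output with three rows:
# row 0 matched on path "a", row 1 exists only in the source,
# row 2 exists only in the target.
VECTOR_FIRST = [
    ("vector", [None, None, None]),   # source half, column 0 is all-null
    ("path", ["a", "z", None]),       # source join key
    ("status", ["new", "new", None]),
    ("vector", [None, None, None]),   # target half starts at right_offset = 3
    ("path", ["a", None, "b"]),       # target join key
    ("status", ["old", None, "old"]),
]

# Same rows, but with the join key physically first on both sides.
KEY_FIRST = [
    ("path", ["a", "z", None]),
    ("status", ["new", "new", None]),
    ("path", ["a", None, "b"]),       # target half starts at right_offset = 2
    ("status", ["old", None, "old"]),
]

# The matched row is lost when column 0 is the all-null vector column...
print(in_both_by_position(VECTOR_FIRST, right_offset=3, num_keys=1))
# ...and found only because the key happens to sit at position 0.
print(in_both_by_position(KEY_FIRST, right_offset=2, num_keys=1))
```

In this model the vector-first layout reports no matched rows at all, while the key-first layout finds the match, which mirrors why a test whose column 0 is non-null on both sides would not catch the bug.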

## Fix

Locate the join-key columns by name and test those (the target half carries the
same columns in the same order, offset by `right_offset`):

```rust
let source_key_cols = self.params.on.iter()
    .map(|key| combined_batch.schema().index_of(key))...;
let target_key_cols = source_key_cols.iter().map(|c| c + right_offset)...;
let in_left  = Self::not_all_null(combined_batch, &source_key_cols)?;
let in_right = Self::not_all_null(combined_batch, &target_key_cols)?;
```

`not_all_null` now takes an explicit column-index slice instead of a contiguous
`(offset, len)` range.
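
The name-based lookup can be sketched in the same simplified way. This is an illustration, not the patch itself: `key_column_indices`, `classify_rows`, and the `COMBINED` sample batch are hypothetical names, and the list-of-`(name, values)` layout stands in for a real record batch. It assumes, as the description states, that the target half repeats the source columns in the same order starting at `right_offset`.

```python
def not_all_null(columns, col_indices):
    """Per-row mask: True where any of the explicitly listed columns is non-null."""
    num_rows = len(columns[0][1])
    return [
        any(columns[i][1][row] is not None for i in col_indices)
        for row in range(num_rows)
    ]


def key_column_indices(columns, right_offset, on):
    """Resolve join keys by name in the source half, then shift for the target half."""
    source_names = [name for name, _ in columns[:right_offset]]
    # list.index raises ValueError if a key is missing from the source columns.
    source_key_cols = [source_names.index(key) for key in on]
    target_key_cols = [c + right_offset for c in source_key_cols]
    return source_key_cols, target_key_cols


def classify_rows(columns, right_offset, on):
    """Label each joined row by which side(s) supplied a non-null join key."""
    source_key_cols, target_key_cols = key_column_indices(columns, right_offset, on)
    in_left = not_all_null(columns, source_key_cols)
    in_right = not_all_null(columns, target_key_cols)
    labels = []
    for left, right in zip(in_left, in_right):
        if left and right:
            labels.append("both")
        elif left:
            labels.append("source")
        elif right:
            labels.append("target")
        else:
            labels.append("neither")
    return labels


# Hypothetical joined batch: the all-null vector column leads both halves.
COMBINED = [
    ("vector", [None, None, None]),
    ("path", ["a", "z", None]),       # source join key at index 1
    ("status", ["new", "new", None]),
    ("vector", [None, None, None]),   # target half starts at right_offset = 3
    ("path", ["a", None, "b"]),       # target join key at index 4
    ("status", ["old", None, "old"]),
]

print(key_column_indices(COMBINED, right_offset=3, on=["path"]))
print(classify_rows(COMBINED, right_offset=3, on=["path"]))
```

Passing explicit indices rather than an `(offset, len)` range also covers composite keys whose columns are not adjacent, since each key is resolved independently.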

## Tests

Added `test_repro_3515_partial_schema_fully_indexed`, parameterized over storage
versions V2_0 / V2_1 / V2_2, mirroring the issue (all-null leading vector column,
scalar index covering every fragment, partial-schema update). It fails on `main`
(0 updates) and passes with the fix.

All 143 tests in the `merge_insert` module pass; `cargo fmt --all --check` and
`cargo clippy -p lance` are clean.

github-actions bot added the bug label Jun 12, 2026

codecov bot commented Jun 12, 2026

Codecov Report

❌ Patch coverage is 70.83333% with 7 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| rust/lance/src/dataset/write/merge_insert.rs | 70.83% | 2 Missing and 5 partials ⚠️ |
