Skip to content

SDSTOR-21465: scrubber phase 1#413

Open
JacksonYao287 wants to merge 2 commits into
eBay:stable/v4.xfrom
JacksonYao287:scrubber-phase-1
Open

SDSTOR-21465: scrubber phase 1#413
JacksonYao287 wants to merge 2 commits into
eBay:stable/v4.xfrom
JacksonYao287:scrubber-phase-1

Conversation

@JacksonYao287

Copy link
Copy Markdown
Collaborator

No description provided.

@codecov-commenter

codecov-commenter commented May 7, 2026

Copy link
Copy Markdown

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 50.59367% with 749 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (stable/v4.x@e1c23e1). Learn more about missing BASE report.

Files with missing lines Patch % Lines
src/lib/homestore_backend/scrub_manager.cpp 56.85% 343 Missing and 85 partials ⚠️
src/lib/homestore_backend/hs_http_manager.cpp 0.42% 237 Missing ⚠️
src/lib/homestore_backend/hs_pg_manager.cpp 70.24% 22 Missing and 14 partials ⚠️
src/lib/homestore_backend/hs_http_manager.hpp 0.00% 20 Missing ⚠️
src/lib/homestore_backend/scrub_manager.hpp 72.97% 20 Missing ⚠️
...ib/homestore_backend/replication_state_machine.cpp 81.25% 2 Missing and 1 partial ⚠️
src/lib/homestore_backend/MPMCPriorityQueue.hpp 94.59% 1 Missing and 1 partial ⚠️
src/lib/homestore_backend/hs_homeobject.cpp 66.66% 0 Missing and 2 partials ⚠️
src/lib/homestore_backend/hs_shard_manager.cpp 88.88% 1 Missing ⚠️
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.
Additional details and impacted files
@@              Coverage Diff               @@
##             stable/v4.x     #413   +/-   ##
==============================================
  Coverage               ?   52.82%           
==============================================
  Files                  ?       39           
  Lines                  ?     6893           
  Branches               ?      942           
==============================================
  Hits                   ?     3641           
  Misses                 ?     2849           
  Partials               ?      403           

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@JacksonYao287 JacksonYao287 force-pushed the scrubber-phase-1 branch 3 times, most recently from b25a7a3 to 83d0375 Compare May 8, 2026 08:56
"reconcile check: shard {} confirmed absent on peer {}, removing from "
"existence-tracking set",
shard_id, peer_id);
scrub_report->remove_shard_existence_from_peer(shard_id, peer_id);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there may be an iterator invalidation risk here. missing_shards is returned by const reference, so this loop is walking the original map. If remove_shard_existence_from_peer() erases the current shard_id from that same map, the current range-for element / iterator could be invalidated.

Might be safer to iterate over a snapshot copy, or apply removals after the traversal.

@JacksonYao287 JacksonYao287 May 15, 2026

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice finding, but after checking the code, seems it is safe.

line 1439, use copy constructor to get a new instance missing_shards (this is just like the snapshot you metioned).

const auto missing_shards = scrub_report->get_missing_shard_ids();

so, actually , the behavior is we iterate on a copied new instance of missing_shards( which is never changed), and the removal happens in the source missing_shards( inside scrub_report). since they are totally different missing_shards, so it`s totally safe.

pls correct me if I am wrong

@JacksonYao287 JacksonYao287 force-pushed the scrubber-phase-1 branch 2 times, most recently from 1cec0d0 to 96736e5 Compare May 17, 2026 03:49
@JacksonYao287 JacksonYao287 force-pushed the scrubber-phase-1 branch 6 times, most recently from fe2f1c0 to dfdb099 Compare June 8, 2026 06:27
}

void ScrubManager::add_scrub_req(std::shared_ptr< scrub_req > req) {
m_scrub_req_executor->add([this, req = std::move(req)]() { handle_scrub_req(req); });

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

m_scrub_req_executor can be null

@xiaoxichen xiaoxichen left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The NPE bug should be fixed before merging, other part LGTM.

The NPE can be triggered if some SM enabled scrubbing during config change/upgrade, then causing cluster wide crashing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants