SDSTOR-21465: scrubber phase 1#413
Conversation
0f155d8 to
f3cc39a
Compare
|
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## stable/v4.x #413 +/- ##
==============================================
Coverage ? 52.82%
==============================================
Files ? 39
Lines ? 6893
Branches ? 942
==============================================
Hits ? 3641
Misses ? 2849
Partials ? 403 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
b25a7a3 to
83d0375
Compare
83d0375 to
fdf3a0c
Compare
| "reconcile check: shard {} confirmed absent on peer {}, removing from " | ||
| "existence-tracking set", | ||
| shard_id, peer_id); | ||
| scrub_report->remove_shard_existence_from_peer(shard_id, peer_id); |
There was a problem hiding this comment.
I think there may be an iterator invalidation risk here. missing_shards is returned by const reference, so this loop is walking the original map. If remove_shard_existence_from_peer() erases the current shard_id from that same map, the current range-for element / iterator could be invalidated.
Might be safer to iterate over a snapshot copy, or apply removals after the traversal.
There was a problem hiding this comment.
nice finding, but after checking the code, seems it is safe.
line 1439, use copy constructor to get a new instance missing_shards (this is just like the snapshot you metioned).
const auto missing_shards = scrub_report->get_missing_shard_ids();
so, actually , the behavior is we iterate on a copied new instance of missing_shards( which is never changed), and the removal happens in the source missing_shards( inside scrub_report). since they are totally different missing_shards, so it`s totally safe.
pls correct me if I am wrong
1cec0d0 to
96736e5
Compare
fe2f1c0 to
dfdb099
Compare
dfdb099 to
4ac2c8a
Compare
| } | ||
|
|
||
| void ScrubManager::add_scrub_req(std::shared_ptr< scrub_req > req) { | ||
| m_scrub_req_executor->add([this, req = std::move(req)]() { handle_scrub_req(req); }); |
There was a problem hiding this comment.
m_scrub_req_executor can be null
xiaoxichen
left a comment
There was a problem hiding this comment.
The NPE bug should be fixed before merging, other part LGTM.
The NPE can be triggered if some SM enabled scrubbing during config change/upgrade, then causing cluster wide crashing.
No description provided.