
Establish a mutex-like claim on the hold cluster before MRW scanning #79


Merged: 3 commits, Jul 15, 2025

Conversation

@karalekas (Member) commented Jul 13, 2025

The original syndrome was an indefinitely repeating reweight by zero, e.g.:

2675.: [#<D-S 573423>] got HOLD 0 #<23776>--->#<22081> from #<23776> w/ root-bucket (#<22120>)
2675.: [#<D-S 573532>] got HOLD 0 #<19540>--->#<19560> from #<19575> w/ root-bucket (#<19585>)
2762.: [#<D-S 586941>] got HOLD 0 #<19585>--->#<19562> from #<19585> w/ root-bucket (#<19575> #<22118>)
2923.: [#<D-S 586941>] reweighting roots (#<22118> #<18122> #<14973> #<19570> #<25451> #<22120> #<23776> #<15015> #<12197> #<18154> #<15017> #<22132> #<19575> #<19585>) by 0
2940.: [#<D-S 586941>] closing with success
2958.: [#<D-S 615460>] got HOLD 0 #<19585>--->#<19562> from #<19585> w/ root-bucket (#<19575> #<22118>)
3073.: [#<D-S 615460>] reweighting roots (#<22118> #<18122> #<14973> #<19570> #<25451> #<22120> #<23776> #<15015> #<12197> #<18154> #<15017> #<22132> #<19575> #<19585>) by 0
3091.: [#<D-S 615460>] closing with success
3151.: [#<D-S 573532>] reweighting roots (#<19585> #<22132> #<15017> #<18154> #<12197> #<15015> #<23776> #<22120> #<25451> #<19570> #<14973> #<18122> #<22118> #<19575>) by 0
3181.: [#<D-S 573532>] closing with success
3192.: [#<D-S 646635>] got HOLD 0 #<15015>--->#<19592> from #<15015> w/ root-bucket (#<12197> #<22118> #<18154> #<15017>)
3192.: [#<D-S 646665>] got HOLD 0 #<19540>--->#<19560> from #<19575> w/ root-bucket (#<19585>)
3302.: [#<D-S 646665>] reweighting roots (#<19585> #<22132> #<15017> #<18154> #<12197> #<15015> #<23776> #<22120> #<25451> #<19570> #<14973> #<18122> #<22118> #<19575>) by 0
3334.: [#<D-S 646665>] closing with success
3418.: [#<D-S 676192>] got HOLD 0 #<19585>--->#<19562> from #<19585> w/ root-bucket (#<19575> #<22118>)
3445.: [#<D-S 646635>] reweighting roots (#<18154> #<18122> #<19585> #<19575> #<14973> #<19570> #<25451> #<22120> #<23776> #<22132> #<22118> #<12197> #<15017> #<15015>) by 0
3464.: [#<D-S 646635>] closing with success
3544.: [#<D-S 692400>] got HOLD 0 #<15015>--->#<19592> from #<15015> w/ root-bucket (#<12197> #<22118> #<18154> #<15017>)
3581.: [#<D-S 676192>] reweighting roots (#<22118> #<18122> #<14973> #<19570> #<25451> #<22120> #<23776> #<15015> #<12197> #<18154> #<15017> #<22132> #<19575> #<19585>) by 0
3615.: [#<D-S 676192>] closing with success
3625.: [#<D-S 703377>] got HOLD 0 #<22132>--->#<23786> from #<22132> w/ root-bucket (#<15017> #<25451> #<22118> #<22120> #<23776>)
3743.: [#<D-S 703377>] reweighting roots (#<25451> #<22120> #<22118> #<18122> #<19585> #<19575> #<14973> #<19570> #<15017> #<18154> #<12197> #<15015> #<23776> #<22132>) by 0
3777.: [#<D-S 703377>] closing with success
3908.: [#<D-S 692400>] reweighting roots (#<18154> #<18122> #<19585> #<19575> #<14973> #<19570> #<25451> #<22120> #<23776> #<22132> #<22118> #<12197> #<15017> #<15015>) by 0
3945.: [#<D-S 692400>] closing with success
4072.: [#<D-S 573423>] reweighting roots (#<22120> #<22118> #<18122> #<19585> #<19575> #<14973> #<19570> #<15017> #<18154> #<12197> #<15015> #<25451> #<22132> #<23776>) by 0
4119.: [#<D-S 573423>] closing with success
...

After telling the supervisor to abort in this case, we still get a livelock. We first explored scanning wider next time (via injecting SCAN-LOOP t into the source-root's command stack) so that the time between scan interruptions increases, giving a heavily-interrupted root time to get to the end of its own self-initiated scan. This, however, only makes the livelock less likely; it doesn't eliminate it completely.

The true "root cause" is that, as the problem graph gets denser, hold clusters get bigger, and we then get a combinatorial explosion of scan-interrupts that starves some of the roots and results in a livelock-like scenario.

Thus, we need a way to ensure that only one root in the hold cluster scans the cluster. However, we can't use the regular lock, because we also need to lock the potential target-root and its cluster once we reach the inner critical section, and either of the following lock-based approaches has flaws:

  1. if we just extend the lock, we open ourselves up to livelock between the cluster and the target
  2. if we instead unlock the cluster and then re-lock the cluster and the target, we avoid (1), but open ourselves up to livelock within the cluster (another root in our cluster could get to the first lock before we get to the second lock, etc.)
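A toy, single-threaded walk-through of flaw (2), with invented names: root A releases the cluster lock intending to immediately re-acquire the cluster plus the target, but root B acquires the cluster lock during the gap, so A must back off and retry; B can then do the same to A's next attempt, and the two can alternate indefinitely.

```python
# Hypothetical try-lock model (not the project's actual lock API) showing
# the race window opened by unlock-then-relock.
class Lock:
    def __init__(self):
        self.holder = None  # name of the current holder, or None

    def try_acquire(self, who):
        if self.holder is None:
            self.holder = who
            return True
        return False

    def release(self, who):
        if self.holder == who:
            self.holder = None

cluster = Lock()

cluster.try_acquire("A")       # A holds the cluster lock and wants to
cluster.release("A")           # widen to cluster+target, so it unlocks...
cluster.try_acquire("B")       # ...and B grabs the cluster lock first
ok = cluster.try_acquire("A")  # A's re-acquire now fails; A must retry
```

After this interleaving, `ok` is `False` and B holds the cluster lock; nothing prevents the roles from swapping on the next round, which is the livelock-within-the-cluster scenario.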

This implies that we need a new mechanism to ensure only one root scans the cluster. Implementing a lower-tier form of BROADCAST-LOCK is one idea, but this is potentially quite hairy (having a higher-tier lock blow away a lower-tier lock without leaving hanging pointers/latches). Luckily, the MULTIREWEIGHT-BROADCAST-SCAN only targets the roots in the hold cluster, and it's okay if the scans get interrupted by an external standard lock, so we can just set/unset a flag on the roots. We call this "claiming" and "releasing" the set of roots in the hold-cluster, and it follows a pattern similar to BROADCAST-LOCK/BROADCAST-UNLOCK. At any point in time, only one root in the cluster can have a claim over the cluster.
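A minimal sketch of the claim/release pattern described above, with all names invented (the actual implementation lives in the Lisp codebase and is message-passing-based, not shared-memory): each root carries a claim flag, and a would-be scanner must set its identity on every root in the hold cluster before it may scan; on any conflict it rolls back and reports failure.

```python
# Hypothetical shared-memory sketch of "claiming" a hold cluster:
# first-come-first-served, at most one claimant holds the whole cluster.
import threading

class Root:
    def __init__(self, name):
        self.name = name
        self.claimant = None                # who currently claims this root
        self._flag_lock = threading.Lock()  # guards flag updates only

    def try_claim(self, claimant):
        with self._flag_lock:
            if self.claimant is None:
                self.claimant = claimant
                return True
            return False

    def release(self, claimant):
        with self._flag_lock:
            if self.claimant == claimant:
                self.claimant = None

def claim_cluster(cluster, claimant):
    """Try to claim every root in the cluster; on any failure, roll back
    the partial claim and report failure so the caller can back off."""
    claimed = []
    for root in cluster:
        if root.try_claim(claimant):
            claimed.append(root)
        else:
            for r in claimed:
                r.release(claimant)
            return False
    return True

def release_cluster(cluster, claimant):
    for root in cluster:
        root.release(claimant)
```

Because `try_claim` never blocks, a failed claimant simply retries later instead of waiting on a lock, which is what makes this safe to run underneath the regular (higher-tier) lock: an external standard lock can interrupt a claimed scan without leaving anything wedged.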

@karalekas karalekas changed the title If best MRW scan rec has weight zero, abort + scan wider next time Establish a mutex-like claim on the hold cluster before MRW scanning Jul 15, 2025
@ecpeterson (Contributor) left a comment


I think I feel OK about this. In just the 12h since we spoke, I've already forgotten some of the context — e.g., I can't tell where SCAN interruption fighting has disappeared to — but then again I could barely keep it in my head in the moment. So, this is a low quality review, but it is also a ✅

@ecpeterson (Contributor)

When you were first describing this as "another level of lock", I was most worried about more message non-response guards. I'm very pleased to see that that sort of thing is absent from the ultimate formulation.

@karalekas karalekas merged commit f6331a9 into main Jul 15, 2025
1 check passed
@karalekas karalekas deleted the hold-scan-loop-t branch July 15, 2025 15:48