
Establish a mutex-like claim on the hold cluster before MRW scanning #79


Merged: 3 commits, Jul 15, 2025

Conversation

@karalekas (Member) commented Jul 13, 2025

The original syndrome was an indefinitely repeating reweight by zero, e.g.:

2675.: [#<D-S 573423>] got HOLD 0 #<23776>--->#<22081> from #<23776> w/ root-bucket (#<22120>)
2675.: [#<D-S 573532>] got HOLD 0 #<19540>--->#<19560> from #<19575> w/ root-bucket (#<19585>)
2762.: [#<D-S 586941>] got HOLD 0 #<19585>--->#<19562> from #<19585> w/ root-bucket (#<19575> #<22118>)
2923.: [#<D-S 586941>] reweighting roots (#<22118> #<18122> #<14973> #<19570> #<25451> #<22120> #<23776> #<15015> #<12197> #<18154> #<15017> #<22132> #<19575> #<19585>) by 0
2940.: [#<D-S 586941>] closing with success
2958.: [#<D-S 615460>] got HOLD 0 #<19585>--->#<19562> from #<19585> w/ root-bucket (#<19575> #<22118>)
3073.: [#<D-S 615460>] reweighting roots (#<22118> #<18122> #<14973> #<19570> #<25451> #<22120> #<23776> #<15015> #<12197> #<18154> #<15017> #<22132> #<19575> #<19585>) by 0
3091.: [#<D-S 615460>] closing with success
3151.: [#<D-S 573532>] reweighting roots (#<19585> #<22132> #<15017> #<18154> #<12197> #<15015> #<23776> #<22120> #<25451> #<19570> #<14973> #<18122> #<22118> #<19575>) by 0
3181.: [#<D-S 573532>] closing with success
3192.: [#<D-S 646635>] got HOLD 0 #<15015>--->#<19592> from #<15015> w/ root-bucket (#<12197> #<22118> #<18154> #<15017>)
3192.: [#<D-S 646665>] got HOLD 0 #<19540>--->#<19560> from #<19575> w/ root-bucket (#<19585>)
3302.: [#<D-S 646665>] reweighting roots (#<19585> #<22132> #<15017> #<18154> #<12197> #<15015> #<23776> #<22120> #<25451> #<19570> #<14973> #<18122> #<22118> #<19575>) by 0
3334.: [#<D-S 646665>] closing with success
3418.: [#<D-S 676192>] got HOLD 0 #<19585>--->#<19562> from #<19585> w/ root-bucket (#<19575> #<22118>)
3445.: [#<D-S 646635>] reweighting roots (#<18154> #<18122> #<19585> #<19575> #<14973> #<19570> #<25451> #<22120> #<23776> #<22132> #<22118> #<12197> #<15017> #<15015>) by 0
3464.: [#<D-S 646635>] closing with success
3544.: [#<D-S 692400>] got HOLD 0 #<15015>--->#<19592> from #<15015> w/ root-bucket (#<12197> #<22118> #<18154> #<15017>)
3581.: [#<D-S 676192>] reweighting roots (#<22118> #<18122> #<14973> #<19570> #<25451> #<22120> #<23776> #<15015> #<12197> #<18154> #<15017> #<22132> #<19575> #<19585>) by 0
3615.: [#<D-S 676192>] closing with success
3625.: [#<D-S 703377>] got HOLD 0 #<22132>--->#<23786> from #<22132> w/ root-bucket (#<15017> #<25451> #<22118> #<22120> #<23776>)
3743.: [#<D-S 703377>] reweighting roots (#<25451> #<22120> #<22118> #<18122> #<19585> #<19575> #<14973> #<19570> #<15017> #<18154> #<12197> #<15015> #<23776> #<22132>) by 0
3777.: [#<D-S 703377>] closing with success
3908.: [#<D-S 692400>] reweighting roots (#<18154> #<18122> #<19585> #<19575> #<14973> #<19570> #<25451> #<22120> #<23776> #<22132> #<22118> #<12197> #<15017> #<15015>) by 0
3945.: [#<D-S 692400>] closing with success
4072.: [#<D-S 573423>] reweighting roots (#<22120> #<22118> #<18122> #<19585> #<19575> #<14973> #<19570> #<15017> #<18154> #<12197> #<15015> #<25451> #<22132> #<23776>) by 0
4119.: [#<D-S 573423>] closing with success
...

After telling the supervisor to abort in this case, we still get a livelock. We first explored scanning wider next time (via injecting SCAN-LOOP t into the source-root's command stack) so that the time between scan interruptions increases, giving a heavily-interrupted root time to get to the end of its own self-initiated scan. This, however, only makes the livelock less likely; it doesn't eliminate it completely.

The true "root cause" is that, as the problem graph gets denser, hold clusters get bigger, and we then get a combinatorial explosion of scan-interrupts that starves some of the roots and results in a livelock-like scenario.

Thus, we need a way to ensure that only one root in the hold cluster scans the cluster. However, we can't use the regular lock, because we also need to lock the potential target-root and its cluster once we reach the inner critical section, and either of the following lock-based approaches has flaws:

  1. if we just extend the lock, we open ourselves up to livelock between the cluster and the target
  2. if we instead unlock the cluster and then re-lock the cluster and the target, we avoid (1), but open ourselves up to livelock within the cluster (another root in our cluster could get to the first lock before we get to the second lock, etc.)
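A toy, single-threaded walk-through of flaw (2), with invented names: root A releases the cluster lock intending to immediately re-acquire the cluster plus the target, but root B acquires the cluster lock during the gap, so A must back off and retry; B can then do the same to A's next attempt, and the two can alternate indefinitely.

```python
# Hypothetical try-lock model (not the project's actual lock API) showing
# the race window opened by unlock-then-relock.
class Lock:
    def __init__(self):
        self.holder = None  # name of the current holder, or None

    def try_acquire(self, who):
        if self.holder is None:
            self.holder = who
            return True
        return False

    def release(self, who):
        if self.holder == who:
            self.holder = None

cluster = Lock()

cluster.try_acquire("A")       # A holds the cluster lock and wants to
cluster.release("A")           # widen to cluster+target, so it unlocks...
cluster.try_acquire("B")       # ...and B grabs the cluster lock first
ok = cluster.try_acquire("A")  # A's re-acquire now fails; A must retry
```

After this interleaving, `ok` is `False` and B holds the cluster lock; nothing prevents the roles from swapping on the next round, which is the livelock-within-the-cluster scenario.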

This implies that we need a new mechanism to ensure only one root scans the cluster. Implementing a lower-tier form of BROADCAST-LOCK is one idea, but this is potentially quite hairy (having a higher-tier lock blow away a lower-tier lock without leaving hanging pointers/latches). Luckily, the MULTIREWEIGHT-BROADCAST-SCAN only targets the roots in the hold cluster, and it's okay if the scans get interrupted by an external standard lock, so we can just set/unset a flag on the roots. We call this "claiming" and "releasing" the set of roots in the hold-cluster, and it follows a pattern similar to BROADCAST-LOCK/BROADCAST-UNLOCK. At any point in time, only one root in the cluster can have a claim over the cluster.
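A minimal sketch of the claim/release pattern described above, with all names invented (the actual implementation lives in the Lisp codebase and is message-passing-based, not shared-memory): each root carries a claim flag, and a would-be scanner must set its identity on every root in the hold cluster before it may scan; on any conflict it rolls back and reports failure.

```python
# Hypothetical shared-memory sketch of "claiming" a hold cluster:
# first-come-first-served, at most one claimant holds the whole cluster.
import threading

class Root:
    def __init__(self, name):
        self.name = name
        self.claimant = None                # who currently claims this root
        self._flag_lock = threading.Lock()  # guards flag updates only

    def try_claim(self, claimant):
        with self._flag_lock:
            if self.claimant is None:
                self.claimant = claimant
                return True
            return False

    def release(self, claimant):
        with self._flag_lock:
            if self.claimant == claimant:
                self.claimant = None

def claim_cluster(cluster, claimant):
    """Try to claim every root in the cluster; on any failure, roll back
    the partial claim and report failure so the caller can back off."""
    claimed = []
    for root in cluster:
        if root.try_claim(claimant):
            claimed.append(root)
        else:
            for r in claimed:
                r.release(claimant)
            return False
    return True

def release_cluster(cluster, claimant):
    for root in cluster:
        root.release(claimant)
```

Because `try_claim` never blocks, a failed claimant simply retries later instead of waiting on a lock, which is what makes this safe to run underneath the regular (higher-tier) lock: an external standard lock can interrupt a claimed scan without leaving anything wedged.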

@karalekas karalekas changed the title If best MRW scan rec has weight zero, abort + scan wider next time Establish a mutex-like claim on the hold cluster before MRW scanning Jul 15, 2025
@ecpeterson (Contributor) left a comment


I think I feel OK about this. In just the 12h since we spoke, I've already forgotten some of the context — e.g., I can't tell where SCAN interruption fighting has disappeared to — but then again I could barely keep it in my head in the moment. So, this is a low quality review, but it is also a ✅

@ecpeterson (Contributor)

When you were first describing this as "another level of lock", I was most worried about more message non-response guards. I'm very pleased to see that that sort of thing is absent from the ultimate formulation.

@karalekas karalekas merged commit f6331a9 into main Jul 15, 2025
1 check passed
@karalekas karalekas deleted the hold-scan-loop-t branch July 15, 2025 15:48