Establish a mutex-like claim on the hold cluster before MRW scanning #79
The original syndrome was a reweight by zero indefinitely, e.g.:
After telling the supervisor to abort in this case, we still get a livelock. We first explored scanning wider next time (via injecting `SCAN-LOOP t` into the source-root's command stack) so that the time between scan-interruptions increases, giving a heavily-interrupted root time to get to the end of its own self-initiated scan. This, however, just makes the livelock less likely but doesn't eliminate it completely.

The true "root cause" is that, as the problem graph gets denser, hold clusters get bigger, and then we get a combinatorial explosion of scan-interrupts that starves some of the roots / results in a livelock-like scenario.
Thus, we need a way to ensure that only one root in the hold cluster scans the cluster. However, we can't use the regular lock, because we also need to lock the potential target-root and its cluster once we get to the inner critical section, and each of the following lock-based approaches has flaws:
This implies that we need a new mechanism to ensure only one root scans the cluster. Implementing a lower-tier form of `BROADCAST-LOCK` is one idea, but this is potentially quite hairy (having a higher-tier lock blow away a lower-tier lock without leaving hanging pointers/latches). Luckily, the `MULTIREWEIGHT-BROADCAST-SCAN` only targets the roots in the hold cluster, and it's okay if the scans get interrupted by an external standard lock, so we can just set/unset a flag on the roots. We call this "claiming" and "releasing" the set of roots in the hold cluster, and it follows a pattern similar to `BROADCAST-LOCK`/`BROADCAST-UNLOCK`. At any point in time, only one root in the cluster can have a claim over the cluster.
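
To make the claim/release idea concrete, here is a minimal single-process sketch in Python. It is not the code in this PR: the `Root` class, the `claimed_by` field, and the `claim_cluster`/`release_cluster` helpers are hypothetical names chosen for illustration, and the sketch models neither message passing between roots nor interruption by an external standard lock. It only shows the invariant that at most one root holds the claim over a hold cluster at a time.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Root:
    """Hypothetical stand-in for a root; only the claim flag is modeled."""
    name: str
    claimed_by: Optional[str] = None  # name of the claiming root, if any


def claim_cluster(claimant: Root, cluster: List[Root]) -> bool:
    """Try to set the claim flag on every root in the hold cluster.

    All-or-nothing: if any root is already claimed by a different root,
    no flags are changed and the claim fails.
    """
    if any(r.claimed_by not in (None, claimant.name) for r in cluster):
        return False
    for r in cluster:
        r.claimed_by = claimant.name
    return True


def release_cluster(claimant: Root, cluster: List[Root]) -> None:
    """Unset the claim flag on every root that `claimant` currently holds."""
    for r in cluster:
        if r.claimed_by == claimant.name:
            r.claimed_by = None
```

In this sketch, a root would attempt `claim_cluster` before starting its scan and call `release_cluster` once the scan finishes or is abandoned; a root whose claim fails simply does not scan, which is what keeps the number of concurrent scans over a cluster at one.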