Skip to content

[FIX] Dispose StatelessWorkerGrainContext when it contains no more workers #9636

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

ledjon-behluli
Copy link
Contributor

@ledjon-behluli ledjon-behluli commented Aug 4, 2025

When the inspection timer is created, the SW context itself is passed as the state argument. When all workers are destroyed the reference to this from the timer keeps the SW context itself alive from a GC perspective, so dispose wont get called.

This PR enqueues a DisposeAsync command when there are no more workers, as we can consider the SW to be destroyed as well.

fix #9634

Microsoft Reviewers: Open in CodeFlow

@ReubenBond
Copy link
Member

@vrecluse @gfix please take a look

Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR fixes a memory leak issue where StatelessWorkerGrainContext instances were not being properly disposed due to a timer keeping a reference to the context. The fix ensures disposal by explicitly enqueuing a DisposeAsync command when the last worker is destroyed.

Key Changes

  • Adds explicit disposal command when worker count reaches zero
  • Prevents memory leaks by ensuring proper cleanup of StatelessWorkerGrainContext instances

Comment on lines 178 to 179
var completion = new TaskCompletionSource(TaskCreationOptions.RunContinuationsAsynchronously);
EnqueueWorkItem(WorkItemType.DisposeAsync, new DisposeAsyncWorkItemState(completion));
Copy link
Preview

Copilot AI Aug 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The completion variable is created but never used or awaited. Consider either using this TaskCompletionSource to wait for disposal completion or removing it if not needed.

Suggested change
var completion = new TaskCompletionSource(TaskCreationOptions.RunContinuationsAsynchronously);
EnqueueWorkItem(WorkItemType.DisposeAsync, new DisposeAsyncWorkItemState(completion));
EnqueueWorkItem(WorkItemType.DisposeAsync, new DisposeAsyncWorkItemState(null));

Copilot uses AI. Check for mistakes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Neither is correct, for one the DisposeAsyncWorkItemState expects a non-null TCS, and awaiting the tcs.Task would deadlock the loop. Eitherway I hooked a continuation to at least log any potential errors that might occur in the dispose internal method.

Copy link
Contributor

@gfix gfix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks fine to me! I am, however, not familiar enough with how activation/deactivation works to understand why DisposeAsync is not called when the SW grain is deactivated?

{
await _inspectionTimer.DisposeAsync().AsTask()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is the change here needed? what is the benefit of AsTask().ConfigureAwait(...) against former DisposeAsync() -> only to suppress exceptions?

Copy link
Contributor Author

@ledjon-behluli ledjon-behluli Aug 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

only to suppress exceptions

Yes, this is an "just in case" thing so the rest of the method continues

@@ -172,9 +172,20 @@ private async Task RunMessageLoop()

if (_workers.Count == 0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can there be a scenario where _workers.Count reached 0, but in the queue there is an item to process WorkItemType.Message which will then create a new worker? In that case a new worker will be created right after.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

very unlikely since the destruction of the workers comes after the fact of a grain disposing (which means it wont receive more messages), but possible! If that happens, the dispose command will be processed after that message.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The alternative would be to stop the disposal if a worker exists within the SW context, we should not do that, we should respect the call for disposal from outside!

@DeagleGross
Copy link
Member

And a dumb question: there is a StatelessWorkerGrainContext and it is a single GrainContext per GrainAddress right? I am a bit lost with the hierarchy of the classes and why we even should call dispose on the GrainContext (initially i thought it should be shared for multiple grains and should not be disposed).

Can u please elaborate on when a context should be disposed in a normal scenario?

@ledjon-behluli
Copy link
Contributor Author

And a dumb question: there is a StatelessWorkerGrainContext and it is a single GrainContext per GrainAddress right? I am a bit lost with the hierarchy of the classes and why we even should call dispose on the GrainContext (initially i thought it should be shared for multiple grains and should not be disposed).

Can u please elaborate on when a context should be disposed in a normal scenario?

That is correct, there is one SW Context per grain GrainAddress (techically per GrainId-SiloAddress, but does not matter here).
Normal scenario would be the host is shutting down, but we the change (which is a good side-effect) we cleanup the SW if no workers are active. You can think of it like a way of doing a sort of "garbage collection", which is good as we'll release the timer too!

@ledjon-behluli
Copy link
Contributor Author

One thing we could do is to not dispose the SW context if the strategy does NOT specify idle worker removal so it would be aligned fully with previous behavior in all fronts

@ledjon-behluli
Copy link
Contributor Author

One thing we could do is to not dispose the SW context if the strategy does NOT specify idle worker removal so it would be aligned fully with previous behavior in all fronts

did this!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

(Orleans 9.1.2) Stateless woker grain potential thread pool tasks leak?
4 participants