Description
Describe the bug
SonataFlow cannot handle callback events for a workflow instance when multiple pods/instances of SonataFlow are running.
For context, I am using the latest snapshot version of SonataFlow, built manually from the GitHub repositories.
I am deploying my SonataFlow app on a Kubernetes cluster and scaling it horizontally to multiple pods.

My workflow consists of multiple "callback" states. My SonataFlow app orchestrates other services by sending messages to them via a messaging system.
When a callback state runs, it sends a message to another service and then waits for a callback event.
Once the other service finishes its work, it sends a request carrying the corresponding event for that callback state so the workflow can proceed.
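For illustration, the callback request sent by the other service could carry a CloudEvent like the one built below. This is only a sketch under assumptions: the event type, the source, and the data are hypothetical, and the use of the `kogitoprocrefid` extension attribute to correlate the event with the waiting workflow instance is my assumption, not something taken from the attached example.

```python
import json
import uuid
from datetime import datetime, timezone


def build_callback_event(process_instance_id, event_type, data, source="/other-service"):
    """Build a CloudEvents 1.0 payload (structured JSON mode) for a callback.

    All argument values are supplied by the caller; the names used in the
    usage example below are hypothetical.
    """
    return {
        "specversion": "1.0",
        "id": str(uuid.uuid4()),
        "source": source,
        "type": event_type,
        "time": datetime.now(timezone.utc).isoformat(),
        "datacontenttype": "application/json",
        # Assumption: correlation with the waiting workflow instance is done
        # through this extension attribute.
        "kogitoprocrefid": process_instance_id,
        "data": data,
    }


# Hypothetical usage: serialize the event before sending it to the workflow.
event = build_callback_event("abc-123", "callback_event_type", {"result": "ok"})
body = json.dumps(event)
```

The serialized `body` would then be sent to whatever endpoint or topic the callback state listens on.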
Sometimes, after publishing the message, the workflow has still not finished processing the original request that started it,
and during that in-between time the other service sends the callback event to the workflow.
With just 1 pod/instance of SonataFlow running, what I see is that when SonataFlow receives the callback event call, it waits until the previous request finishes and then processes the callback event request.
The workflow can then proceed through the next states until it completes.
But when my SonataFlow deployment is scaled horizontally to multiple pods/instances, and the callback event call reaches a pod other than the one that initialized the first state and published the message, the receiving pod cannot continue with the next states of the workflow. The callback event request nevertheless responds with a 202 HTTP code.
As a result, the messaging system considers the message for the callback event request successfully delivered and acknowledges it.
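To show why the 202 matters, here is a minimal sketch of the acknowledgement rule I am assuming on the messaging side: any 2xx response acknowledges the message, anything else leaves it for redelivery. The function name is hypothetical and this is not the actual code of the messaging system.

```python
def should_acknowledge(http_status: int) -> bool:
    """Return True when the delivery counts as successful.

    Assumption: the messaging system acknowledges a message on any
    2xx response and retries it on every other status code.
    """
    return 200 <= http_status < 300


# A 202 from the second pod acknowledges the message, so the event is lost
# even though the workflow did not move on. A 500 would keep it for a retry.
accepted = should_acknowledge(202)
failed = should_acknowledge(500)
```

Under this rule, a 5xx response from the pod that cannot process the event would give the message a chance to be redelivered.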

Expected behavior
From my point of view, the callback event request should fail with a 500 HTTP response, because the pod was not able to process the event.
Actual behavior
The callback event request to the second pod still responds with a 202 HTTP code, meaning the event was consumed, but the workflow does not transition to the next state.
How to Reproduce?
To reproduce, I have prepared an example zip file containing the workflows and a README with the steps.
Output of uname -a or ver
No response
Output of java -version
17
GraalVM version (if different from Java)
No response
Kogito version or git rev (or at least Quarkus version if you are using Kogito via Quarkus platform BOM)
No response
Build tool (i.e. output of mvnw --version or gradlew --version)
No response
Additional information
No response