Callback event to multiple instances of SonataFlow #4002

@NguyenDinhThien-future-aavn

Description

Describe the bug

I am having an issue with SonataFlow not being able to handle callback events for a workflow instance when multiple pods/instances of SonataFlow are running.

For context, I am using the latest snapshot version of SonataFlow, built manually from the GitHub repositories.
I am deploying my SonataFlow app on a Kubernetes cluster and horizontally scaling it to multiple pods.

My workflow consists of multiple "callback" states. My SonataFlow app orchestrates other services by sending messages to them via a messaging system.
When a callback state in my workflow sends its message to another service, the callback state then waits for a callback event.
Once the other service finishes its work, it sends a request with the corresponding event for that callback state so the workflow can proceed.

Sometimes, after the workflow publishes the message, it has still not finished processing the original request that started the workflow,
and during that in-between time the other service sends the callback event to the workflow.

With just 1 pod/instance of SonataFlow running, what I see is that when SonataFlow receives the callback event call, it waits until the previous request finishes and then processes the callback event request.
As a result, the workflow can proceed through the next states until it completes.

But when my SonataFlow deployment scales horizontally to multiple pods/instances, and the callback event call reaches a pod other than the one that initialized the first state and published the message, the pod that receives the callback event request cannot continue with the next states of the workflow. The callback event request nevertheless responds with a 202 HTTP code.
As a result, the messaging system considers the message for the callback event request successful and acknowledges it.
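
To make the timing easier to follow, here is a toy model of the symptom in Python. It is not SonataFlow code and it does not claim to show the real internal mechanism, which I have not confirmed. `SharedStore`, `Pod`, and `run` are hypothetical names. The model assumes that a pod only parks a callback for an instance it is itself still processing, and that any other pod looks for a persisted waiting instance, finds none, and still answers 202.

```python
from dataclasses import dataclass, field


@dataclass
class SharedStore:
    """Hypothetical persistence shared by all pods."""
    waiting: dict = field(default_factory=dict)   # instance id -> awaited event type
    advanced: list = field(default_factory=list)  # instances that moved to the next state


@dataclass
class Pod:
    name: str
    store: SharedStore
    in_flight: set = field(default_factory=set)   # instances this pod is still processing
    deferred: list = field(default_factory=list)  # callbacks parked until that work ends

    def start(self, instance_id):
        # The start request is running: the message is published, nothing is persisted yet.
        self.in_flight.add(instance_id)

    def finish_start(self, instance_id, event_type):
        # The start request ends: persist the wait state, then replay parked callbacks.
        self.store.waiting[instance_id] = event_type
        self.in_flight.discard(instance_id)
        parked, self.deferred = self.deferred, []
        for parked_id, parked_type in parked:
            self.on_callback(parked_id, parked_type)

    def on_callback(self, instance_id, event_type):
        if instance_id in self.in_flight:
            # Same pod as the starter: wait for the start request to finish.
            self.deferred.append((instance_id, event_type))
        elif self.store.waiting.get(instance_id) == event_type:
            del self.store.waiting[instance_id]
            self.store.advanced.append(instance_id)
        # Otherwise no waiting instance is visible and the event is silently dropped.
        return 202  # the status is 202 in every branch, as observed


def run(callback_hits_starter_pod):
    store = SharedStore()
    pod_a, pod_b = Pod("a", store), Pod("b", store)
    pod_a.start("wf-1")
    target = pod_a if callback_hits_starter_pod else pod_b
    status = target.on_callback("wf-1", "callback")  # arrives before the start finishes
    pod_a.finish_start("wf-1", "callback")
    return status, "wf-1" in store.advanced
```

In this model `run(True)` returns `(202, True)` and `run(False)` returns `(202, False)`: the caller sees the same status in both cases, but only the first one advances the workflow.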

Expected behavior

From my point of view, the callback event request should fail with a 500 HTTP response code, because the pod was not able to process that event.
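
As a rough illustration of why the status code matters, here is a minimal Python sketch of the behavior I would expect. It is not based on SonataFlow internals. `handle_callback`, `deliver_with_retry`, and `demo` are hypothetical names, and the sketch assumes the messaging system redelivers a message whenever the response is not 2xx.

```python
def handle_callback(waiting, instance_id, event_type):
    """Answer 202 only if a waiting instance actually consumed the event, else 500."""
    if waiting.get(instance_id) == event_type:
        del waiting[instance_id]
        return 202
    return 500


def deliver_with_retry(send, max_attempts=3):
    """Hypothetical broker-side loop: redeliver until a 2xx response or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        status = send()
        if 200 <= status < 300:
            return attempt  # acknowledged on this attempt
    return None  # never acknowledged


def demo():
    waiting = {}   # instance id -> awaited event type
    statuses = []

    def send():
        status = handle_callback(waiting, "wf-1", "callback")
        statuses.append(status)
        if status == 500:
            # Simulate the start request finishing before the redelivery arrives.
            waiting["wf-1"] = "callback"
        return status

    return deliver_with_retry(send), statuses
```

Under these assumptions `demo()` returns `(2, [500, 202])`: the first delivery is rejected instead of being lost, and the retry succeeds once the instance is waiting for the event.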

Actual behavior

The callback event request to the 2nd pod still responds with a 202 HTTP code, meaning the event was consumed, but the workflow does not transition to the next state.

How to Reproduce?

To reproduce, I have prepared an example zip file containing the workflows and a readme with the steps to reproduce.

callback-workflow.zip

Output of uname -a or ver

No response

Output of java -version

17

GraalVM version (if different from Java)

No response

Kogito version or git rev (or at least Quarkus version if you are using Kogito via Quarkus platform BOM)

No response

Build tool (ie. output of mvnw --version or gradlew --version)

No response

Additional information

No response
