
Adding support for FSDP2 #109


Open · wants to merge 15 commits into main

Conversation

@rithwik-db (Collaborator) commented Jul 11, 2025

Moving all of the FSDP.summon_full_params call sites to support both FSDP1 and FSDP2 (fully_shard). Unfortunately, equivalent functionality doesn't exist in FSDP2 by default, so I created a workaround that handles the arguments we use within the codebase.
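
For context, here is a rough illustration of the gap being worked around (this is not the PR's actual code, and `fsdp1_model` / `fsdp2_model` are hypothetical placeholders for modules already wrapped with FSDP / fully_shard): FSDP1 ships a context manager for unsharding parameters, while FSDP2 leaves parameters as DTensors with no built-in equivalent.

```python
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.tensor import DTensor

# FSDP1: the built-in context manager gathers full parameters.
with FSDP.summon_full_params(fsdp1_model):
    ...  # parameters are unsharded inside this block

# FSDP2 (fully_shard): parameters are DTensors; there is no built-in
# summon_full_params, so full tensors must be materialized explicitly.
for name, param in fsdp2_model.named_parameters():
    if isinstance(param, DTensor):
        full = param.full_tensor()  # all-gathers the shards
```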

Tests:

  • FSDP1: compose-rl-grpo-test-WHoytM
  • FSDP2: compose-rl-grpo-test-fsdp2-q0Gy6Q <- tested with the changes from Composer's PR

Both runs fail due to an unrelated issue that occurs regardless of these changes (saving checkpoints to UC Volumes), but the training graphs look reasonable (link).

@rithwik-db changed the title from "[WIP] Adding support for FSDP2" to "Adding support for FSDP2" on Jul 17, 2025
@rithwik-db requested a review from bowenyang008 as a code owner on July 18, 2025 04:41
@bowenyang008 (Collaborator) left a comment

After a read (w/o tests), I think the cognitive load here is too high: there are recurse=True/False, FSDP vs. non-FSDP modules, DTensor vs. non-DTensor parameters, etc., and their combinations make it hard to reason about correctness. Here is my suggestion (sketched after the list):

  1. implement summon_full_params in Composer, as it is a quite generic util
  2. only support either FSDP1 or DTensor (not just FSDP2), not a hybrid of the two
  3. if it is FSDP1, use the FSDP1 summon; if there are DTensors, replace each with its full tensor and cache the FQN and tying info for the context swap, as you have done; otherwise do nothing
  4. use this context regardless of whether a module is an FSDP module or not

let me know if this will work or if I missed anything
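
A minimal sketch of what points 1-4 could look like as a single Composer util (the name `summon_full_params` and the overall structure are illustrative assumptions, tied-weight handling is omitted, and this is not the PR's actual implementation):

```python
import contextlib

import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.tensor import DTensor


@contextlib.contextmanager
def summon_full_params(model: nn.Module):
    # Point 3, FSDP1 case: delegate to the built-in context manager.
    if isinstance(model, FSDP):
        with FSDP.summon_full_params(model):
            yield
        return

    # Point 3, DTensor case: replace each DTensor parameter with its full
    # tensor, caching the FQN so it can be swapped back on exit. If the
    # model has no DTensors, nothing is swapped and this is effectively a
    # null context (point 4: usable on FSDP and non-FSDP modules alike).
    originals: dict[str, nn.Parameter] = {}
    for fqn, param in list(model.named_parameters()):
        if isinstance(param, DTensor):
            originals[fqn] = param
            parent_fqn, _, attr = fqn.rpartition('.')
            parent = model.get_submodule(parent_fqn)
            # full_tensor() all-gathers the shards into a regular tensor.
            setattr(parent, attr, nn.Parameter(param.full_tensor()))
    try:
        yield
    finally:
        # Swap the cached DTensor parameters back in.
        for fqn, param in originals.items():
            parent_fqn, _, attr = fqn.rpartition('.')
            parent = model.get_submodule(parent_fqn)
            setattr(parent, attr, param)
```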

@rithwik-db (Collaborator, Author)

@bowenyang008 re: this comment: yeah, this should work; moving it to Composer makes sense, as I can just call it a summon_full_params_fsdp fn. I will say that within compose-rl, if we do want to continue supporting FSDP1 alongside FSDP2, the changes in vllm_utils.py might have to stay the same, since FSDP.summon_full_params() might require the module to be FSDP-wrapped. Not 100% sure though, will check tomorrow.

@bowenyang008 (Collaborator)

> [...] the changes in vllm_utils.py might have to stay the same since FSDP.summon_full_params() might require the module to be FSDP wrapped.

I think this is covered by my 3rd point above: summon_full_params can work on an FSDP1 module or on a module with or without DTensors; in the last case it is basically a null context.
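
In other words, following the sketch above (the model names and `sync_weights_to_vllm` are hypothetical placeholders):

```python
# All three cases go through the same context manager; for a plain
# module with no DTensors the body runs under a null context.
with summon_full_params(fsdp1_wrapped_model):  # FSDP1 summon
    sync_weights_to_vllm(fsdp1_wrapped_model)
with summon_full_params(fully_sharded_model):  # DTensor swap
    sync_weights_to_vllm(fully_sharded_model)
with summon_full_params(plain_model):          # no-op
    sync_weights_to_vllm(plain_model)
```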

@rithwik-db requested a review from bowenyang008 on July 23, 2025 17:02
@bowenyang008 (Collaborator) left a comment

Also LGTM. Before merging, can you run one more e2e test to make sure it produces the same result?

@gupta-abhay (Collaborator) left a comment

LGTM; would prefer merging the Composer PR first rather than having to push arbitrary branches.

@rithwik-db (Collaborator, Author)

> also LGTM, before merge can you run one more e2e test to make sure it produces the same result?

Will merge after the tests pass once the Composer PR goes in, plus the E2E test (but will only merge both once I get GPUs 😭).

Commits:

  • formatting
  • plz format
  • working on testing
  • added new test that fails
  • using dtensor APIs
  • added some more tests
  • formatted
  • finally works smh
  • formatted plz work im begging
  • undid changes to reward modeling since that's breaking on main
  • why doesn't the formatter work smh
  • testing out specific FSDP2 change
  • added another test
  • added some logging
  • wip
  • a comment