
Adding support for FSDP2 #109


Open · wants to merge 15 commits into main

Conversation

@rithwik-db (Collaborator) commented Jul 11, 2025

Moving all of the FSDP.summon_full_params call sites to support both FSDP1 and FSDP2 (fully_shard). Unfortunately, equivalent functionality doesn't exist in FSDP2 by default, so I created a workaround that handles the arguments we use within the codebase.
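
For context, here is a rough illustration of the gap being worked around (this is not the PR's actual code, and `fsdp1_model` / `fsdp2_model` are hypothetical placeholders for modules already wrapped with FSDP / fully_shard): FSDP1 ships a context manager for unsharding parameters, while FSDP2 leaves parameters as DTensors with no built-in equivalent.

```python
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.tensor import DTensor

# FSDP1: the built-in context manager gathers full parameters.
with FSDP.summon_full_params(fsdp1_model):
    ...  # parameters are unsharded inside this block

# FSDP2 (fully_shard): parameters are DTensors; there is no built-in
# summon_full_params, so full tensors must be materialized explicitly.
for name, param in fsdp2_model.named_parameters():
    if isinstance(param, DTensor):
        full = param.full_tensor()  # all-gathers the shards
```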

Tests:

  • FSDP1: compose-rl-grpo-test-WHoytM
  • FSDP2: compose-rl-grpo-test-fsdp2-q0Gy6Q <- tested with the changes from Composer's PR

Both runs fail due to an unrelated issue that occurs regardless of these changes (saving checkpoints to UC Volumes), but the training graphs look reasonable (link).

@rithwik-db changed the title from "[WIP] Adding support for FSDP2" to "Adding support for FSDP2" on Jul 17, 2025
@rithwik-db requested a review from bowenyang008 as a code owner on July 18, 2025 04:41
@bowenyang008 (Collaborator) left a comment

After a read (w/o tests), I think the cognitive load here is too high: there are recurse=True/False, FSDP vs. non-FSDP modules, DTensor vs. non-DTensor parameters, etc., and their combinations make it hard to reason about correctness. Here is my suggestion (sketched after the list):

  1. implement summon_full_params in Composer, as it is a quite generic util
  2. only support either FSDP1 or DTensor (not just FSDP2), not a hybrid of the two
  3. if it is FSDP1, use the FSDP1 summon; if there are DTensors, replace each with its full tensor and cache the FQN and tying info for the context swap, as you have done; otherwise do nothing
  4. use this context regardless of whether a module is an FSDP module or not

let me know if this will work or if I missed anything
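
A minimal sketch of what points 1-4 could look like as a single Composer util (the name `summon_full_params` and the overall structure are illustrative assumptions, tied-weight handling is omitted, and this is not the PR's actual implementation):

```python
import contextlib

import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.tensor import DTensor


@contextlib.contextmanager
def summon_full_params(model: nn.Module):
    # Point 3, FSDP1 case: delegate to the built-in context manager.
    if isinstance(model, FSDP):
        with FSDP.summon_full_params(model):
            yield
        return

    # Point 3, DTensor case: replace each DTensor parameter with its full
    # tensor, caching the FQN so it can be swapped back on exit. If the
    # model has no DTensors, nothing is swapped and this is effectively a
    # null context (point 4: usable on FSDP and non-FSDP modules alike).
    originals: dict[str, nn.Parameter] = {}
    for fqn, param in list(model.named_parameters()):
        if isinstance(param, DTensor):
            originals[fqn] = param
            parent_fqn, _, attr = fqn.rpartition('.')
            parent = model.get_submodule(parent_fqn)
            # full_tensor() all-gathers the shards into a regular tensor.
            setattr(parent, attr, nn.Parameter(param.full_tensor()))
    try:
        yield
    finally:
        # Swap the cached DTensor parameters back in.
        for fqn, param in originals.items():
            parent_fqn, _, attr = fqn.rpartition('.')
            parent = model.get_submodule(parent_fqn)
            setattr(parent, attr, param)
```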

@rithwik-db (Collaborator, Author)

@bowenyang008 re: this comment: yeah, this should work; moving it to Composer makes sense, as I can just call it a summon_full_params_fsdp fn. I will say that within compose-rl, if we do want to continue supporting FSDP1 alongside FSDP2, the changes in vllm_utils.py might have to stay the same, since FSDP.summon_full_params() might require the module to be FSDP-wrapped. Not 100% sure though, will check tomorrow.

@bowenyang008 (Collaborator)

> [...] the changes in vllm_utils.py might have to stay the same since FSDP.summon_full_params() might require the module to be FSDP wrapped.

I think this is covered by my 3rd point above: summon_full_params can work on an FSDP1 module or on a module with or without DTensors; in the last case it is basically a null context.
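
In other words, following the sketch above (the model names and `sync_weights_to_vllm` are hypothetical placeholders):

```python
# All three cases go through the same context manager; for a plain
# module with no DTensors the body runs under a null context.
with summon_full_params(fsdp1_wrapped_model):  # FSDP1 summon
    sync_weights_to_vllm(fsdp1_wrapped_model)
with summon_full_params(fully_sharded_model):  # DTensor swap
    sync_weights_to_vllm(fully_sharded_model)
with summon_full_params(plain_model):          # no-op
    sync_weights_to_vllm(plain_model)
```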

@rithwik-db requested a review from bowenyang008 on July 23, 2025 17:02
@bowenyang008 (Collaborator) left a comment

Also LGTM. Before merging, can you run one more e2e test to make sure it produces the same result?

@gupta-abhay (Collaborator) left a comment

LGTM; would prefer merging the Composer PR first rather than having to push arbitrary branches.

@rithwik-db (Collaborator, Author)

> also LGTM, before merge can you run one more e2e test to make sure it produces the same result?

Will merge after the tests pass once the Composer PR goes in, plus the E2E test (but will only merge both once I get GPUs 😭).

Commits:

  • formatting
  • plz format
  • working on testing
  • added new test that fails
  • using dtensor APIs
  • added some more tests
  • formatted
  • finally works smh
  • formatted plz work im begging
  • undid changes to reward modeling since that's breaking on main
  • why doesn't the formatter work smh
  • testing out specific FSDP2 change
  • added another test
  • added some logging
  • wip
  • a comment