Rename GPU related parameters and split CUDACapability classad in two (min, max) #12416

Merged (1 commit) on Jul 30, 2025

Conversation

khurtado
Contributor

@khurtado khurtado commented Jul 28, 2025

Fixes #11942

Status

Ready

Description

Rename GPU related parameters and split CUDACapability, following #11942 (comment)

A summary of the changes:

  • request_GPUs renamed to request_gpus
  • GPUMemoryMB renamed to DESIRED_GPUMemoryMB
  • CUDACapability replaced by DESIRED_GPUMinimumCapability and DESIRED_GPUMaximumCapability
  • CUDARuntime renamed to DESIRED_GPURuntime

In addition, following the condor_submit documentation (https://htcondor.readthedocs.io/en/latest/man-pages/condor_submit.html#gpus_minimum_capability), we need to:

  • assign DESIRED_GPUMemoryMB to gpus_minimum_memory
  • assign DESIRED_GPUMinimumCapability to gpus_minimum_capability
  • assign DESIRED_GPUMaximumCapability to gpus_maximum_capability
  • assign DESIRED_GPURuntime to gpus_minimum_runtime

HTCondor then generates this ClassAd expression automatically from the macros above:

RequireGPUs  = Capability >= GPUsMinCapability && Capability <= GPUsMaxCapability && GlobalMemoryMb >= GPUsMinMemory && MaxSupportedVersion >= GPUsMinRuntime

This expression is also added to the general Requirements expression automatically.
HTCondor versions released before August 2024 do not append it to the Requirements expression. It works well with HTCondor 24.0.6 (used in production).
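
For illustration, the attribute-to-macro assignment listed above can be sketched as a lookup table. This is a minimal sketch, not the code in this PR: the helper name `gpu_submit_commands` and the sample values are hypothetical, and only the attribute and macro names come from the description.

```python
# Mapping from the description: DESIRED_* job attribute -> condor_submit macro.
GPU_SUBMIT_COMMANDS = {
    "DESIRED_GPUMemoryMB": "gpus_minimum_memory",
    "DESIRED_GPUMinimumCapability": "gpus_minimum_capability",
    "DESIRED_GPUMaximumCapability": "gpus_maximum_capability",
    "DESIRED_GPURuntime": "gpus_minimum_runtime",
}


def gpu_submit_commands(desired):
    """Hypothetical helper: copy each DESIRED_* value that is present to its macro."""
    return {
        command: desired[attribute]
        for attribute, command in GPU_SUBMIT_COMMANDS.items()
        if attribute in desired
    }


# Illustrative values only.
desired = {
    "DESIRED_GPUMemoryMB": "8000",
    "DESIRED_GPUMinimumCapability": "6.0",
    "DESIRED_GPUMaximumCapability": "10.0",
    "DESIRED_GPURuntime": "12.0",
}
commands = gpu_submit_commands(desired)
```

Keeping the mapping in one table means a future rename touches a single place; attributes missing from the job are simply skipped.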

Is it backward compatible (if not, which system does it affect)?

YES

@dmwm-bot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 2 changes in unstable tests
  • Python3 Pylint check: failed
    • 2 warnings and errors that must be fixed
    • 4 warnings
    • 50 comments to review
  • Pycodestyle check: succeeded
    • 16 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/904/artifact/artifacts/PullRequestReport.html

@dmwm-bot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 2 warnings and errors that must be fixed
    • 4 warnings
    • 50 comments to review
  • Pycodestyle check: succeeded
    • 16 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/905/artifact/artifacts/PullRequestReport.html

@dmwm-bot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 2 warnings and errors that must be fixed
    • 4 warnings
    • 50 comments to review
  • Pycodestyle check: succeeded
    • 16 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/906/artifact/artifacts/PullRequestReport.html

@dmwm-bot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 3 changes in unstable tests
  • Python3 Pylint check: failed
    • 2 warnings and errors that must be fixed
    • 4 warnings
    • 50 comments to review
  • Pycodestyle check: succeeded
    • 16 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/907/artifact/artifacts/PullRequestReport.html

@khurtado
Contributor Author

khurtado commented Jul 29, 2025

The classads are injected just fine, but the macros feature does not work because it requires at least HTCondor 23.8.1, and we are using 23.0.3.

[cmst1@vocms0263 /data/dockerMount/srv/wmagent/current/install/JobSubmitter]$ condor_q -limit 1 -af:lr DESIRED_GPUMemoryMB DESIRED_GPUMinimumCapability DESIRED_GPUMaximumCapability DESIRED_GPURuntime
DESIRED_GPUMemoryMB = 8000 DESIRED_GPUMinimumCapability = "6.0" DESIRED_GPUMaximumCapability = "10.0" DESIRED_GPURuntime = "12.0"
[cmst1@vocms0263 /data/dockerMount/srv/wmagent/current/install/JobSubmitter]$ condor_q -limit 1 -af:lr Requirements
Requirements = (stringListMember(TARGET.Arch,REQUIRED_ARCH)) && (TARGET.OpSys == "LINUX") && (TARGET.Disk >= RequestDisk) && (TARGET.Memory >= RequestMemory) && (TARGET.Cpus >= RequestCpus) && (TARGET.GPUs >= RequestGPUs) && (TARGET.HasFileTransfer)

EDIT: Okay, it looks like my testbed agent is the one with 23.0.3 for some reason. Other testbed agents and production agents seem to run 24.0.6, which should be okay. I will test on a different agent.

@khurtado
Contributor Author

@amaltaro

@amaltaro I finished my test. Everything works well with 24.0.6.
This is ready for review.

Here are the classads injected:

$ condor_q -limit 1 -af:hr DESIRED_GPUMemoryMB DESIRED_GPUMinimumCapability DESIRED_GPUMaximumCapability DESIRED_GPURuntime
DESIRED_GPUMemoryMB DESIRED_GPUMinimumCapability DESIRED_GPUMaximumCapability DESIRED_GPURuntime
8000                "6.0"                        "10.0"                       "12.0"

And here is the updated Requirements expression:

$ condor_q -limit 1 -af:hr Requirements
Requirements
(stringListMember(TARGET.Arch,REQUIRED_ARCH)) && (TARGET.OpSys == "LINUX") && (TARGET.Disk >= RequestDisk) && (TARGET.Memory >= RequestMemory) && (TARGET.Cpus >= RequestCpus) && (countMatches(MY.RequireGPUs,TARGET.AvailableGPUs) >= RequestGPUs) && (TARGET.HasFileTransfer)
[cmst1@vocms0193 ~]$ condor_q -limit 1 -af:hr RequireGPUs
RequireGPUs
Capability >= GPUsMinCapability && Capability <= GPUsMaxCapability && GlobalMemoryMb >= GPUsMinMemory && MaxSupportedVersion >= GPUsMinRuntime
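
To make the matching logic concrete, here is a rough Python analogy of what `countMatches(MY.RequireGPUs, TARGET.AvailableGPUs) >= RequestGPUs` checks. It is a sketch under assumptions, not HTCondor's implementation: the function names, the GPU records, and the numeric encoding of `MaxSupportedVersion` are made up for illustration; only the attribute names come from the output above.

```python
def gpu_satisfies(gpu, limits):
    """Mirror the RequireGPUs expression for a single GPU record."""
    return (
        limits["GPUsMinCapability"] <= gpu["Capability"] <= limits["GPUsMaxCapability"]
        and gpu["GlobalMemoryMb"] >= limits["GPUsMinMemory"]
        and gpu["MaxSupportedVersion"] >= limits["GPUsMinRuntime"]
    )


def count_matches(gpus, limits):
    """Count the GPUs on a machine that satisfy the job's limits."""
    return sum(1 for gpu in gpus if gpu_satisfies(gpu, limits))


# Limits corresponding to the DESIRED_* values shown above (converted to numbers).
limits = {
    "GPUsMinCapability": 6.0,
    "GPUsMaxCapability": 10.0,
    "GPUsMinMemory": 8000,
    "GPUsMinRuntime": 12.0,
}

# Hypothetical machine with three GPUs (values are illustrative only).
available_gpus = [
    {"Capability": 8.0, "GlobalMemoryMb": 40000, "MaxSupportedVersion": 12.4},  # matches
    {"Capability": 5.2, "GlobalMemoryMb": 12000, "MaxSupportedVersion": 12.4},  # capability too low
    {"Capability": 7.5, "GlobalMemoryMb": 6000, "MaxSupportedVersion": 12.4},   # not enough memory
]

request_gpus = 1
machine_is_eligible = count_matches(available_gpus, limits) >= request_gpus
```

The point of the per-GPU count is that a machine only matches when enough of its individual GPUs satisfy every limit at once, rather than when the machine merely advertises some number of GPUs.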

Contributor

@amaltaro amaltaro left a comment

Awesome! Thank you, @khurtado

ad['My.DESIRED_GPUMemoryMB'] = str(job['gpuRequirements']['GPUMemoryMB'])
# CUDACapabilities is a list of strings, with each string matching this regex: r"^\d+.\d$"
# E.g.: ["1.0", "10.0", "2.1"]
cudaCapabilities = sorted(job['gpuRequirements']['CUDACapabilities'], key=float)
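
A small worked example of why `key=float` matters here, using the sample list from the comment above. Taking the first and last elements as the minimum and maximum capability is an assumption about how the sorted list is used, based on the PR description; the variable names are illustrative.

```python
capabilities = ["1.0", "10.0", "2.1"]

# Plain string sorting is lexicographic, so "10.0" sorts before "2.1".
lexicographic = sorted(capabilities)

# Sorting by numeric value puts the versions in the expected order.
numeric = sorted(capabilities, key=float)

# Assumed usage: the ends of the numerically sorted list give the min and max.
minimum_capability = numeric[0]
maximum_capability = numeric[-1]
```

With string sorting the last element would be "2.1", which would understate the maximum capability; the numeric key gives "10.0".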
Contributor

neat!

@amaltaro
Contributor

@khurtado actually, can you please provide a short summary in the PR description?
In addition, please apply the relevant updates to the WMCore documentation. I think the only place for that update is: https://cms-wmcore.docs.cern.ch/wmcore/GPU-Support/#gpu-job-description-in-wmcore-x-glideinwms

@khurtado
Contributor Author

@amaltaro I have updated the description and the documentation:

https://gitlab.cern.ch/dmwm/wmcore-docs/-/merge_requests/93

@amaltaro
Contributor

Thank you, Kenyi!

@amaltaro amaltaro merged commit f0d84d6 into dmwm:master Jul 30, 2025
2 of 3 checks passed
Successfully merging this pull request may close these issues.

Define GPU matchmaking expression between job and machine and rename job GPU parameters
3 participants