Skip to content

fix: jobparameters and jobattributes pydantic models #626

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

aldbr
Copy link
Contributor

@aldbr aldbr commented Aug 7, 2025

Aims at fixing #603 in LHCb

@aldbr aldbr force-pushed the fix_main_jobattr-params-pydantic branch 2 times, most recently from 229214d to 47e87e4 Compare August 7, 2025 10:30
@aldbr aldbr marked this pull request as draft August 7, 2025 11:01
@aldbr
Copy link
Contributor Author

aldbr commented Aug 7, 2025

A few questions, problably for @fstagni:

When it comes to job attributes, it seems to be in order, I haven't noticed any attribute that was not referenced in the pydantic model, both in Dirac and in the extension.

So now the questions are:

  • do we want to be strict on the job metadata we accept? On the one hand, the current approach is very permissive and it's very easy to add new parameters, but on the other hand it's also risky not to have a proper model (e.g. ErrorMessage vs Error Message).
  • If we are strict, do we want to allow extensions to extend the jobs metadata (parameters and attributes)? In LHCbDIRAC, we use LogURL but I am wondering whether it is actually needed.

From what I can see, adding a very large number of fields in a dynamic mapping could lead to a mapping explosion (https://www.elastic.co/docs/troubleshoot/elasticsearch/mapping-explosion) but it looks like we are far from the limit.

@aldbr
Copy link
Contributor Author

aldbr commented Aug 7, 2025

Right now, with this patch, we only set the job metadata that are explicitly defined in the models and we miss many of them:

for pname, pvalue in metadata.model_dump(
by_alias=True, exclude_none=True
).items():
# An argument can be a job attribute and/or a job parameter
# Check if the argument is a valid job attribute (using alias)
if pname in JOB_ATTRIBUTES_ALIASES:
attr_updates[job_id][pname] = pvalue
# Check if the argument is a valid job parameter (using alias)
if pname in JOB_PARAMETERS_ALIASES:
param_updates[job_id][pname] = pvalue

I could just revert the commit, but we might have job metadata that are defined both as attributes and parameters (e.g. JobType) and #626 solved this issue.
Now if we don't have an explicit model, we can't really know whether a given parameter should be defined as a job attribute or a job parameter internally.
In practice, as the JobAttributes model is well defined, we could just assume that unknown metadata should be stored as job parameter, but this is just an assumption. I will implement that solution at the moment, but I think it's fragile.

@aldbr aldbr force-pushed the fix_main_jobattr-params-pydantic branch from 47e87e4 to cbdeff4 Compare August 7, 2025 15:11
@fstagni
Copy link
Contributor

fstagni commented Aug 7, 2025

A few questions, problably for @fstagni:

* Do you know why these fields are typed as long and not as float?
  https://github.com/DIRACGrid/DIRAC/blob/0d9a1649bcd5c3ea1be1702e579ca834d20fc8c1/src/DIRAC/WorkloadManagementSystem/DB/JobParametersDB.py#L20-L21
  Update: they should likely be defined as integers, and be casted as integers by the pydantic model.

CPUNormalizationFactor should probably be a float, NormCPUTime(s) an integer

* Is `AgentLocalSE` still used? Why isn't it included in the OS fields?
  https://github.com/DIRACGrid/DIRAC/blob/0d9a1649bcd5c3ea1be1702e579ca834d20fc8c1/src/DIRAC/WorkloadManagementSystem/JobWrapper/JobWrapper.py#L271-L272

Does not look like it is still used.

* Same question for `JobWrapperPID`:
  https://github.com/DIRACGrid/DIRAC/blob/0d9a1649bcd5c3ea1be1702e579ca834d20fc8c1/src/DIRAC/WorkloadManagementSystem/JobWrapper/JobWrapper.py#L277

Most probably useless by now

* I also see `OutputSandboxMissingFiles`, `OutputSandbox`, `OutputSandboxLFN`, `UploadedOutputData`:
  https://github.com/DIRACGrid/DIRAC/blob/0d9a1649bcd5c3ea1be1702e579ca834d20fc8c1/src/DIRAC/WorkloadManagementSystem/JobWrapper/JobWrapper.py#L944-L947

OutputSandboxLFN is needed. The others I would not know.

* Also in `Watchdog.py`, there are various job parameters not described into OS fields (e.g. `LoadAverage`):
  https://github.com/DIRACGrid/DIRAC/blob/0d9a1649bcd5c3ea1be1702e579ca834d20fc8c1/src/DIRAC/WorkloadManagementSystem/JobWrapper/Watchdog.py#L742-L743
  IIUC there are even multiple parameters that describe the `MemoryUsed`:
  https://github.com/DIRACGrid/DIRAC/blob/0d9a1649bcd5c3ea1be1702e579ca834d20fc8c1/src/DIRAC/WorkloadManagementSystem/JobWrapper/Watchdog.py#L716
  https://github.com/DIRACGrid/DIRAC/blob/0d9a1649bcd5c3ea1be1702e579ca834d20fc8c1/src/DIRAC/WorkloadManagementSystem/JobWrapper/Watchdog.py#L800

I vaguely remember @arrabito making similar question, e.g. check DIRACGrid/DIRAC#7757 and DIRACGrid/DIRAC#7509

* Also in `ReqClient` with `PendingRequest`:
  https://github.com/DIRACGrid/DIRAC/blob/0d9a1649bcd5c3ea1be1702e579ca834d20fc8c1/src/DIRAC/RequestManagementSystem/Client/ReqClient.py#L362

Maybe not used but I would be careful to remove this.

* In `JobAgent` with `MatcherServiceTime` and `ErrorMessage`:
  https://github.com/DIRACGrid/DIRAC/blob/0d9a1649bcd5c3ea1be1702e579ca834d20fc8c1/src/DIRAC/WorkloadManagementSystem/Agent/JobAgent.py#L267
  https://github.com/DIRACGrid/DIRAC/blob/0d9a1649bcd5c3ea1be1702e579ca834d20fc8c1/src/DIRAC/WorkloadManagementSystem/Agent/JobAgent.py#L711

I think these 2 can go

* Also in `PushJobAgent` with `TaskID` and `Stamp`, which are reused to get the status of the jobs after their submission.
  https://github.com/DIRACGrid/DIRAC/blob/0d9a1649bcd5c3ea1be1702e579ca834d20fc8c1/src/DIRAC/WorkloadManagementSystem/Agent/PushJobAgent.py#L637-L638

* And `DownloadInputData`:
  https://github.com/DIRACGrid/DIRAC/blob/0d9a1649bcd5c3ea1be1702e579ca834d20fc8c1/src/DIRAC/WorkloadManagementSystem/Client/DownloadInputData.py#L217

* And `InputDataByProtocol`:
  https://github.com/DIRACGrid/DIRAC/blob/0d9a1649bcd5c3ea1be1702e579ca834d20fc8c1/src/DIRAC/WorkloadManagementSystem/Client/InputDataByProtocol.py#L240

* In `JobWrapperUtilities`, I also see `Error Message` which is not the same as `ErrorMessage` we can find in `JobAgent`:
  https://github.com/DIRACGrid/DIRAC/blob/0d9a1649bcd5c3ea1be1702e579ca834d20fc8c1/src/DIRAC/WorkloadManagementSystem/JobWrapper/JobWrapperUtilities.py#L221

Oh remove this.

* And of course, because it's very permissive as of now, we can have dedicated job parameters in the extensions. In LHCbDIRAC, we find `Log URL`, which seems to be the only LHCb-specific job parameter.

When it comes to job attributes, it seems to be in order, I haven't noticed any attribute that was not referenced in the pydantic model, both in Dirac and in the extension.

That's normal, as job attributes are MySQL fields (parameters instead are free key-value pairs).

So now the questions are:

* do we want to be strict on the job metadata we accept? On the one hand, the current approach is very permissive and it's very easy to add new parameters, but on the other hand it's also risky not to have a proper model (e.g. `ErrorMessage` vs `Error Message`).

* If we are strict, do we want to allow extensions to extend the jobs metadata (parameters and attributes)? In LHCbDIRAC, we use `LogURL` but I am wondering whether it is actually needed.

What do we gain with being strict?

From what I can see, adding a very large number of fields in a dynamic mapping could lead to a mapping explosion (https://www.elastic.co/docs/troubleshoot/elasticsearch/mapping-explosion) but it looks like we are far from the limit.

@aldbr aldbr force-pushed the fix_main_jobattr-params-pydantic branch from cbdeff4 to 2802b6c Compare August 7, 2025 16:08
@aldbr
Copy link
Contributor Author

aldbr commented Aug 8, 2025

CPUNormalizationFactor should probably be a float, NormCPUTime(s) an integer

Then we would need to change the CPUNormalizationFactor field in OS.
Can I just modify it in the code in DIRAC and DiracX? Or do I need to change the mapping manually in OpenSearch?

That's normal, as job attributes are MySQL fields (parameters instead are free key-value pairs).

Indeed, I look at it too fast, I thought they were also free key-value pairs. But the current approach is still potentially "fragile" because extensions that have additional job attributes need to extend the JobAttributes pydantic model.

What do we gain with being strict?

We lose flexibility of course but we gain clarity. As demonstrated in the examples above, by being too flexible:

  • we lose track of the parameters used within the code (one could actually check the OpenSearch instance and see if a given parameter is used or not, but this is not that simple since different communities can rely on specific parameters). Can we easily drop a given parameter without impacting anyone?
  • we have similar but not exactly the same parameters: ErrorMessage vs Error Message.

What do we gain with being flexible? Do we actually need to set anything as a job parameters?
Something we could add to the ADR: why did we decide to have flexible job parameters instead of strict ones (like the job attributes). This is not so obvious to me, sorry.

(not really related but since we talk about them: job attributes could potentially be renamed as JobSchedulingMetadata and job parameters as JobExecutionMetadata, it would require a few changes because right now both concepts are blurry).

@fstagni
Copy link
Contributor

fstagni commented Aug 9, 2025

CPUNormalizationFactor should probably be a float, NormCPUTime(s) an integer

Then we would need to change the CPUNormalizationFactor field in OS. Can I just modify it in the code in DIRAC and DiracX? Or do I need to change the mapping manually in OpenSearch?

The mapping should be changed in OpenSearch

Copy link

read-the-docs-community bot commented Aug 19, 2025

Documentation build overview

📚 diracx | 🛠️ Build #29290995 | 📁 Comparing 359b889 against latest (02dadfa)


🔍 Preview build

No files changed.

@aldbr aldbr force-pushed the fix_main_jobattr-params-pydantic branch from bd77d43 to 47700ae Compare August 19, 2025 14:53
@aldbr aldbr force-pushed the fix_main_jobattr-params-pydantic branch from f10ce87 to 048d0fb Compare August 20, 2025 12:58
@aldbr aldbr force-pushed the fix_main_jobattr-params-pydantic branch from 048d0fb to 359b889 Compare August 22, 2025 11:44
@aldbr aldbr marked this pull request as ready for review August 22, 2025 11:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants