Track TrainJob progress #2779

@astefanutti

What you would like to be added?

This feature proposes that the TrainJob controller periodically "probe" the rank 0 node of each TrainJob to read the job's progress / status and expose it via an API. The progress could be surfaced in each TrainJob's status, via a TrainJob "visibility" aggregated API / APIService, or both.

The progression status should include a percentage (e.g. current steps / total steps) but an ETA would also be very useful.
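
As a rough illustration of the two values mentioned above, here is a minimal sketch that derives a percentage and a naive ETA from step counts. The `Progress` type, its field names, and `compute_progress` are hypothetical and not part of any existing Trainer API; the ETA assumes a constant average step time, which real training runs only approximate.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Progress:
    # Hypothetical field names, for illustration only.
    current_step: int
    total_steps: int
    percent: float
    eta_seconds: Optional[float]  # None until at least one step has completed


def compute_progress(current_step: int, total_steps: int, elapsed_seconds: float) -> Progress:
    """Derive percentage and a linear-extrapolation ETA from step counts."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp so a stale or overshooting counter never reports <0% or >100%.
    current_step = min(max(current_step, 0), total_steps)
    percent = 100.0 * current_step / total_steps
    eta = None
    if current_step > 0:
        # Average time per completed step, multiplied by the remaining steps.
        eta = elapsed_seconds / current_step * (total_steps - current_step)
    return Progress(current_step, total_steps, percent, eta)


if __name__ == "__main__":
    print(compute_progress(current_step=250, total_steps=1000, elapsed_seconds=50.0))
```

A smoothed estimate (for example, a moving average over recent step times) would likely behave better when step time varies, but the linear version keeps the idea visible.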

The implementation outline could be:

  • Define the schema for the progression status API
  • Instrument training loops to periodically write their progression / status in the above format at a known location on their rank 0 node filesystem
    • For custom trainers, provide examples showing how to instrument the training loop, e.g. for HuggingFace Transformers Trainer callbacks
    • For built-in trainers, we may want to seamlessly instrument the runtime
  • Augment the TrainJob controller to periodically exec into the rank 0 nodes to read the status file and update the TrainJob statuses
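
To make the outline above more concrete, here is a minimal sketch of the training-side half: writing a small JSON status file atomically, so that a reader exec'ing into the Pod never sees a half-written file. The path `/tmp/trainjob/progress.json`, the field names, and the `write_status` / `read_status` helpers are all assumptions for illustration; the actual schema and location are exactly what this issue proposes to define.

```python
import json
import os
import tempfile
from typing import Optional

# Hypothetical location; the real "known location" is yet to be agreed on.
STATUS_PATH = "/tmp/trainjob/progress.json"


def write_status(status: dict, path: str = STATUS_PATH) -> None:
    """Write the status as JSON, replacing any previous file atomically."""
    directory = os.path.dirname(path) or "."
    os.makedirs(directory, exist_ok=True)
    # Write to a temp file in the same directory, then rename over the target.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(status, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, path)  # atomic when source and target share a filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def read_status(path: str = STATUS_PATH) -> Optional[dict]:
    """Return the last written status, or None if it is missing or unreadable."""
    try:
        with open(path) as f:
            return json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        return None


if __name__ == "__main__":
    # Hypothetical field names, for illustration only.
    write_status({"currentStep": 250, "totalSteps": 1000})
    print(read_status())
```

In a HuggingFace Transformers setup, something like `write_status` could be called from a `TrainerCallback.on_step_end` hook using `state.global_step` and `state.max_steps`, ideally only on rank 0 and throttled to every N steps. On the controller side, the exec probe would then only need to read this one file.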

One benefit of this approach is that it would not add any extra RBAC / security requirements for the TrainJob Pods, which could still run using the default service account.

Why is this needed?

Model training is an iterative process that's fairly predictable, which makes tracking the progress of train jobs possible, desirable, and useful.

While a training job's progress is usually accessible by reading the logs of the job's rank 0 node, this is not the best user experience for AI practitioners, nor is it the most robust mechanism for clients to access / parse this information.

Love this feature?

Give it a 👍. We prioritize the features with the most 👍.
