Description
What you would like to be added?
This feature proposes that the TrainJob controller periodically "probe" the TrainJob rank 0 node to read the job's progress and status, and expose it through the API. The information could be surfaced in each TrainJob status, through a TrainJob "visibility" aggregated API (APIService), or both.
The progression status should include a percentage (e.g. current steps / total steps) but an ETA would also be very useful.
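As an illustration of the two values mentioned above, here is a minimal sketch of how a percentage and an ETA could be derived from step counts. The `Progress` type and `estimate_progress` function are hypothetical names, not part of any existing Trainer API, and the ETA assumes a roughly constant time per step, which is only an approximation for real training runs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Progress:
    """Hypothetical progress summary; field names are illustrative only."""
    percent: float
    eta_seconds: Optional[float]  # None until at least one step has completed


def estimate_progress(current_step: int, total_steps: int,
                      elapsed_seconds: float) -> Progress:
    """Derive a completion percentage and a naive ETA from step counts.

    The ETA extrapolates the average time per completed step, so it assumes
    steps take roughly the same time throughout the run.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp so a stale or overshooting counter never reports < 0% or > 100%.
    current_step = max(0, min(current_step, total_steps))
    percent = 100.0 * current_step / total_steps
    if current_step == 0:
        # No completed step yet, so there is nothing to extrapolate from.
        return Progress(percent=percent, eta_seconds=None)
    seconds_per_step = elapsed_seconds / current_step
    eta = seconds_per_step * (total_steps - current_step)
    return Progress(percent=percent, eta_seconds=eta)
```

For example, `estimate_progress(250, 1000, 500.0)` corresponds to 25% completion with 750 steps remaining at an average of 2 seconds per step.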
The implementation outline could be:
- Define the schema for the progression status API
- Instrument training loops to periodically write their progression / status in the above format at a known location on their rank 0 node filesystem
- For custom trainers, provide examples showing how to instrument the training loop, e.g. for HuggingFace Transformers Trainer callbacks
- For built-in trainers, we may want to seamlessly instrument the runtime
- Augment the TrainJob controller to periodically exec into the rank 0 nodes to read the status file and update the TrainJob statuses
One benefit of this approach is that it adds no extra RBAC or security requirements for the TrainJob Pods, which could still run using the default service account.
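The instrumentation step in the outline above could be sketched as follows. This is only an illustration under stated assumptions: the JSON field names, the `write_status` helper, and the idea of a fixed status path are all hypothetical, since the schema has yet to be defined. The file is written to a temporary name and then renamed, so a controller reading it through an exec never observes a half-written file.

```python
import json
import os
import tempfile
import time


def write_status(path, current_step, total_steps, eta_seconds=None):
    """Atomically write a progress status file (hypothetical schema)."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    status = {
        # Field names are assumptions; the real schema is still to be designed.
        "currentStep": current_step,
        "totalSteps": total_steps,
        "percent": round(100.0 * current_step / total_steps, 2),
        "etaSeconds": eta_seconds,
        "updatedAt": int(time.time()),
    }
    directory = os.path.dirname(os.path.abspath(path))
    os.makedirs(directory, exist_ok=True)
    # Write to a temp file in the same directory, then rename over the target:
    # os.replace is atomic when source and destination share a filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".status-")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(status, f)
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return status
```

With HuggingFace Transformers, a `TrainerCallback` could call such a helper from `on_step_end`, passing `state.global_step` and `state.max_steps`, and only on the rank 0 process. The controller side would then only need to read the file at the agreed location.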
Why is this needed?
Model training is an iterative and fairly predictable process, which makes tracking the progress of train jobs possible, desirable, and useful.
While a training job's progress is usually accessible by reading the logs of the job's rank 0 node, this is not the best user experience for AI practitioners, nor is it the most robust mechanism for clients to access and parse this information.
Love this feature?
Give it a 👍 We prioritize the features with most 👍