Track TrainJob progress

### What you would like to be added?

This feature proposes that the TrainJob controller periodically "probe" the rank 0 node of each TrainJob to read the job's progress / status and expose it via the API. The information could be surfaced in each TrainJob's status, via a TrainJob "visibility" aggregated API / APIService, or both.

The progression status should include a completion percentage (e.g. current steps / total steps). An ETA would also be very useful.
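
As a rough illustration of how these two values could be derived, here is a minimal Python sketch. It assumes a constant average step rate since the start of training, which is a simplification. The function name `estimate_progress` and the returned field names are hypothetical and not part of any existing API.

```python
from typing import Optional


def estimate_progress(current_step: int, total_steps: int, elapsed_seconds: float) -> dict:
    """Return the completion percentage and a naive ETA (hypothetical field names)."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp so that a late or duplicated report never yields a value outside 0..100 %.
    current_step = max(0, min(current_step, total_steps))
    percent = 100.0 * current_step / total_steps

    # No ETA can be estimated before the first step has completed.
    eta_seconds: Optional[float] = None
    if current_step > 0:
        seconds_per_step = elapsed_seconds / current_step
        eta_seconds = seconds_per_step * (total_steps - current_step)
    return {"percent": percent, "eta_seconds": eta_seconds}
```

A real implementation would probably smooth the step rate over a recent window, since warm-up, evaluation, and checkpointing steps can skew a global average.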

The implementation outline could be:
* Define the schema for the progression status API
* Instrument training loops to periodically write their progression / status in the above format at a known location on their rank 0 node filesystem
  * For custom trainers, provide examples showing how to instrument the training loop, e.g. using HuggingFace Transformers Trainer callbacks
  * For built-in trainers, we may want to seamlessly instrument the runtime
* Augment the TrainJob controller to periodically exec into the rank 0 nodes to read the status file and update the TrainJob statuses
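
The instrumentation step in the outline above could look roughly like the following Python sketch. Everything in it is an assumption made for illustration: the schema has not been defined yet, so the file path `/tmp/trainjob-progress.json`, the JSON field names, and the helper `write_progress` are all hypothetical. The snapshot is written to a temporary file and then moved into place with `os.replace`, so a reader that execs into the Pod should never see a partially written file. The rank check relies on the `RANK` environment variable that `torchrun` sets; other launchers may expose the rank differently.

```python
import json
import os
import tempfile
import time

# Hypothetical location; the proposal only says "a known location".
PROGRESS_FILE = "/tmp/trainjob-progress.json"


def is_rank_zero() -> bool:
    # torchrun exports RANK; default to 0 for single-process runs.
    return int(os.environ.get("RANK", "0")) == 0


def write_progress(current_step: int, total_steps: int, path: str = PROGRESS_FILE) -> bool:
    """Atomically write a progress snapshot. Returns False on non-zero ranks."""
    if not is_rank_zero():
        return False
    # Hypothetical field names, pending the actual schema definition.
    status = {
        "currentStep": current_step,
        "totalSteps": total_steps,
        "lastUpdateTime": time.time(),
    }
    # Create the temporary file in the target directory so os.replace
    # stays on one filesystem and remains atomic.
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".progress-")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(status, tmp)
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temporary file if anything went wrong before the move.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return True
```

With HuggingFace Transformers, a `TrainerCallback` subclass could call `write_progress(state.global_step, state.max_steps)` from its `on_step_end` hook, ideally throttled to every N steps to keep the overhead negligible.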

One benefit of this approach is that it adds no extra RBAC / security requirements for the TrainJob Pods, which can still run using the default service account.

### Why is this needed?

Model training is an iterative process that's fairly predictable, which makes tracking the progression of train jobs possible, desirable, and useful.

While a training job's progression is usually accessible by reading the logs of the job's rank 0 node, that might not be the best user experience for AI practitioners, nor does it provide the most robust mechanism for clients to access / parse this information.

### Love this feature?

Give it a 👍. We prioritize the features with the most 👍.

Track TrainJob progress #2779
