Initial draft of KEP 6089: Workload Aware Scheduling Controller APIs #6092

Open
mm4tt wants to merge 5 commits into kubernetes:master from mm4tt:kep_6089_was_controller_apis

Conversation

@mm4tt
Contributor

@mm4tt mm4tt commented May 18, 2026

  • One-line PR description: Initial draft of KEP 6089: Workload Aware Scheduling Controller APIs
  • Other comments:

@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 18, 2026
@k8s-ci-robot
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. labels May 18, 2026
@github-project-automation github-project-automation Bot moved this to Needs Triage in SIG Scheduling May 18, 2026
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels May 18, 2026
@wojtek-t wojtek-t self-assigned this May 19, 2026
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: mm4tt
Once this PR has been reviewed and has the lgtm label, please ask for approval from wojtek-t and additionally assign macsko for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels May 19, 2026
@mm4tt mm4tt force-pushed the kep_6089_was_controller_apis branch 4 times, most recently from a6def00 to a4b2b49 Compare May 20, 2026 13:00
@mm4tt mm4tt force-pushed the kep_6089_was_controller_apis branch from a4b2b49 to 619ee1e Compare May 21, 2026 12:02

Member

@wojtek-t wojtek-t left a comment

Great KEP - I added a few questions & concerns that I still have.


## Summary

This KEP proposes a standardized set of reusable API building blocks (`scheduling.k8s.io`), integration guidelines, and shared libraries to simplify how workload controllers (e.g., JobSet, TrainJob, RayJob, LWS, as well as core workloads like Job) integrate with Workload-Aware Scheduling (WAS).

Member

Can you please keep the lines shorter (it helps with readability and having more targeted comments/adjustments).

Contributor Author

Done.

* **The Controller owns the Structure:** The real-workload controller (e.g., `JobSet` or `LWS`) fully understands its own shape, hierarchy, and replication mechanics. It does not need the user to manually repeat this structure to the scheduler.
* **The User owns the Policy:** The user knows *how* they want the workload to be scheduled based on their specific environment (e.g., "I want gang scheduling", "I need these workers colocated on the same network rack").
* **The Controller acts as a Translator:** The real-workload controller consumes the user's high-level policy intent, combines it with its own structural knowledge, and acts as a compiler to generate the low-level `Workload` objects for the scheduler.
* **Universal Representation:** Legacy, standard pod-by-pod scheduling is represented natively as a first-class citizen (`Basic` mode). Controllers always generate the underlying `Workload` objects, using basic scheduling as the backward-compatible default for standalone Jobs.

Member

nit: s/standalone jobs/true workloads/ ?

Contributor Author

Done.
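
The "controller acts as a translator" principle in the excerpt above can be illustrated with a minimal Go sketch. Every type and function here (`Policy`, `Group`, `PodGroupTemplate`, `Workload`, `Compile`) is a simplified, hypothetical stand-in and not the API proposed in the KEP; the choice to default the gang size to the full replica count is also an assumption made for illustration.

```go
package main

import "fmt"

// SchedulingMode is a hypothetical stand-in for the user's policy choice.
type SchedulingMode string

const (
	Basic SchedulingMode = "Basic"
	Gang  SchedulingMode = "Gang"
)

// Policy is what the user owns: how the workload should be scheduled.
type Policy struct {
	Mode SchedulingMode
}

// Group is one structural unit the controller already knows about.
type Group struct {
	Name     string
	Replicas int32
}

// PodGroupTemplate is a simplified stand-in for an entry in a Workload.
type PodGroupTemplate struct {
	Name     string
	Mode     SchedulingMode
	MinCount int32 // only set for Gang in this sketch
}

// Workload is a simplified stand-in for the low-level scheduler object.
type Workload struct {
	Templates []PodGroupTemplate
}

// Compile merges user policy with controller-known structure.
// A nil policy falls back to Basic, the backward-compatible default.
func Compile(p *Policy, groups []Group) Workload {
	mode := Basic
	if p != nil && p.Mode != "" {
		mode = p.Mode
	}
	var w Workload
	for _, g := range groups {
		t := PodGroupTemplate{Name: g.Name, Mode: mode}
		if mode == Gang {
			// Assumption: gang size defaults to the full replica count.
			t.MinCount = g.Replicas
		}
		w.Templates = append(w.Templates, t)
	}
	return w
}

func main() {
	groups := []Group{{Name: "workers", Replicas: 4}}
	fmt.Printf("%+v\n", Compile(nil, groups))
	fmt.Printf("%+v\n", Compile(&Policy{Mode: Gang}, groups))
}
```

The point of the sketch is the division of inputs: the user supplies only `Policy`, the controller supplies the groups, and a `Workload` is always produced, even when no policy is given.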


We introduce a set of standard, reusable structs in the `scheduling.k8s.io` API group. These building blocks represent the core capabilities of Workload-Aware Scheduling. They are designed to be embedded directly into higher-level, controller-specific wrapper API structs.

Importantly, these structures are **hierarchy-agnostic**. The same primitives are used regardless of where they are embedded in a workload tree - whether at the `PodGroup` or `CompositePodGroup` level. TODO(mm4tt@): We should likely abandon this approach and have primitives per PodGroup/CompositePodGroup level. We needed an agnostic hierarchy when we thought we'd be modeling a multi-level tree with PodSubGroup. In that case, depending on the context, a Job could be modeled either as a PodGroup (standalone Job) or as a PodSubGroup (member of a JobSet). With CompositePodGroup the problem disappears and a Job will always be modeled as a PodGroup.

Member

I'm not sure I fully agree with this reasoning - for me the change of PodGroup:PodSubGroup vs CompositePodGroup:PodGroup doesn't really change anything wrt it.

I think what really matters is the concepts that you want to express. Taking TAS as an example - if I want to reflect that a certain portion of my workload needs to be collocated - it doesn't really matter if it will be represented with PG or CPG - the intention is exactly the same and I shouldn't need to learn two different concepts to reflect that.

I can imagine some concepts where the difference will matter - but it should be a case-by-case decision. The default should be the "hierarchy-agnostic approach", and we should split only when we have good arguments that a concept means different things at different levels.

Contributor Author

I addressed this TODO, now we have per-level types as discussed.

}

// WorkloadTopologyConstraint describes a desired topological colocation for all pods in the group.
type WorkloadTopologyConstraint struct {

Member

I understand why we want to have a separate WorkloadSchedulingConstraints struct.

But arguably, the TopologyConstraint should actually be shared (we shouldn't introduce WorkloadTopologyConstraint here) - because that's a concept that reflects the topology (and thus by definition can't diverge from what you reflect in PodGroup).

Contributor

I see two potential approaches here:

  1. We just re-use the existing structs from scheduling/v1alpha3.
  2. We introduce entirely new API group with all the share-able APIs, and then the library will be responsible for translating that into the core scheduling primitives.

Duplicating everything under scheduling group is not the right approach, just like Wojtek says. Maybe worth double checking with API approvers?

Contributor Author

Agreed. TopologyConstraint is now shared.

> We introduce entirely new API group with all the share-able APIs, and then the library will be responsible for translating that into the core scheduling primitives.

I discussed this with @liggitt, and his recommendation was to have the building blocks under the scheduling API group.

// +optional
// +k8s:optional
// +k8s:unionMember
Single *WorkloadSingleDisruptionMode `json:"single,omitempty" protobuf:"bytes,1,opt,name=single"`

Member

nit: indentation

Contributor Author

Done.

During updates to an active `Job`, the API server validation strategy enforces the following rules:
* **Immutable Fields (Updates are explicitly rejected):**
* Modifying the scheduling mode itself (e.g. changing an active Job from `Basic` to `Gang` scheduling or vice-versa).
* Updating `SchedulingConstraints` (topology co-location rules) or `DisruptionMode` after the Job has started.

Member

If we forbid them only after the job is started, you have an inherent race that the job-controller may have already started it (just didn't yet report it).

Why can't we just forbid any mutation for now?

Or at the very least, say that we forbid it for non-suspended jobs.

Contributor

I agree with Wojtek here, let's start simple with forbidding mutation; it's easier to relax that validation at a later stage than to fix eventual problems.

Contributor Author

Done.

* Updating `SchedulingConstraints` (topology co-location rules) or `DisruptionMode` after the Job has started.
* Changing dynamic `ResourceClaims` on running workloads.
* **Mutable Fields (Updates are allowed):**
* Modifying the `WorkloadGangSchedulingPolicy.MinCount` parameter. This can be done either explicitly by updating the `MinCount` field inside the JobSpec, or implicitly by scaling `spec.parallelism` (which automatically adjusts the default `MinCount`).

Member

A parallelism change would trigger it only if the MinCount is unset, right?

Contributor Author

Yes, made it clear in the text.
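
The update rules discussed in this thread (scheduling mode is immutable, `MinCount` is mutable, and `spec.parallelism` drives `MinCount` only when `MinCount` is unset) can be sketched as follows. This is an illustration under assumptions: `JobScheduling`, `EffectiveMinCount`, `ValidateUpdate`, and `ErrModeImmutable` are hypothetical names, and the real validation would live in the API server strategy rather than in a standalone function.

```go
package main

import (
	"errors"
	"fmt"
)

// JobScheduling is a hypothetical, flattened view of the Job's scheduling fields.
type JobScheduling struct {
	Mode        string // "Basic" or "Gang"
	Parallelism int32
	MinCount    *int32 // nil means "not set by the user"
}

// ErrModeImmutable is returned when an update tries to switch scheduling mode.
var ErrModeImmutable = errors.New("scheduling mode is immutable")

// EffectiveMinCount returns the explicit MinCount when set,
// and falls back to parallelism only when it is unset.
func EffectiveMinCount(s JobScheduling) int32 {
	if s.MinCount != nil {
		return *s.MinCount
	}
	return s.Parallelism
}

// ValidateUpdate rejects a mode change; MinCount and parallelism may change.
func ValidateUpdate(oldSpec, newSpec JobScheduling) error {
	if oldSpec.Mode != newSpec.Mode {
		return ErrModeImmutable
	}
	return nil
}

func main() {
	explicit := int32(3)
	pinned := JobScheduling{Mode: "Gang", Parallelism: 8, MinCount: &explicit}
	defaulted := JobScheduling{Mode: "Gang", Parallelism: 8}
	fmt.Println(EffectiveMinCount(pinned), EffectiveMinCount(defaulted))
	fmt.Println(ValidateUpdate(defaulted, JobScheduling{Mode: "Basic", Parallelism: 8}))
}
```

Using a pointer for `MinCount` is what lets the sketch distinguish "unset" from an explicit value, which is the distinction the reviewer asked about.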

1. **Detection:** The Job controller's reconcile loop detects the change and fetches the existing `Workload` resource from the API server.
2. **Tree Reconstruction:** It reconstructs the logical `WorkloadNode` tree, automatically feeding the new `parallelism` count into the `DefaultGangPolicy.MinCount` fallback.
3. **API Update:** Since `MinCount` in the low-level scheduler `Workload` API is mutable in v1.37+, the Job controller performs an **API Update** on the existing `Workload` resource.
4. **PodGroup Sync:** The Job controller propagates the new `MinCount` to the corresponding runtime `PodGroup` resource to ensure the scheduler immediately schedules the newly scaled pods.

Member

While this makes sense for Job, it's not the universal pattern. In many cases we don't want to update all existing PodGroups with the new param when it was updated.

I suggest making it clear that it's more of an exception than recommendation here.

Contributor Author

Done.
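
The last two steps of the flow above (API Update, then PodGroup Sync) can be sketched as an idempotent function. `Workload`, `PodGroup`, and `SyncMinCount` are hypothetical in-memory stand-ins; a real controller would issue API calls instead of mutating structs. The `syncPodGroup` flag is an assumption added to reflect the review comment that propagating to existing PodGroups is a Job-specific exception rather than a universal pattern.

```go
package main

import "fmt"

// Workload and PodGroup are simplified stand-ins that only carry MinCount.
type Workload struct{ MinCount int32 }
type PodGroup struct{ MinCount int32 }

// SyncMinCount applies the desired MinCount to the Workload and, only when
// syncPodGroup is true, to the already-created runtime PodGroup as well.
// It reports whether any object was changed.
func SyncMinCount(want int32, w *Workload, pg *PodGroup, syncPodGroup bool) bool {
	changed := false
	if w.MinCount != want {
		w.MinCount = want
		changed = true
	}
	if syncPodGroup && pg.MinCount != want {
		pg.MinCount = want
		changed = true
	}
	return changed
}

func main() {
	w, pg := &Workload{MinCount: 4}, &PodGroup{MinCount: 4}
	fmt.Println(SyncMinCount(8, w, pg, true), w.MinCount, pg.MinCount)
	fmt.Println(SyncMinCount(8, w, pg, true))
}
```

Returning a "changed" flag keeps a repeated reconcile pass a no-op, which is the usual expectation for a controller loop.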



#### 2. Downward Workload Template Mapping via Well-Known Annotations
If a composite controller delegates runtime `PodGroup` management to child execution controllers, we must solve a crucial coordination problem: **How does a child controller know exactly which `PodGroupTemplate` inside the parent's compiled `Workload` corresponds to its pods?**

Member

I think we need more than that - we need to pass two separate pieces of information:

  1. [what you describe] Which PodGroupTemplate/CompositePodGroupTemplate should be used by the child workload to create the corresponding PG/CPG

  2. [what's missing] What is the parent CompositePodGroup to which the newly created PG/CPG should be connected.
    This may be discoverable if we have 1:1 relation of CPGT<->CPG, but if we create multiple CPGs from a given template, this has to be passed.
    The best example for that is probably LWS, where CPG will be created by LWS controller per replica (so there can be many of them) and they are all created from the same CPGT.

Contributor Author

Great point! Added.

##### The Solution: Downward Mapping Annotation
To resolve this template linkage, the root and intermediate orchestrators must pass down the template mapping using a well-known metadata annotation on child templates:
* **Annotation Key:** `scheduling.k8s.io/pod-group-template`
* **Value:** The specific mapping identifier that links the child object to its corresponding `PodGroupTemplate` or `CompositePodGroupTemplate` within the n-level parent `Workload` resource.

Member

We already made a decision that names across all PGTs/CPGTs within a workload will be unique.

So we should clearly state that it should be the name of the PGT/CPGT.

Contributor Author

Done.
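
A child controller's side of the downward mapping can be sketched as a lookup by template name. The annotation key is taken from the excerpt above; everything else (`ResolveTemplate`, `ErrNoTemplateAnnotation`, and representing the parent `Workload` as a flat list of template names) is a hypothetical simplification. The flat list relies on the decision mentioned in the review that PGT/CPGT names are unique within a workload.

```go
package main

import (
	"errors"
	"fmt"
)

// PodGroupTemplateAnnotation is the key quoted in the excerpt above.
const PodGroupTemplateAnnotation = "scheduling.k8s.io/pod-group-template"

// ErrNoTemplateAnnotation signals that the parent did not pass a mapping down.
var ErrNoTemplateAnnotation = errors.New("pod-group-template annotation is missing")

// ResolveTemplate returns the template name the child should use, after
// checking that the parent's Workload actually contains it. templateNames is
// a hypothetical flattened list of PGT/CPGT names from that Workload.
func ResolveTemplate(annotations map[string]string, templateNames []string) (string, error) {
	name := annotations[PodGroupTemplateAnnotation]
	if name == "" {
		return "", ErrNoTemplateAnnotation
	}
	for _, n := range templateNames {
		if n == name {
			return name, nil
		}
	}
	return "", fmt.Errorf("template %q not found in parent Workload", name)
}

func main() {
	child := map[string]string{PodGroupTemplateAnnotation: "workers"}
	name, err := ResolveTemplate(child, []string{"driver", "workers"})
	fmt.Println(name, err)
}
```

The sketch covers only the template lookup; the second piece of information raised in the review (which parent `CompositePodGroup` to attach to) would need its own annotation, whose key is not given in the excerpt.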

- [KEP-4671: Gang Scheduling using Workload Object](https://kep.k8s.io/4671)
- [KEP-5710: Workload-aware preemption](https://kep.k8s.io/5710)
- [KEP-5732: Topology-aware workload scheduling](https://kep.k8s.io/5732)
- [KEP-6017: CompositePodGroup API](https://kep.k8s.io/6017)

Contributor

Suggested change
- [KEP-6017: CompositePodGroup API](https://kep.k8s.io/6017)
- [KEP-6012: CompositePodGroup API](https://kep.k8s.io/6012)

is the actual tracking KEP :wink:

Contributor Author

Done.

type JobSchedulingConfiguration struct {
// SchedulingPolicy defines the gang or basic scheduling rules for this Job.
// +optional
SchedulingPolicy *schedulingv1alpha3.WorkloadSchedulingPolicy `json:"schedulingPolicy,omitempty"`

Contributor

If the library provides validation, I'd assume we automatically inherit all the changes in the core by virtue of using them, no? Whether we should adjust is probably a separate question that we'll likely consider case by case; the part I'm more worried about is whether we will remember.

ResourceClaimTemplateName *string `json:"resourceClaimTemplateName,omitempty"`
}
```
### Job Integration (batch/v1)

Contributor

Quick note from my conversation with @mm4tt, this is probably the best place to link to from KEP-5547.

Contributor Author

Yeah, @helayoty will do that

@mm4tt mm4tt marked this pull request as ready for review May 28, 2026 16:00
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 28, 2026
@k8s-ci-robot k8s-ci-robot requested a review from palnabarun May 28, 2026 16:00
@mm4tt mm4tt force-pushed the kep_6089_was_controller_apis branch from c0af910 to 4bb5ff3 Compare May 28, 2026 16:11
Signed-off-by: Heba Elayoty <heelayot@microsoft.com>
@mm4tt mm4tt mentioned this pull request May 28, 2026
4 tasks
@helayoty
Member

/area workload-aware

@k8s-ci-robot k8s-ci-robot added the area/workload-aware Categorizes an issue or PR as relevant to Workload-aware and Topology-aware scheduling subprojects. label May 28, 2026
@helayoty helayoty moved this from Backlog to Needs Review in Workload-aware & Topology-aware Workstream May 28, 2026

Member

@wojtek-t wojtek-t left a comment

I have a couple more questions, but overall this looks great now!

owning-sig: sig-scheduling
participating-sigs:
- sig-apps
status: provisional

Member

implementable

`replicatedJobs` and their parallelism, whereas a single child `Job` only knows its own pods).
* **Ownership & Skip Logic:** Child controllers (like standard `Job`) observe their
`OwnerReference` pointing to a registered parent workload and explicitly **bypass** creating
any `Workload` objects. This prevents duplicate resource creation and guarantees a single

Member

But they still may need to create a PodGroup. I think it's worth mentioning here.
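
The ownership check described in the excerpt can be sketched as a small predicate. This is only an illustration: `OwnerRef` is a trimmed-down stand-in for `metav1.OwnerReference`, and `registeredParents` is a hypothetical registry, since the excerpt does not say how parent workloads are registered.

```go
package main

import "fmt"

// OwnerRef is a trimmed-down stand-in for metav1.OwnerReference.
type OwnerRef struct {
	Kind       string
	Name       string
	Controller bool
}

// registeredParents is a hypothetical registry of parent kinds that compile
// the Workload themselves; the KEP excerpt does not define how it is populated.
var registeredParents = map[string]bool{"JobSet": true}

// ShouldCreateWorkload reports whether a child controller should create its
// own Workload. It returns false when a registered parent controls the object.
func ShouldCreateWorkload(owners []OwnerRef) bool {
	for _, o := range owners {
		if o.Controller && registeredParents[o.Kind] {
			return false
		}
	}
	return true
}

func main() {
	standalone := []OwnerRef{}
	owned := []OwnerRef{{Kind: "JobSet", Name: "train", Controller: true}}
	fmt.Println(ShouldCreateWorkload(standalone), ShouldCreateWorkload(owned))
}
```

The predicate deliberately answers only the `Workload` question. As the comment above notes, a child that skips `Workload` creation may still need to create its runtime `PodGroup`, so that decision would be made separately.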

source of truth.
* **Separation of Structure and Policy:** The integration strictly separates real-workload
structure from scheduling policies:
* **The Controller owns the Structure:** The real-workload controller (e.g., `JobSet` or `LWS`)

Member

It's not the controller that owns the structure - it's the API.

So maybe:
"The true workload API owns the structure"
?


This level-specific categorization allows independent API evolution.

The only exception to this division is the `TopologyConstraint` struct (reused directly from

Member

It's the only exception now, but may not be the only exception eventually.

I don't know how to phrase that concisely, but basically if something is representing the "real-world" concept that generally is used verbatim by the scheduling stack - we should reuse it.
If it's more of an abstraction introduced by us - we should duplicate, because we may want to use that differently.

// API Group: scheduling.k8s.io/v1alpha3

// WorkloadPodGroupSchedulingConstraints defines leaf-level scheduling constraints, such as topology.
type WorkloadPodGroupSchedulingConstraints struct {

Member

unconstructive comment:
I'm not a fan of those names, because they are becoming super long:
WorkloadCompositePodGroupSchedulingConstraints
is 46 characters :)

But as I said - it's unconstructive - I don't have a better suggestion :)


To resolve this template and hierarchy mapping without structural API schema changes, the root and
intermediate orchestrators must propagate these linkages downwards using two well-known metadata
annotations on the child object templates:

Member

What do you mean by "on the child object templates"?

My assumption is that (taking JobSet as an example), JobSet controller will be setting those two annotations on the Job object when creating it.

If that's true - can you clarify in the KEP? If not, can you explain?


#### 2. Library API Definition

```go

Member

Where will the library live?

To facilitate imports from arbitrary ecosystem projects, it definitely can't be k/k - it probably has to be some staging repo.
The question is - do we need a new staging repo for it?

kube-scheduler repo is not a good fit.
The only other alternative that I see is "component-helpers", but I'm not sure it's a good fit.

@liggitt - thoughts?

Labels

area/workload-aware Categorizes an issue or PR as relevant to Workload-aware and Topology-aware scheduling subprojects. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

Status: Needs Triage

6 participants