Expired
Milestone
May 18, 2020–Jun 17, 2020
Automatic rolling node updates
The goal is to automate ComputeNode updates/changes. We define two new kinds of update. ColdChanges are updates that can only be applied to nodes that have been drained beforehand; otherwise jobs would fail. For example, anything that includes a reboot is a ColdChange. HotChanges, such as package installs or most security updates, should not affect running jobs. Changes that require a full outage, such as unmounting the main filesystem, are out of scope.
From a higher-level perspective, this can be thought of as "similar to what Puppet does", but aware of Slurm and its jobs.
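The distinction between the two kinds of change can be sketched as a small classifier. This is only an illustration under stated assumptions: the `Change` fields, the `classify` and `plan_steps` functions, and the step names are hypothetical and do not come from the milestone itself.

```python
from dataclasses import dataclass
from enum import Enum


class ChangeKind(Enum):
    HOT = "hot"    # may be applied while jobs are running
    COLD = "cold"  # node must be drained beforehand


@dataclass(frozen=True)
class Change:
    # Hypothetical description of one update; field names are assumptions.
    name: str
    needs_reboot: bool = False
    disrupts_jobs: bool = False


def classify(change: Change) -> ChangeKind:
    """Anything that reboots the node or would disrupt jobs is a ColdChange."""
    if change.needs_reboot or change.disrupts_jobs:
        return ChangeKind.COLD
    return ChangeKind.HOT


def plan_steps(change: Change) -> list[str]:
    """Return the ordered, abstract steps for applying a change to one node.

    With Slurm, "drain" and "resume" would typically map to
    `scontrol update NodeName=<node> State=DRAIN Reason=<text>` and
    `scontrol update NodeName=<node> State=RESUME`.
    """
    if classify(change) is ChangeKind.HOT:
        return ["apply"]
    return ["drain", "wait_for_jobs", "apply", "resume"]


if __name__ == "__main__":
    for change in (Change("security-update"), Change("kernel", needs_reboot=True)):
        print(change.name, classify(change).value, plan_steps(change))
```

Keeping the plan as abstract step names separates the decision (hot or cold) from how each step is executed, so the same logic could be exercised on the test cluster before touching M3.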
Rollout phases:
Phase | Implemented in version/stage (features to be assigned to a version) | |
---|---|---|
Phase 1 | - test-cluster only, perform HotChanges automatically, ensure zero jobs fail | |
Phase 1 | - on M3 perform CanaryTesting and HotChanges automatically with nodelist constraints. | |
Phase 2 | - on the test cluster, work on feature completion, rolling updates towards CentOS 7.7, and full documentation | |
Phase 3 | - get ops-team buy-in AND support | |
Phase 3 | - on M3 full HotChange rollout | |
Phase 3 | - on M3 limited DrainChange rollout with approved rolling-heuristic | |
Phase 3 | - on M3 full DrainChange rollout | |
V2.0 |
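The milestone does not define the rolling-heuristic or the nodelist constraints, so the following is only one possible sketch. It assumes a canary batch goes first and that later batches are capped at a fixed number of nodes; `rolling_batches`, `allowed`, `canary_count`, and `max_in_flight` are hypothetical names.

```python
from typing import Iterable, Optional


def rolling_batches(
    nodes: Iterable[str],
    allowed: Optional[Iterable[str]] = None,
    canary_count: int = 1,
    max_in_flight: int = 2,
) -> list[list[str]]:
    """Split nodes into a canary batch followed by size-limited batches.

    `allowed` models a nodelist constraint: when given, only nodes in it
    are updated. Input order is preserved.
    """
    if canary_count < 0 or max_in_flight < 1:
        raise ValueError("canary_count must be >= 0 and max_in_flight >= 1")

    candidates = list(nodes)
    if allowed is not None:
        allowed_set = set(allowed)
        candidates = [n for n in candidates if n in allowed_set]

    canaries = candidates[:canary_count]
    rest = candidates[canary_count:]

    batches = [canaries] if canaries else []
    for start in range(0, len(rest), max_in_flight):
        batches.append(rest[start:start + max_in_flight])
    return batches


if __name__ == "__main__":
    cluster = [f"node{i:02d}" for i in range(6)]
    for batch in rolling_batches(cluster, allowed=cluster[:5]):
        print(batch)
```

A driver would apply a change to one batch, check that no jobs failed, and only then continue with the next batch; stopping after the canary batch corresponds to the CanaryTesting step in Phase 1.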