Expired
Milestone May 18, 2020–Jun 17, 2020

Automatic rolling node updates

The intention is to automate ComputeNode updates and changes. We define two new cases for updates. ColdChanges are updates that can only be applied to nodes which have been drained beforehand; otherwise Jobs would fail. For example, anything that includes a reboot is a ColdChange. HotChanges, such as package installs or most security updates, should not affect Jobs. Changes requiring a full outage, like unmounting the main filesystem, are out of scope.
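The distinction above can be sketched as a small classifier. This is only an illustration: the `Change` fields and the `classify` function are hypothetical names, since the milestone does not define a schema for changes.

```python
from dataclasses import dataclass
from enum import Enum


class ChangeKind(Enum):
    HOT = "HotChange"            # safe to apply while Jobs are running
    COLD = "ColdChange"          # node must be drained beforehand
    OUT_OF_SCOPE = "OutOfScope"  # would need a full outage


@dataclass(frozen=True)
class Change:
    # All fields are assumptions made for this sketch.
    name: str
    needs_reboot: bool = False
    disrupts_running_jobs: bool = False
    needs_full_outage: bool = False


def classify(change: Change) -> ChangeKind:
    """Map a change onto the cases described in the milestone text."""
    if change.needs_full_outage:
        return ChangeKind.OUT_OF_SCOPE
    if change.needs_reboot or change.disrupts_running_jobs:
        return ChangeKind.COLD
    return ChangeKind.HOT
```

For example, `classify(Change("kernel update", needs_reboot=True))` yields `ChangeKind.COLD`, while a plain package install with default flags is treated as a HotChange.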

From a higher-level perspective, this can be thought of as "similar to what Puppet does", but aware of Slurm and Jobs.
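A minimal sketch of what "aware of Slurm and Jobs" could mean for a ColdChange: drain the node, apply the change only once no jobs remain on it, then resume it. The `scontrol update ... State=DRAIN` and `squeue --nodelist` invocations are standard Slurm commands; the function names, the `runner` parameter, and the retry-later behaviour are assumptions of this sketch, not the project's implementation.

```python
import subprocess
from typing import Callable, List

Runner = Callable[[List[str]], str]


def run(cmd: List[str]) -> str:
    """Run a command and return its stdout (raises on non-zero exit)."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout


def jobs_on_node(node: str, runner: Runner = run) -> List[str]:
    """Return the IDs of jobs currently allocated to the node."""
    out = runner(["squeue", "--noheader", "--nodelist", node, "--format", "%i"])
    return [line.strip() for line in out.splitlines() if line.strip()]


def drain(node: str, reason: str, runner: Runner = run) -> None:
    """Stop new jobs from starting on the node; running jobs continue."""
    runner(["scontrol", "update", f"NodeName={node}", "State=DRAIN", f"Reason={reason}"])


def resume(node: str, runner: Runner = run) -> None:
    """Return the node to service."""
    runner(["scontrol", "update", f"NodeName={node}", "State=RESUME"])


def apply_cold_change(node: str, apply: Callable[[str], None], runner: Runner = run) -> bool:
    """Drain the node and apply the change only if it is idle.

    Returns True if the change was applied, False if jobs are still
    running and the caller should try again later.
    """
    drain(node, "rolling update", runner)
    if jobs_on_node(node, runner):
        return False
    apply(node)
    resume(node, runner)
    return True
```

Passing the command runner as a parameter keeps the logic testable on a machine without Slurm installed.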

Rollout phases:

Phases and what we implement in each version/stage (features still to be assigned to a version):
Phase 1 - test-cluster only, perform HotChanges automatically, ensure zero jobs fail
Phase 1 - on M3 perform CanaryTesting and HotChanges automatically with nodelist constraints.
Phase 2 - on the test cluster, work on feature completion, rolling updates towards CentOS 7.7, and full documentation
Phase 3 - get ops-team buy-in AND support
Phase 3 - on M3 full HotChange rollout
Phase 3 - on M3 limited DrainChange rollout with approved rolling-heuristic
Phase 3 - on M3 full DrainChange rollout
V2.0
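The phases mention CanaryTesting and nodelist constraints without defining them. One possible reading, sketched below, is to restrict the rollout to an allowed set of nodes, update a small canary group first, and then proceed in fixed-size batches. The function name `plan_rollout` and its parameters are hypothetical, and the fixed batch size merely stands in for whatever rolling-heuristic is eventually approved.

```python
from typing import Iterable, List, Set


def plan_rollout(
    nodes: Iterable[str],
    allowed: Set[str],
    canaries: int = 1,
    batch_size: int = 2,
) -> List[List[str]]:
    """Return an ordered list of node batches, canary batch first.

    Nodes outside `allowed` (the nodelist constraint) are skipped.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    eligible = [n for n in nodes if n in allowed]
    plan: List[List[str]] = []
    canary_batch = eligible[:canaries]
    if canary_batch:
        plan.append(canary_batch)
    rest = eligible[canaries:]
    for i in range(0, len(rest), batch_size):
        plan.append(rest[i:i + batch_size])
    return plan
```

A caller would apply the change to each batch in order and stop as soon as a batch reports a failed job, so that a bad change never reaches more than the canary group.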


  • Work items 18
  • Merge requests 0
  • Participants 2
  • Labels 6
100% complete
Start date: May 18, 2020
Due date: Jun 17, 2020 (Past due)
Work items: Open 0, Closed 18
Merge requests: Open 0, Closed 0, Merged 0
Releases: None
Reference: hpc-team/HPCasCode%"Automatic rolling node updates"