Milestone (Expired)
May 18, 2020 – Jun 17, 2020
Automatic rolling node updates
The intention is to automate ComputeNode updates and changes. We have to define two new kinds of change. A ColdChange (also referred to as a DrainChange below) can only be applied to nodes that have been drained beforehand, otherwise jobs would fail; for example, anything involving a reboot is a ColdChange. A HotChange, e.g. a package install or most security updates, should not affect running jobs. Changes requiring a full outage, such as unmounting the main filesystem, are out of scope.
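The two-way classification above can be sketched as a small helper. This is a minimal, hypothetical sketch: it assumes playbook tasks carry a `HotChange` or `DrainChange` tag (the tag names and the `classify` function are illustrative, not an existing implementation).

```python
from enum import Enum

class ChangeKind(Enum):
    HOT = "HotChange"      # safe while jobs run, e.g. package installs, most security updates
    DRAIN = "DrainChange"  # node must be drained first, e.g. anything involving a reboot

def classify(tags):
    """Return the most disruptive change kind among a task's tags.

    Hypothetical helper: tasks are assumed to be tagged with the
    change kind; untagged tasks are rejected as a safety measure.
    """
    if "DrainChange" in tags:
        return ChangeKind.DRAIN
    if "HotChange" in tags:
        return ChangeKind.HOT
    raise ValueError(f"untagged change: {sorted(tags)}")
```

A task tagged with both kinds is treated as the more disruptive one, so a mixed playbook always ends up on the drain path.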
From a higher-level perspective this can be thought of as "similar to what Puppet does", but aware of Slurm and its jobs.
Rollout phases:

Phase | Planned work (features to be assigned to a version/stage)
---|---
Phase 1 | On the test cluster only, perform HotChanges automatically; ensure zero jobs fail.
Phase 1 | On M3, perform canary testing and HotChanges automatically with nodelist constraints.
Phase 2 | On the test cluster, work on feature completion, rolling updates towards CentOS 7.7, and full documentation.
Phase 3 | Get ops-team buy-in AND support.
Phase 3 | On M3, full HotChange rollout.
Phase 3 | On M3, limited DrainChange rollout with an approved rolling heuristic.
Phase 3 | On M3, full DrainChange rollout.
V2.0 |
All issues for this milestone are closed. You may close this milestone now.
Unstarted Issues (open and unassigned): 0
Ongoing Issues (open and assigned): 0
Completed Issues (closed): 18
- hypervisor maintenance
- humans do not work at night
- run ansible --check and output inconsistencies for consumption by autoupdate
- automated hypervisor_upgrade while rolling upgrades
- (to be discussed) Develop a heuristic to maximize uptime and minimize user impact
- Documentation
- think about a max-retry limit, e.g. stop trying after 5 failures
- a mechanism to pause the whole thing during maintenance or while bugfixing a playbook
- a mechanism to update the repository and playbook
- a mechanism to update the nodelist online, so nodes can be added while the mechanism is running
- Canary testing: on a new commit (remember the SHA), test the canary nodelist first; only when that passes, roll out widely
- Host blacklist, whitelist, and canary list (crash if a node is blacklisted AND on the white or canary list), as an extra safety mechanism for e.g. SQL and management nodes
- hook up a change detector, capable of returning ERROR, GOOD, HotChange-nodelist or DrainChange-nodelist
- use ansible vault
- autoupdate test-cases
- autoupdate slack hookup
- playbook or role tagging with HotChange and DrainChange
- Software Architecture discussion and documentation
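Several of the issues above (the list-based safety check, canary-first testing, and the max-retry limit) can be sketched together. This is a hedged illustration only: the function names, the `apply_change` callback, and the default of 5 failures are assumptions for the sketch, not the actual implementation.

```python
def check_lists(blacklist, whitelist, canary_list):
    """Extra safety mechanism for e.g. SQL and management nodes:
    crash if a node is blacklisted AND on the white or canary list."""
    overlap = set(blacklist) & (set(whitelist) | set(canary_list))
    if overlap:
        raise RuntimeError(f"node(s) both blacklisted and targeted: {sorted(overlap)}")

def rollout(nodes, canary_list, apply_change, max_failures=5):
    """Apply a change to the canary nodes first; only when every canary
    passes, continue with the remaining nodelist, stopping after
    max_failures failed nodes (the max-retry limit discussed above)."""
    failures = 0
    ordered = list(canary_list) + [n for n in nodes if n not in canary_list]
    for node in ordered:
        try:
            apply_change(node)  # e.g. run the tagged playbook against this node
        except Exception:
            failures += 1
            # any canary failure aborts the rollout; otherwise stop at the limit
            if node in canary_list or failures >= max_failures:
                raise
```

The canary nodes are deliberately placed first in the ordered list, so a bad commit is caught before it can touch the bulk of the cluster.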