HPCasCode issues
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/issues
2022-06-02T16:05:46+10:00

https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/issues/35
role slurmdb-config requires committed password
2022-06-02T16:05:46+10:00
Andreas Hamacher

The following should be a template:
- name: install slurmdb.conf
  copy:
    src: files/slurmdbd.conf

https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/issues/23
Documentation
2021-09-29T17:20:15+10:00
Andreas Hamacher
Automatic rolling node updates

https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/issues/18
Canary testing: new commit (remember the SHA), test the canary nodelist first, and only when that one passes roll out widely
2021-09-29T17:19:38+10:00
Andreas Hamacher
Automatic rolling node updates

https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/issues/17
Host blacklist, whitelist and canarylist (crash if a node is blacklisted AND (white or canary)), just as an extra safety mechanism for e.g. sql and mgmt nodes
2021-09-29T17:19:38+10:00
Andreas Hamacher
Automatic rolling node updates

https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/issues/30
humans do not work at night
2021-09-29T17:19:38+10:00
Andreas Hamacher

* only run autoupdate during working hours
* ping on Slack; OPS needs to be in the information loop
* even if the drain is at 20:00, do not start before 8:30 am

Automatic rolling node updates

https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/issues/22
think about max-retry time, e.g. stop trying after 5 failures
2021-09-29T17:19:38+10:00
Andreas Hamacher
Automatic rolling node updates

https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/issues/19
a mechanism to update the nodelist online, so that nodes can be added to the mechanism
2021-09-29T17:19:37+10:00
Andreas Hamacher
Automatic rolling node updates

https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/issues/21
a mechanism to pause the whole thing during maintenance or while bugfixing a playbook
2021-09-29T17:19:37+10:00
Andreas Hamacher
Automatic rolling node updates

https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/issues/14
autoupdate test-cases
2021-09-29T17:19:37+10:00
Andreas Hamacher
Automatic rolling node updates

https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/issues/13
autoupdate slack hookup
2021-09-29T17:19:37+10:00
Andreas Hamacher
Automatic rolling node updates

https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/issues/20
a mechanism to update the repository and playbook
2021-09-29T17:19:36+10:00
Andreas Hamacher
Automatic rolling node updates

https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/issues/15
use ansible vault
2021-09-29T17:17:55+10:00
Andreas Hamacher

Trung wants it very early in the process.

Automatic rolling node updates
Andreas Hamacher

https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/issues/33
edgecases for COLD-drain-jobs and auto-resumes
2020-09-03T10:40:50+10:00
Andreas Hamacher

Reported by LW. This is a "keep in mind" issue:
* Training reservations
* Last minute management requests
* Nectar upgrades for security which need to be done NOW

Andreas Hamacher

https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/issues/31
hypervisor maintenance
2020-09-03T10:40:46+10:00
Andreas Hamacher

* modify the inventory to be able to contact the hypervisor during autoupdate-drain
* come up with an authentication mechanism across VMs and hypervisors, e.g. hpc_ca is not present on the hypervisors
* get in contact with cloud again as soon as this mechanism exists
* question for later: do we need to modify VM state, for example if we want to reboot the hypervisor?

Automatic rolling node updates

https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/issues/27
automated hypervisor_upgrade while rolling upgrades
2020-09-03T10:40:40+10:00
Andreas Hamacher

While a node is reserved and drained, have a callback for a cloud-team playbook to maintain the hypervisor.

Automatic rolling node updates

https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/issues/9
ansible restart improvements
2020-09-03T10:40:37+10:00
Andreas Hamacher

Andreas H 12:32 PM
is there a really good example of a node reboot somewhere in our ansible work ?
lancew:koala: 12:37 PM
I'd like to see one too. If there is anything wrong with lustre or its modules the reboot fails and requires a hard reboot from openstack.
chines 3:23 PM
@Andreas H... not really. The old pattern looks like this
- name: Restart host to remove nouveau module
  shell: "sleep 2 && shutdown -r now &"
  async: 1
  poll: 1
  become: true
  ignore_errors: true
  when: modules_result.stdout.find('nouveau') != -1

- name: Wait for host to reboot
  local_action: wait_for host="{{ inventory_hostname }}" search_regex=OpenSSH port=22 delay=60 timeout=900
but I think the new pattern looks like
- name: restart machine
  shell: "sleep 5; sudo shutdown -r now"
  async: 2
  poll: 1
  ignore_errors: true
  become: true
  become_user: root
  when: reboot_now

- name: waiting for server to come back
  wait_for_connection: sleep=60 timeout=600 delay=60
  when: reboot_now
3:24
the async directive allows it to disconnect before the process completes. wait_for_connection is better than local_action wait_for
3:24
but as lance mentions the restart (shutdown -r now) will fail under some conditions
lancew:koala: 3:45 PM
Oh also forgot any hanging NFS file open/writes, which we have significant numbers of. I haven't been checking recently but I was tracking on the order of hundreds per day in the home folders

Andreas Hamacher

https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/issues/24
(to be discussed) Develop a heuristic to maximize uptime and minimize user impact
2020-09-03T10:40:29+10:00
Andreas Hamacher
Automatic rolling node updates

https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/issues/16
hook up a change detector, capable of returning ERROR, GOOD, HotChange-nodelist or DrainChange-nodelist
2020-06-16T09:27:47+10:00
Andreas Hamacher
Automatic rolling node updates
Chris Hines

https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/issues/29
run ansible --check and output inconsistencies for consumption by autoupdate
2020-06-12T09:20:20+10:00
Andreas Hamacher

output a list of nodes requiring a ColdChange
output a list of nodes requiring a HotChange

Automatic rolling node updates
Andreas Hamacher

https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/issues/12
playbook or role tagging with HotChange and DrainChange
2020-05-25T20:17:46+10:00
Andreas Hamacher
Automatic rolling node updates
Andreas Hamacher
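
Two of the safety rules requested in the issues above (issue 17: crash if a node is blacklisted AND also on the whitelist or canary list; issue 30: only run autoupdate during working hours) could be sketched roughly as follows. This is only an illustration, not the project's implementation: the names `HostListError`, `check_host_lists` and `within_working_hours` are hypothetical, the 17:00 end of the working day and the weekday-only restriction are assumptions, and only the 8:30 start comes from the issue text.

```python
from datetime import datetime, time


class HostListError(Exception):
    """Raised when a host is blacklisted and also on another list."""


def check_host_lists(blacklist, whitelist, canarylist):
    """Abort if any blacklisted host is also whitelisted or a canary."""
    conflicts = set(blacklist) & (set(whitelist) | set(canarylist))
    if conflicts:
        raise HostListError(
            "blacklisted hosts also listed elsewhere: "
            + ", ".join(sorted(conflicts))
        )


# Assumed working hours: the 8:30 start is from issue 30,
# the 17:00 end is a placeholder chosen for this sketch.
WORK_START = time(8, 30)
WORK_END = time(17, 0)


def within_working_hours(now):
    """Return True if `now` falls on Monday to Friday within working hours."""
    is_weekday = now.weekday() < 5  # assumption: no weekend runs
    return is_weekday and WORK_START <= now.time() < WORK_END


if __name__ == "__main__":
    # Hypothetical host names, for illustration only.
    check_host_lists(["sql1"], ["node1", "node2"], ["node3"])
    print(within_working_hours(datetime.now()))
```

A caller would run `check_host_lists` once before touching any node, so that a misconfigured list stops the whole run, and would consult `within_working_hours` before starting work on a drained node, even if the drain completed the evening before.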