HPCasCode issues
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/issues
Updated: 2022-09-24T09:26:06+10:00

cgroup_allowed_devices.conf does not exist and should not
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/issues/37
Updated: 2022-09-24T09:26:06+10:00
Author: Andreas Hamacher

[USERNAME@hi00 conf]$ cat cgroup.conf
CgroupAutomount=yes
ConstrainDevices=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainKmemSpace=no
AllowedDevicesFile=/opt/slurm-22.05.3/etc/cgroup_allowed_devices.conf

/opt/slurm-22.05.3/etc/cgroup_allowed_devices.conf does not exist.

Assignee: Andreas Hamacher

SSSD cache sometimes unreliable and needing a restart
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/issues/36
Updated: 2022-09-20T12:19:13+10:00
Author: Andreas Hamacher

@chines If you find the time, maybe have a look; if not, that's OK.
Cryospark died when writing this.
A monarch-mgmt died as well quite recently.

role slurmdb-config requires committed password
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/issues/35
Updated: 2022-06-02T16:05:46+10:00
Author: Andreas Hamacher

The following should be a template:
- name: install slurmdb.conf
  copy:
    src: files/slurmdbd.conf

NVME disk not mounted
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/issues/34
Updated: 2021-10-28T10:16:51+11:00
Author: Chris Hines

When we have NVMe disks in VMs, we attempt to use them as the spank private tmpdir, but this assumes that they are mounted on /mnt/nvme, which is not true for all images/flavours that include NVMe.

edge cases for COLD-drain-jobs and auto-resumes
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/issues/33
Updated: 2020-09-03T10:40:50+10:00
Author: Andreas Hamacher

Reported by LW. This is a "keep in mind" issue.
Training reservations
Last minute management requests
Nectar upgrades for security which need to be done NOW

Assignee: Andreas Hamacher

hypervisor maintenance
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/issues/31
Updated: 2020-09-03T10:40:46+10:00
Author: Andreas Hamacher

* modify the inventory to be able to contact the hypervisor during autoupdate-drain
* come up with an authentication mechanism across VMs and hypervisors, e.g. hpc_ca is not present on the hypervisors
* get in contact with cloud again as soon as this mechanism exists
* question for later: do we need to modify VM state, for example if we want to reboot the hypervisor?

Milestone: Automatic rolling node updates

humans do not work at night
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/issues/30
Updated: 2021-09-29T17:19:38+10:00
Author: Andreas Hamacher

* only run autoupdate during working hours
* ping on Slack. OPS needs to be in the information loop.
* even if the drain is at 20:00, do not start before 8:30 am

Milestone: Automatic rolling node updates

run ansible --check and output inconsistencies for consumption by autoupdate
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/issues/29
Updated: 2020-06-12T09:20:20+10:00
Author: Andreas Hamacher

output a list of nodes requiring a ColdChange
output a list of nodes requiring a HotChange

Milestone: Automatic rolling node updates
Assignee: Andreas Hamacher

enable kernel version pinning for ubuntu
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/issues/28
Updated: 2020-06-11T11:07:26+10:00
Author: Andreas Hamacher

/roles/upgrade/tasks/main.yml has been rewritten for yum modules but not for apt

Milestone: merge back cvl@uwa and have it as a valid third cluster on the pipeline

automated hypervisor_upgrade while rolling upgrades
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/issues/27
Updated: 2020-09-03T10:40:40+10:00
Author: Andreas Hamacher

While a node is reserved and drained, have a callback for a cloud-team playbook to maintain the hypervisor.

Milestone: Automatic rolling node updates

Readme improvements
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/issues/25
Updated: 2020-05-20T17:18:35+10:00
Author: Andreas Hamacher

Please add comments here

(to be discussed) Develop a heuristic to maximize uptime and minimize user impact
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/issues/24
Updated: 2020-09-03T10:40:29+10:00
Author: Andreas Hamacher
Milestone: Automatic rolling node updates

Documentation
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/issues/23
Updated: 2021-09-29T17:20:15+10:00
Author: Andreas Hamacher
Milestone: Automatic rolling node updates

think about max-retry time e.g. stop trying after 5 failures
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/issues/22
Updated: 2021-09-29T17:19:38+10:00
Author: Andreas Hamacher
Milestone: Automatic rolling node updates

a mechanism to pause the whole thing during maintenance or while bugfixing a playbook
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/issues/21
Updated: 2021-09-29T17:19:37+10:00
Author: Andreas Hamacher
Milestone: Automatic rolling node updates

a mechanism to update the repository and playbook
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/issues/20
Updated: 2021-09-29T17:19:36+10:00
Author: Andreas Hamacher
Milestone: Automatic rolling node updates

a mechanism to online update the nodelist to be able to add nodes to the mechanism
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/issues/19
Updated: 2021-09-29T17:19:37+10:00
Author: Andreas Hamacher
Milestone: Automatic rolling node updates

Canary testing: new commit (remember sha), test canary nodelist first, only when that one passes then roll out widely
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/issues/18
Updated: 2021-09-29T17:19:38+10:00
Author: Andreas Hamacher
Milestone: Automatic rolling node updates

Host blacklist, whitelist and canary list (crash if a node is blacklisted AND (white or canary)) just as an extra safety mechanism for e.g. sql and mgmt nodes
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/issues/17
Updated: 2021-09-29T17:19:38+10:00
Author: Andreas Hamacher
Milestone: Automatic rolling node updates

hook up a change detector, capable of returning ERROR, GOOD, HotChange-nodelist or DrainChange-nodelist
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/issues/16
Updated: 2020-06-16T09:27:47+10:00
Author: Andreas Hamacher
Milestone: Automatic rolling node updates
Assignee: Chris Hines
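
The safety rule from the host blacklist, whitelist and canary list issue (issue 17: crash if a node is blacklisted AND (white or canary)) could be sketched roughly as below. This is only an illustration and not code from the HPCasCode repository: the function name `check_host_lists`, the exception `HostListConflict` and the example host names are all hypothetical.

```python
class HostListConflict(Exception):
    """Raised when a host is blacklisted and also white- or canary-listed."""


def check_host_lists(blacklist, whitelist, canarylist):
    """Return the set of hosts eligible for updates, or raise on conflict.

    Hypothetical helper: a host that is blacklisted AND (whitelisted OR
    canary-listed) is treated as a configuration error, so we stop early.
    """
    black = set(blacklist)
    allowed = set(whitelist) | set(canarylist)
    conflicts = black & allowed
    if conflicts:
        raise HostListConflict(
            "blacklisted hosts also white/canary-listed: "
            + ", ".join(sorted(conflicts))
        )
    return allowed


if __name__ == "__main__":
    # Hypothetical host names, for illustration only.
    print(sorted(check_host_lists(["sql0", "mgmt0"], ["node01", "node02"], ["node00"])))
    try:
        check_host_lists(["mgmt0"], ["node01", "mgmt0"], ["node00"])
    except HostListConflict as err:
        print("refusing to run:", err)
```

Raising an exception rather than silently dropping the conflicting host matches the "crash" wording of the issue: a node such as an sql or mgmt node appearing on both lists is more likely a mistake in the lists than something to resolve automatically.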