HPCasCode issues
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/issues

Issue #1: Syntax
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/issues/1
Author: Gin Tan. Assignee: Chris Hines. Updated: 2020-05-15.

@chines
https://gitlab.erc.monash.edu.au/hpc-team/ansible_cluster_in_a_box/blob/master/roles/rsyslog_client/tasks/main.yml#L10
This should use apt.

Issue #12: playbook or role tagging with HotChange and DrainChange
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/issues/12
Author: Andreas Hamacher. Assignee: Andreas Hamacher. Milestone: Automatic rolling node updates. Updated: 2020-05-25.

Issue #13: autoupdate slack hookup
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/issues/13
Author: Andreas Hamacher. Milestone: Automatic rolling node updates. Updated: 2021-09-29.

Issue #14: autoupdate test-cases
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/issues/14
Author: Andreas Hamacher. Milestone: Automatic rolling node updates. Updated: 2021-09-29.

Issue #15: use ansible vault
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/issues/15
Author: Andreas Hamacher. Assignee: Andreas Hamacher. Milestone: Automatic rolling node updates. Updated: 2021-09-29.

Trung wants this very early in the process.
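Ansible Vault slots in with almost no change to the plays themselves. A minimal sketch, assuming a vault-encrypted group_vars/all/vault.yml that defines slurmdb_password (both names are illustrative, not from the issue):

    # create or edit the encrypted file (prompts for the vault password):
    #   ansible-vault create group_vars/all/vault.yml
    # then run plays with --ask-vault-pass or --vault-password-file
    - hosts: all
      vars_files:
        - group_vars/all/vault.yml   # decrypted transparently at runtime
      tasks:
        - name: prove the secret is available without printing it
          debug:
            msg: "slurmdb_password is {{ slurmdb_password | length }} characters long"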
Issue #16: hook up a change detector, capable of returning ERROR, GOOD, HotChange-nodelist or DrainChange-nodelist
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/issues/16
Author: Andreas Hamacher. Assignee: Chris Hines. Milestone: Automatic rolling node updates. Updated: 2020-06-16.

Issue #17: host blacklist, whitelist and canary list
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/issues/17
Author: Andreas Hamacher. Milestone: Automatic rolling node updates. Updated: 2021-09-29.

Crash if a node is blacklisted AND (whitelisted OR canary), just as an extra safety mechanism, e.g. for SQL and management nodes.

Issue #18: canary testing
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/issues/18
Author: Andreas Hamacher. Milestone: Automatic rolling node updates. Updated: 2021-09-29.

On a new commit (remember its SHA), test the canary nodelist first; only when that passes, roll out widely.

Issue #21: a mechanism to pause the whole thing during maintenance or while bugfixing a playbook
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/issues/21
Author: Andreas Hamacher. Milestone: Automatic rolling node updates. Updated: 2021-09-29.

Issue #19: a mechanism to update the nodelist online, to be able to add nodes to the mechanism
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/issues/19
Author: Andreas Hamacher. Milestone: Automatic rolling node updates. Updated: 2021-09-29.

Issue #20: a mechanism to update the repository and playbook
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/issues/20
Author: Andreas Hamacher. Milestone: Automatic rolling node updates. Updated: 2021-09-29.

Issue #22: think about max-retry time, e.g. stop trying after 5 failures
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/issues/22
Author: Andreas Hamacher. Milestone: Automatic rolling node updates. Updated: 2021-09-29.

Issue #5: pam_slurm_adopt should install on version mismatch
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/issues/5
Author: Chris Hines. Updated: 2019-11-25.

Issue #3: the PID file paths in the systemd unit files for slurm do not match the paths in slurm.conf and slurmdbd.conf
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/issues/3
Author: Chris Hines. Updated: 2019-12-11.

This is hard to resolve since the *.conf files are stored in the cluster-specific repo, but the templates for the services are in this repo. Perhaps, as part of installing the service files, we can cat the conf files and store the value.
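The "cat the conf files and store the value" idea could look roughly like the tasks below. This is a sketch only, assuming slurm.conf already sits on the target at /opt/slurm/etc/slurm.conf and that a hypothetical slurmctld.service.j2 references slurm_pidfile:

    - name: read slurm.conf from the node        # path is an assumption
      slurp:
        src: /opt/slurm/etc/slurm.conf
      register: slurm_conf_raw

    - name: store the SlurmctldPidFile value as a fact
      set_fact:
        slurm_pidfile: >-
          {{ slurm_conf_raw.content | b64decode
             | regex_search('^SlurmctldPidFile=(.*)$', '\\1', multiline=True)
             | first }}

    - name: install the unit file with a matching PIDFile
      template:
        src: slurmctld.service.j2                # contains PIDFile={{ slurm_pidfile }}
        dest: /etc/systemd/system/slurmctld.service
      become: true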
Issue #2: role duplication?
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/issues/2
Author: Andreas Hamacher. Assignee: Gin Tan. Updated: 2020-01-28.

At least this template exists twice:

    git diff ./roles/slurm-common/templates/job_submit.lua.j2 ./roles/slurm_config/templates/job_submit.lua.j2

I reckon the underlying problem is that one of the roles slurm-common and slurm_config is "old".

Issue #6: Make sure ldap servers have larger size counts
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/issues/6
Author: Chris Hines. Updated: 2020-02-21.

https://confluence.atlassian.com/crowdkb/openldap-only-synchronizes-500-user-942838720.html
Issue #4: Testing pipelines
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/issues/4
Author: Chris Hines. Assignee: Andreas Hamacher. Updated: 2019-12-11.

1) Yaml lint: all commits, all branches.
2) Heat stack create/delete and ansible: only on commits to master.

Sketch of the pipeline:

    stages:
      - lint
      - deploy     # heat and ansible
      - build      # install software (/usr/local)
      - provision  # create user accounts
      - test       # submit a job
      - destroy

    lint:
      stage: lint
      tags:
        - linter
      script:
        - yamllint .

    deploy:
      stage: deploy
      tags:
        - e2e
      only:
        - master@hpc-team/ansible_cluster_in_a_box
      script:
        - pip install python-openstackclient
        - openstack stack create ...   # bring up the test cluster with heat
        - wait_for_head
        - ansible-playbook ...

    test:
      stage: test
Issue #7: CICD Lint stage
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/issues/7
Author: Andreas Hamacher. Assignee: Chris Hines. Updated: 2020-02-20.

Contacts: Luhan, Chris, Andreas.
@Chris I want you to tell the others what to do.
Let's define three problems first and pick them apart later:
1) The lint stage fails because of lint problems.
   Fix: fix the lint problems (Luhan is happy to do so).
   Current workaround: allow the stage to fail.
2) We currently lint by traversing the master playbook; this does not cover all roles. Maybe there is another traversal mechanism. Best case, we lint all files BUT do not fix outdated ones!
3) The master playbook does not represent the roles in use and/or in production. How do we map this? Do we fix all of them?
Issue #31: hypervisor maintenance
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/issues/31
Author: Andreas Hamacher. Milestone: Automatic rolling node updates. Updated: 2020-09-03.

* modify the inventory to be able to contact the hypervisor during autoupdate-drain
* come up with an authentication mechanism across VMs and hypervisors, e.g. hpc_ca is not present on the hypervisors
* get in contact with the cloud team again as soon as this mechanism exists
* question for later: do we need to modify VM state, for example if we want to reboot the hypervisor?
Issue #33: edge cases for COLD-drain-jobs and auto-resumes
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/issues/33
Author: Andreas Hamacher. Assignee: Andreas Hamacher. Updated: 2020-09-03.

Reported by LW. This is a keep-in-mind issue:
* training reservations
* last-minute management requests
* Nectar security upgrades which need to be done NOW

Issue #27: automated hypervisor_upgrade during rolling upgrades
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/issues/27
Author: Andreas Hamacher. Milestone: Automatic rolling node updates. Updated: 2020-09-03.

While a node is reserved and drained, have a callback for a cloud-team playbook to maintain the hypervisor.
Issue #8: feature request: md5sum comparator
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/issues/8
Author: Andreas Hamacher. Assignee: Chris Hines. Updated: 2020-04-02.

Hey, can I ask you for another favour (just say no if you don't want to)?
I want a generic script which takes 1 or 2 parameters:

    script file/on/node [file/in/repo]

It returns true if the md5sum of file/on/node is equal on all nodes and, when the second parameter is given, also identical to the md5sum of file/in/repo; otherwise false.
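In Ansible the comparator could be a short play instead of a script. A sketch under the same contract (node_file and repo_file stand in for the two parameters; both names are invented here):

    - hosts: all
      vars:
        node_file: /etc/slurm/slurm.conf   # parameter 1: file on the nodes (example value)
        repo_file: ""                      # parameter 2, optional: file on the control host
      tasks:
        - name: md5 the file on every node
          stat:
            path: "{{ node_file }}"
            checksum_algorithm: md5
          register: node_md5

        - name: md5 the repo copy on the control host, if given
          stat:
            path: "{{ repo_file }}"
            checksum_algorithm: md5
          delegate_to: localhost
          run_once: true
          register: repo_md5
          when: repo_file | length > 0

        - name: true/false verdict
          run_once: true
          assert:
            that:
              # all nodes agree with each other...
              - (ansible_play_hosts | map('extract', hostvars, ['node_md5', 'stat', 'checksum']) | list | unique | length) == 1
              # ...and with the repo copy, when a second parameter was given
              - repo_file | length == 0 or node_md5.stat.checksum == repo_md5.stat.checksum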
Issue #24: (to be discussed) Develop a heuristic to maximize uptime and minimize user impact
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/issues/24
Author: Andreas Hamacher. Milestone: Automatic rolling node updates. Updated: 2020-09-03.

Issue #9: ansible restart improvements
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/issues/9
Author: Andreas Hamacher. Assignee: Andreas Hamacher. Updated: 2020-09-03.

Andreas H 12:32 PM
is there a really good example of a node reboot somewhere in our ansible work?

lancew:koala: 12:37 PM
I'd like to see one too. If there is anything wrong with lustre or its modules, the reboot fails and requires a hard reboot from openstack.

chines 3:23 PM
@Andreas H... not really. The old pattern looks like this:

    - name: Restart host to remove nouveau module
      shell: "sleep 2 && shutdown -r now &"
      async: 1
      poll: 1
      become: true
      ignore_errors: true
      when: modules_result.stdout.find('nouveau') != -1

    - name: Wait for host to reboot
      local_action: wait_for host="{{ inventory_hostname }}" search_regex=OpenSSH port=22 delay=60 timeout=900

but I think the new pattern looks like:

    - name: restart machine
      shell: "sleep 5; sudo shutdown -r now"
      async: 2
      poll: 1
      ignore_errors: true
      become: true
      become_user: root
      when: reboot_now

    - name: waiting for server to come back
      wait_for_connection: sleep=60 timeout=600 delay=60
      when: reboot_now

chines 3:24 PM
the async directive allows it to disconnect before the process completes. wait_for_connection is better than local_action wait_for.

chines 3:24 PM
but as Lance mentions, the restart (shutdown -r now) will fail under some conditions.

lancew:koala: 3:45 PM
Oh, I also forgot any hanging NFS file opens/writes, of which we have significant numbers. I haven't been checking recently, but I was tracking on the order of hundreds per day in the home folders.
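Since Ansible 2.7 the built-in reboot module folds the shutdown, the disconnect and the wait into a single task, which is probably the pattern to migrate to. A sketch, with the timeout chosen to match the wait_for_connection numbers above:

    - name: restart machine and wait for it to come back
      reboot:
        reboot_timeout: 600     # seconds to wait for a working connection
        test_command: uptime    # must succeed before the host counts as up
      become: true
      when: reboot_now

It does not help with the failure mode Lance describes: a node wedged on lustre modules or hanging NFS writes still needs a hard reboot from openstack.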
Issue #10: TODO remove m3 assumptions in this repository
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/issues/10
Author: Andreas Hamacher. Updated: 2020-05-17.

    # if not defined, default to M3 = vlan 114;
    # see https://webnet.its.monash.edu.au/cgi-bin/staff-only/netsee
    - set_fact: PRIVATE_NETWORK_CIDR="172.16.200.0/21"
      when: PRIVATE_NETWORK_CIDR is undefined

Issue #11: Software Architecture discussion and documentation
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/issues/11
Author: Andreas Hamacher. Assignee: Andreas Hamacher. Milestone: Automatic rolling node updates. Updated: 2020-05-19.

Issue #23: Documentation
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/issues/23
Author: Andreas Hamacher. Milestone: Automatic rolling node updates. Updated: 2021-09-29.

Issue #25: Readme improvements
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/issues/25
Author: Andreas Hamacher. Updated: 2020-05-20.

Please add comments here.

Issue #28: enable kernel version pinning for ubuntu
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/issues/28
Author: Andreas Hamacher. Milestone: merge back cvl@uwa and have it as a valid third cluster on the pipeline. Updated: 2020-06-11.

/roles/upgrade/tasks/main.yml has been rewritten for yum modules but not for apt.
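For the apt side, one plausible shape is a dpkg hold on the kernel meta-packages so a blanket apt upgrade cannot pull in a new kernel. A sketch only; the package list is an assumption, not taken from the role:

    - name: hold the Ubuntu kernel meta-packages
      dpkg_selections:
        name: "{{ item }}"
        selection: hold
      loop:
        - linux-generic
        - linux-image-generic
        - linux-headers-generic
      become: true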
Issue #29: run ansible --check and output inconsistencies for consumption by autoupdate
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/issues/29
Author: Andreas Hamacher. Assignee: Andreas Hamacher. Milestone: Automatic rolling node updates. Updated: 2020-06-12.

Output a list of nodes requiring a ColdChange.
Output a list of nodes requiring a HotChange.
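Wired into the pipeline, the detector could be a job that dry-runs the site playbook and parses the recap. Everything below (job name, playbook path, the awk heuristic) is illustrative, and classifying changed nodes into Hot vs Cold is the open part of this issue:

    detect_changes:
      stage: test
      script:
        - ansible-playbook -i inventory site.yml --check --diff | tee check.log
        # recap lines look like "node01 : ok=12 changed=3 ..."; keep hosts with changes
        - awk '/changed=[1-9]/ {print $1}' check.log > nodes_requiring_change.txt
      artifacts:
        paths:
          - check.log
          - nodes_requiring_change.txt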
Issue #30: humans do not work at night
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/issues/30
Author: Andreas Hamacher. Milestone: Automatic rolling node updates. Updated: 2021-09-29.

* only run autoupdate during working hours
* ping on slack; OPS needs to be in the information loop
* even if the drain completes at 20:00, do not start before 8:30 am
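A gate at the top of the autoupdate play could enforce this. A sketch, with the 09:00-17:00 Monday-to-Friday window as an assumed policy (the issue itself only names 8:30 am):

    - name: refuse to start outside working hours   # requires gathered facts
      assert:
        that:
          - ansible_date_time.hour | int >= 9
          - ansible_date_time.hour | int < 17
          - (ansible_date_time.weekday_number | int) in [1, 2, 3, 4, 5]   # Mon-Fri
        fail_msg: "humans do not work at night; retry between 09:00 and 17:00, Mon-Fri"
      run_once: true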
Issue #34: NVME disk not mounted
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/issues/34
Author: Chris Hines. Updated: 2021-10-28.

When we have NVMe disks in VMs we attempt to use them as the SPANK private tmpdir, BUT there is an assumption that they are mounted on /mnt/nvme, which is not true for all images/flavours that include NVMe.

Issue #35: role slurmdb-config requires committed password
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/issues/35
Author: Andreas Hamacher. Updated: 2022-06-02.

The following should be a template:

    - name: install slurmdb.conf
      copy:
        src: files/slurmdbd.conf
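A minimal sketch of the templated version, assuming a vault-encrypted slurmdb_password variable and a slurmdbd.conf.j2 whose StoragePass line references it (both names are assumptions):

    - name: install slurmdbd.conf
      template:
        src: templates/slurmdbd.conf.j2     # contains StoragePass={{ slurmdb_password }}
        dest: /opt/slurm/etc/slurmdbd.conf  # destination path is a placeholder
        owner: slurm
        group: slurm
        mode: "0600"
      become: true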
Issue #36: SSSD cache sometimes unreliable and needing a restart
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/issues/36
Author: Andreas Hamacher. Updated: 2022-09-20.

@chines If you find the time, maybe have a look; if not, that's ok.
Cryospark died when writing this.
A monarch-mgmt node died as well quite recently.
Issue #37: cgroup_allowed_devices.conf does not exist and should not
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/issues/37
Author: Andreas Hamacher. Assignee: Andreas Hamacher. Updated: 2022-09-24.

    [USERNAME@hi00 conf]$ cat cgroup.conf
    CgroupAutomount=yes
    ConstrainDevices=yes
    ConstrainCores=yes
    ConstrainRAMSpace=yes
    ConstrainKmemSpace=no
    AllowedDevicesFile=/opt/slurm-22.05.3/etc/cgroup_allowed_devices.conf

/opt/slurm-22.05.3/etc/cgroup_allowed_devices.conf does not exist.
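If the fix is simply to stop shipping the dangling setting, a one-task sketch (the path comes from the issue; the task itself is an assumption):

    - name: drop the dangling AllowedDevicesFile line from cgroup.conf
      lineinfile:
        path: /opt/slurm-22.05.3/etc/cgroup.conf
        regexp: '^AllowedDevicesFile='
        state: absent
      become: true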