HPCasCode merge requests
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/merge_requests
Updated: 2022-01-17T17:06:12+11:00

Mig
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/merge_requests/521
Updated: 2022-01-17T17:06:12+11:00
Author: Chris Hines
Assignee: Andreas Hamacher

Add pam_slurm_adopt for ubuntu nodes
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/merge_requests/520
Updated: 2021-10-29T09:14:35+11:00
Author: Chris Hines

Mount nvme disks on /mnt/nvme
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/merge_requests/519
Updated: 2021-10-29T09:01:04+11:00
Author: Chris Hines
addresses https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/issues/34

modifications to playbooks because 1. we're not using ldap 2. we're mounting...
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/merge_requests/518
Updated: 2021-10-29T10:15:36+11:00
Author: Chris Hines
modifications to playbooks because 1. we're not using ldap 2. we're mounting all the filesystems in the filesystems_playbook ahead of these
Assignee: Andreas Hamacher

modifications to playbooks because 1. we're not using ldap 2. we're mounting...
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/merge_requests/517
Updated: 2021-10-29T10:15:42+11:00
Author: Chris Hines
modifications to playbooks because 1. we're not using ldap 2. we're mounting all the filesystems in the filesystems_playbook ahead of these
Assignee: Andreas Hamacher

Draft: Resolve "NVME disk not mounted"
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/merge_requests/516
Updated: 2021-10-28T09:27:38+11:00
Author: Chris Hines
Closes #34
Assignee: Chris Hines

Mellanox config
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/merge_requests/515
Updated: 2021-11-09T13:21:36+11:00
Author: Andreas Hamacher
factoring out mellanox_config from mellanox_install
TODO:
```
Host - Configuration of MLNX NIC:
Note: Disable Global Pause on all NICs
Step 1 - Set QoS parameters
### Use all commands and setup a startup script to apply on boot
Set DSCP (L3) as trust mode for the NIC
```
- [x] # mlnx_qos -i <interface> --trust dscp
```
Set ToS to 106 (DSCP 26) for ALL RoCE traffic (Note: This command is nonpersistent)
```
- [x] # echo 106 > /sys/class/infiniband/<mlx-device>/tc/1/traffic_class
```
Set the RDMA-CM ToS to 106 (DSCP 26) (Note: This command is nonpersistent)
```
- [x] # cma_roce_tos -d <mlx-device> -t 106
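The ToS value 106 used in the two steps above is DSCP 26 placed in the upper six bits of the ToS byte, with the two low (ECN) bits set to binary 10. A minimal shell sketch of that arithmetic follows; the function name `dscp_to_tos` is hypothetical and exists only for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not part of HPCasCode): build a ToS byte
# from a 6-bit DSCP value and a 2-bit ECN field.
dscp_to_tos() {
    dscp=$1
    ecn=${2:-2}   # default ECN field: binary 10
    # DSCP occupies the upper six bits, ECN the lower two.
    echo $(( (dscp << 2) | ecn ))
}
```

Calling `dscp_to_tos 26` prints 106, the value passed to `traffic_class` and `cma_roce_tos` above.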
```
Enable ECN for TCP traffic (Note: This command is nonpersistent)
```
- [x] # sysctl -w net.ipv4.tcp_ecn=1
```
Step 2 - Enable PFC on RoCE priority
Activate PFC on priority 3
Using mlnx_qos tool (Note: This command is nonpersistent):
```
- [x] # mlnx_qos -i <interface> --pfc 0,0,0,1,0,0,0,0
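Because several of the commands above are nonpersistent, the startup script mentioned in the TODO could collect them in one place. The following is only a sketch under assumptions: the names `run`, `apply_roce_qos` and `DRY_RUN` are hypothetical, the interface and device arguments stand in for the `<interface>` and `<mlx-device>` placeholders, and how the script gets hooked into boot is left open.

```shell
#!/bin/sh
# Sketch only: re-applies the RoCE QoS settings listed above.
# run, apply_roce_qos and DRY_RUN are hypothetical names, not part of HPCasCode.

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

apply_roce_qos() {
    iface=$1   # network interface (the <interface> placeholder above)
    dev=$2     # RDMA device (the <mlx-device> placeholder above)
    tos=106    # DSCP 26, as in the steps above

    # Step 1: QoS parameters
    run mlnx_qos -i "$iface" --trust dscp
    run sh -c "echo $tos > /sys/class/infiniband/$dev/tc/1/traffic_class"
    run cma_roce_tos -d "$dev" -t "$tos"
    run sysctl -w net.ipv4.tcp_ecn=1

    # Step 2: PFC on priority 3
    run mlnx_qos -i "$iface" --pfc 0,0,0,1,0,0,0,0
}
```

With `DRY_RUN=1` set, calling `apply_roce_qos` with an interface and device name prints the five commands instead of running them, which makes it easy to review before applying on a node.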
```
Tuning of MLNX NIC:
Mellanox NIC tuning can be done with the mlnx_tune tool included in the MLNX_OFED drivers. It can also be downloaded separately if you are using the inbox drivers.
From discussion with Monash, we do not need to tune the cards as the default profile is working correctly.
Commands to run:
mlnx_tune -r
- Show a report of the current MLX NIC and System
```
- [x] # mlnx_tune -p <profile-name> //There are several profiles to select from, see below. High Throughput:
Assignee: Trung Nguyen

Update telegraf_slurmstats.py
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/merge_requests/514
Updated: 2021-10-19T12:21:26+11:00
Author: Chris Hines

modifications to playbooks because 1. we're not using ldap 2. we're mounting...
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/merge_requests/513
Updated: 2021-11-10T15:05:29+11:00
Author: Chris Hines
There are lots of minor commits to make the pipeline work with a different structure
now using ansible to create and destroy openstack resources
using make_files and template in CICD to calculate the contents of files and vars so we don't need calculate* roles any more
Using inventory.yml format rather than the crazy bash hack that returns json
Assignee: Andreas Hamacher

updating spank plugin
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/merge_requests/512
Updated: 2021-10-15T12:36:16+11:00
Author: Andreas Hamacher
Updated spank plugin after complaints on ubuntu.
tested on m3t000 and m3f031
```
[username@m3-login1 ~]$ srun --partition=desktop --reservation=AWX -w m3f031 --qos=desktopq "hostname"
m3f031
```
```
[root@m3f031 ~]# cat /var/log/slurmd.log
[2021-10-15T10:57:38.513] Node reconfigured socket/core boundaries SocketsPerBoard=1:3(hw) CoresPerSocket=3:1(hw)
[2021-10-15T10:57:38.513] Message aggregation disabled
[2021-10-15T10:57:38.515] CPU frequency setting not configured for this node
[2021-10-15T10:57:38.516] slurmd version 20.02.7 started
[2021-10-15T10:57:38.517] error: Invalid PrologSlurmctld(`/opt/slurm-latest/etc/slurmctld.prolog`): No such file or directory
[2021-10-15T10:57:38.517] slurmd started on Fri, 15 Oct 2021 10:57:38 +1100
[2021-10-15T10:57:40.792] CPUs=3 Boards=1 Sockets=3 Cores=1 Threads=1 Memory=13869 TmpDisk=30172 Uptime=3182687 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
[2021-10-15T11:22:13.602] _run_prolog: run job script took usec=362669
[2021-10-15T11:22:13.619] _run_prolog: prolog with lock for job 20944182 ran for 0 seconds
[2021-10-15T11:22:13.852] [20944182.extern] task/cgroup: /slurm/uid_11436/job_20944182: alloc=4096MB mem.limit=4096MB memsw.limit=unlimited
[2021-10-15T11:22:13.866] [20944182.extern] task/cgroup: /slurm/uid_11436/job_20944182/step_extern: alloc=4096MB mem.limit=4096MB memsw.limit=unlimited
[2021-10-15T11:22:15.079] launch task 20944182.0 request from UID:11436 GID:10025 HOST:172.16.202.163 PORT:47788
[2021-10-15T11:22:15.080] lllp_distribution jobid [20944182] implicit auto binding: sockets,one_thread, dist 8192
[2021-10-15T11:22:15.080] _task_layout_lllp_cyclic
[2021-10-15T11:22:15.080] _lllp_generate_cpu_bind jobid [20944182]: mask_cpu,one_thread, 0x1
[2021-10-15T11:22:15.104] [20944182.0] _setup_stepd_job_info: SLURM_STEP_RESV_PORTS found 12261-12262
[2021-10-15T11:22:15.119] [20944182.0] task/cgroup: /slurm/uid_11436/job_20944182: alloc=4096MB mem.limit=4096MB memsw.limit=unlimited
[2021-10-15T11:22:15.126] [20944182.0] task/cgroup: /slurm/uid_11436/job_20944182/step_0: alloc=4096MB mem.limit=4096MB memsw.limit=unlimited
[2021-10-15T11:22:15.140] [20944182.0] task_p_pre_launch: Using sched_affinity for tasks
[2021-10-15T11:22:15.180] [20944182.0] done with job
[2021-10-15T11:22:15.214] [20944182.extern] done with job
[2021-10-15T11:23:03.434] _run_prolog: run job script took usec=207665
[2021-10-15T11:23:03.443] _run_prolog: prolog with lock for job 20944190 ran for 0 seconds
[2021-10-15T11:23:03.581] [20944190.extern] task/cgroup: /slurm/uid_11436/job_20944190: alloc=4096MB mem.limit=4096MB memsw.limit=unlimited
[2021-10-15T11:23:03.595] [20944190.extern] task/cgroup: /slurm/uid_11436/job_20944190/step_extern: alloc=4096MB mem.limit=4096MB memsw.limit=unlimited
[2021-10-15T11:23:04.797] launch task 20944190.0 request from UID:11436 GID:10025 HOST:172.16.202.163 PORT:16045
[2021-10-15T11:23:04.798] lllp_distribution jobid [20944190] implicit auto binding: sockets,one_thread, dist 8192
[2021-10-15T11:23:04.798] _task_layout_lllp_cyclic
[2021-10-15T11:23:04.798] _lllp_generate_cpu_bind jobid [20944190]: mask_cpu,one_thread, 0x1
[2021-10-15T11:23:04.826] [20944190.0] _setup_stepd_job_info: SLURM_STEP_RESV_PORTS found 12265-12266
[2021-10-15T11:23:04.838] [20944190.0] task/cgroup: /slurm/uid_11436/job_20944190: alloc=4096MB mem.limit=4096MB memsw.limit=unlimited
[2021-10-15T11:23:04.845] [20944190.0] task/cgroup: /slurm/uid_11436/job_20944190/step_0: alloc=4096MB mem.limit=4096MB memsw.limit=unlimited
[2021-10-15T11:23:04.859] [20944190.0] task_p_pre_launch: Using sched_affinity for tasks
[2021-10-15T11:23:04.915] [20944190.0] done with job
[2021-10-15T11:23:04.949] [20944190.extern] done with job
```
Assignee: Simon Michnowicz

Chgmlxdrvcheckmode
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/merge_requests/511
Updated: 2021-10-12T12:58:00+11:00
Author: Andreas Hamacher
Assignee: Simon Michnowicz

setting default xsessionmanager via update alternatives rather than changing...
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/merge_requests/510
Updated: 2021-10-15T14:22:25+11:00
Author: Andreas Hamacher
setting default xsessionmanager via update-alternatives rather than changing vncserver on Debian. Removing RHEL support because it is not tested at all
Assignee: Chris Hines

Dgxmlx
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/merge_requests/509
Updated: 2021-10-27T11:33:41+11:00
Author: Andreas Hamacher
Baremetals, e.g. dgx nodes, should update the firmware. VMs should not. And hypervisors are done somewhere else
Assignee: Trung Nguyen

Logrotate on dgx
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/merge_requests/508
Updated: 2021-09-29T16:53:58+10:00
Author: Andreas Hamacher
already tested on PROD :( no other RHEL node available
Assignee: Simon Michnowicz

ibstat fails when run on a re-install before reboot
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/merge_requests/507
Updated: 2021-09-23T14:53:14+10:00
Author: Andreas Hamacher
Assignee: Simon Michnowicz

adding python3-jinja2 package deployment for RHEL
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/merge_requests/506
Updated: 2021-09-23T12:49:52+10:00
Author: Andreas Hamacher
Assignee: Trung Nguyen

Fix syntax error in telegraf config for softnet_stats
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/merge_requests/505
Updated: 2021-09-21T12:16:03+10:00
Author: Chris Hines
Need to escape quotes in the awk command

fix up regex
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/merge_requests/504
Updated: 2021-09-20T17:21:31+10:00
Author: Chris Hines
Assignee: Andreas Hamacher

Ubuntu20baseimage
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/merge_requests/503
Updated: 2021-09-20T13:35:14+10:00
Author: Andreas Hamacher
Changes in this branch:
- gitlab pipeline changed to a fixed stack which is not going to be rebuilt every time. This should improve reliability
- all base images changed to ubuntu 20. I might add a centos 7 compute back in later
- fix for re-running the mysql role where setting the root PW works but updating fails
- minor ubuntu fixes for mysql role, rsyslog and ldap

update the ldapserver role to function on Ubuntu
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/merge_requests/502
Updated: 2021-09-08T14:11:42+10:00
Author: Chris Hines