HPCasCode merge requests
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/merge_requests
2021-12-07T21:30:47+11:00

https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/merge_requests/536
bugfix traffic class was not set on VMs
2021-12-07T21:30:47+11:00
Andreas Hamacher
Trung Nguyen

https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/merge_requests/535
service reliability improved. still not super satisfied
2021-12-07T20:14:16+11:00
Andreas Hamacher
- ExecStartPost may be interrupted.
- The sleep is following a recommendation of RC. I have seen this work only on a second preboot run before so I guess a sleep is a good idea
Trung Nguyen

https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/merge_requests/534
disabling global pause on hypervisors
2021-12-06T17:03:57+11:00
Andreas Hamacher
Trung Nguyen

https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/merge_requests/533
Mellanoxcfgchg
2021-12-06T15:37:27+11:00
Andreas Hamacher
Trung Nguyen

https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/merge_requests/532
minor maintenance improvement
2021-12-02T20:18:16+11:00
Andreas Hamacher
Andreas Hamacher

https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/merge_requests/531
Update telegraf version from 1.15 to 1.20
2021-11-29T12:22:01+11:00
Kerri Wait
Chris Hines

https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/merge_requests/530
Xconf gen
2021-11-29T12:42:51+11:00
Chris Hines
Andreas Hamacher

https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/merge_requests/529
Update roles/telegraf/templates/telegraf.conf.j2
2021-11-26T20:05:43+11:00
Kerri Wait

https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/merge_requests/527
Baremetalfixes
2021-11-05T10:14:29+11:00
Andreas Hamacher
Simon Michnowicz

https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/merge_requests/526
Xorgconfgen
2021-11-04T17:06:06+11:00
Andreas Hamacher
Simon Michnowicz

https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/merge_requests/525
adding an option to specify a nopasswd user to the role because we cannot just...
2021-11-04T17:01:32+11:00
Andreas Hamacher
adding an option to specify a nopasswd user to the role because we cannot just rely on the OS-image having that
an example playbook line would look like: - { role: enable_sudo_group, nopasswd_user: "ec2-user", tags: [ sudo, authentication ] }
Simon Michnowicz

https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/merge_requests/523
reboots a node after selinux was disabled to make the change take effect.
2022-04-29T11:11:01+10:00
Andreas Hamacher
Simon Michnowicz

https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/merge_requests/522
changing name of monashhpc_epel to not clash with the public epel
2021-11-04T17:05:05+11:00
Andreas Hamacher
Simon Michnowicz

https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/merge_requests/521
Mig
2022-01-17T17:06:12+11:00
Chris Hines
Andreas Hamacher

https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/merge_requests/520
Add pam_slurm_adopt for ubuntu nodes
2021-10-29T09:14:35+11:00
Chris Hines

https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/merge_requests/519
Mount nvme disks on /mnt/nvme
2021-10-29T09:01:04+11:00
Chris Hines
addresses https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/issues/34

https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/merge_requests/515
Mellanox config
2021-11-09T13:21:36+11:00
Andreas Hamacher
factoring out mellanox_config from mellanox_install
TODO:
```
Host - Configuration of MLNX NIC:
Note: Disable Global Pause on all NICs
Step 1 - Set QoS parameters
### Use all commands and set up a startup script to apply on boot
Set DSCP (L3) as trust mode for the NIC
```
- [x] # mlnx_qos -i <interface> --trust dscp
```
Set ToS to 106 (DSCP 26) for ALL RoCE traffic (Note: This command is nonpersistent)
```
- [x] # echo 106 > /sys/class/infiniband/<mlx-device>/tc/1/traffic_class
```
Set the RDMA-CM ToS to 106 (DSCP 26) (Note: This command is nonpersistent)
```
- [x] # cma_roce_tos -d <mlx-device> -t 106
```
Enable ECN for TCP traffic (Note: This command is nonpersistent)
```
- [x] # sysctl -w net.ipv4.tcp_ecn=1
```
Step 2 - Enable PFC on RoCE priority
Activate PFC on priority 3
Using mlnx_qos tool (Note: This command is nonpersistent):
```
- [x] # mlnx_qos -i <interface> --pfc 0,0,0,1,0,0,0,0
```
Tuning of MLNX NIC:
Mellanox NIC tuning can be done by using the mlnx_tune tool included in the MLX OFED Drivers. It can also be downloaded separately if you are using the inbox drivers.
From discussion with Monash, we do not need to tune the cards as the default profile is working correctly.
Commands to run:
mlnx_tune -r
- Show a report of the current MLX NIC and System
```
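The nonpersistent QoS and PFC commands from Steps 1 and 2 could be collected into one script that is re-run on boot, roughly as sketched here. This is only an illustration, not the role's actual implementation: the `IFACE` and `MLX_DEV` defaults, the `APPLY` switch, and the function names are all assumptions introduced for the example. The `tos_from_dscp` helper shows where 106 comes from: DSCP 26 shifted into the upper six bits of the ToS byte gives 104, and the remaining 2 is the ECN field value.

```shell
#!/usr/bin/env bash
# Sketch of a boot-time script that re-applies the nonpersistent RoCE settings.
# IFACE and MLX_DEV defaults are placeholders (assumptions), not real device names.

IFACE="${IFACE:-ens1f0}"       # hypothetical interface name
MLX_DEV="${MLX_DEV:-mlx5_0}"   # hypothetical mlx device name
APPLY="${APPLY:-0}"            # 0 = only print the commands, 1 = also run them

# ToS byte = DSCP in the upper six bits, ECN field in the lower two bits.
tos_from_dscp() {
    echo $(( ($1 << 2) | $2 ))
}

# Print a command, and execute it only when APPLY=1.
run() {
    echo "+ $*"
    if [ "$APPLY" = "1" ]; then
        "$@"
    fi
}

apply_roce_qos() {
    local tos
    tos="$(tos_from_dscp 26 2)"   # DSCP 26 with ECN field 2 gives ToS 106
    run mlnx_qos -i "$IFACE" --trust dscp
    run sh -c "echo $tos > /sys/class/infiniband/$MLX_DEV/tc/1/traffic_class"
    run cma_roce_tos -d "$MLX_DEV" -t "$tos"
    run sysctl -w net.ipv4.tcp_ecn=1
    run mlnx_qos -i "$IFACE" --pfc 0,0,0,1,0,0,0,0
}

apply_roce_qos
```

Run as-is, the script only prints the five commands; something like `APPLY=1 IFACE=<interface> MLX_DEV=<mlx-device> ./script` would execute them as root. How the script is hooked into boot (for example a systemd unit) is left open here.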
- [x] # mlnx_tune -p <profile-name> //There are several profiles to select from, see below. High Throughput:
Trung Nguyen

https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/merge_requests/514
Update telegraf_slurmstats.py
2021-10-19T12:21:26+11:00
Chris Hines

https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/merge_requests/513
modifications to playbooks because 1. we're not using ldap 2. we're mounting...
2021-11-10T15:05:29+11:00
Chris Hines
There are lots of minor commits to make the pipeline work with a different structure
now using ansible to create and destroy openstack resources
using make_files and template in CICD to calculate the contents of files and vars so we don't need calculate* roles any more
Using inventory.yml format rather than the crazy bash hack that returns json
Andreas Hamacher

https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/merge_requests/512
updating spank plugin
2021-10-15T12:36:16+11:00
Andreas Hamacher
Updated spank plugin after complaints on ubuntu.
tested on m3t000 and m3f031
```
[username@m3-login1 ~]$ srun --partition=desktop --reservation=AWX -w m3f031 --qos=desktopq "hostname"
m3f031
```
```
[root@m3f031 ~]# cat /var/log/slurmd.log
[2021-10-15T10:57:38.513] Node reconfigured socket/core boundaries SocketsPerBoard=1:3(hw) CoresPerSocket=3:1(hw)
[2021-10-15T10:57:38.513] Message aggregation disabled
[2021-10-15T10:57:38.515] CPU frequency setting not configured for this node
[2021-10-15T10:57:38.516] slurmd version 20.02.7 started
[2021-10-15T10:57:38.517] error: Invalid PrologSlurmctld(`/opt/slurm-latest/etc/slurmctld.prolog`): No such file or directory
[2021-10-15T10:57:38.517] slurmd started on Fri, 15 Oct 2021 10:57:38 +1100
[2021-10-15T10:57:40.792] CPUs=3 Boards=1 Sockets=3 Cores=1 Threads=1 Memory=13869 TmpDisk=30172 Uptime=3182687 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
[2021-10-15T11:22:13.602] _run_prolog: run job script took usec=362669
[2021-10-15T11:22:13.619] _run_prolog: prolog with lock for job 20944182 ran for 0 seconds
[2021-10-15T11:22:13.852] [20944182.extern] task/cgroup: /slurm/uid_11436/job_20944182: alloc=4096MB mem.limit=4096MB memsw.limit=unlimited
[2021-10-15T11:22:13.866] [20944182.extern] task/cgroup: /slurm/uid_11436/job_20944182/step_extern: alloc=4096MB mem.limit=4096MB memsw.limit=unlimited
[2021-10-15T11:22:15.079] launch task 20944182.0 request from UID:11436 GID:10025 HOST:172.16.202.163 PORT:47788
[2021-10-15T11:22:15.080] lllp_distribution jobid [20944182] implicit auto binding: sockets,one_thread, dist 8192
[2021-10-15T11:22:15.080] _task_layout_lllp_cyclic
[2021-10-15T11:22:15.080] _lllp_generate_cpu_bind jobid [20944182]: mask_cpu,one_thread, 0x1
[2021-10-15T11:22:15.104] [20944182.0] _setup_stepd_job_info: SLURM_STEP_RESV_PORTS found 12261-12262
[2021-10-15T11:22:15.119] [20944182.0] task/cgroup: /slurm/uid_11436/job_20944182: alloc=4096MB mem.limit=4096MB memsw.limit=unlimited
[2021-10-15T11:22:15.126] [20944182.0] task/cgroup: /slurm/uid_11436/job_20944182/step_0: alloc=4096MB mem.limit=4096MB memsw.limit=unlimited
[2021-10-15T11:22:15.140] [20944182.0] task_p_pre_launch: Using sched_affinity for tasks
[2021-10-15T11:22:15.180] [20944182.0] done with job
[2021-10-15T11:22:15.214] [20944182.extern] done with job
[2021-10-15T11:23:03.434] _run_prolog: run job script took usec=207665
[2021-10-15T11:23:03.443] _run_prolog: prolog with lock for job 20944190 ran for 0 seconds
[2021-10-15T11:23:03.581] [20944190.extern] task/cgroup: /slurm/uid_11436/job_20944190: alloc=4096MB mem.limit=4096MB memsw.limit=unlimited
[2021-10-15T11:23:03.595] [20944190.extern] task/cgroup: /slurm/uid_11436/job_20944190/step_extern: alloc=4096MB mem.limit=4096MB memsw.limit=unlimited
[2021-10-15T11:23:04.797] launch task 20944190.0 request from UID:11436 GID:10025 HOST:172.16.202.163 PORT:16045
[2021-10-15T11:23:04.798] lllp_distribution jobid [20944190] implicit auto binding: sockets,one_thread, dist 8192
[2021-10-15T11:23:04.798] _task_layout_lllp_cyclic
[2021-10-15T11:23:04.798] _lllp_generate_cpu_bind jobid [20944190]: mask_cpu,one_thread, 0x1
[2021-10-15T11:23:04.826] [20944190.0] _setup_stepd_job_info: SLURM_STEP_RESV_PORTS found 12265-12266
[2021-10-15T11:23:04.838] [20944190.0] task/cgroup: /slurm/uid_11436/job_20944190: alloc=4096MB mem.limit=4096MB memsw.limit=unlimited
[2021-10-15T11:23:04.845] [20944190.0] task/cgroup: /slurm/uid_11436/job_20944190/step_0: alloc=4096MB mem.limit=4096MB memsw.limit=unlimited
[2021-10-15T11:23:04.859] [20944190.0] task_p_pre_launch: Using sched_affinity for tasks
[2021-10-15T11:23:04.915] [20944190.0] done with job
[2021-10-15T11:23:04.949] [20944190.extern] done with job
```
Simon Michnowicz