HPCasCode merge requests
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/merge_requests
Updated: 2022-05-05T12:41:34+10:00

removing deprecated flag TaskAffinity=yes for slurm21
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/merge_requests/556
Updated: 2022-05-05T12:41:34+10:00
Author: Andreas Hamacher
Assignee: Chris Hines

managementnodes also need a bigger tmpfs
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/merge_requests/555
Updated: 2022-04-21T16:37:03+10:00
Author: Andreas Hamacher
Assignee: Simon Michnowicz

applying MIG via systemd to make it reboot persistent
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/merge_requests/554
Updated: 2022-04-26T15:55:06+10:00
Author: Andreas Hamacher
nvidia does some very strange deployments here....
systemd is calling an nvidia script which expects a config file in a certain place and uses a local environment variable to parameterize the systemd thing
who am I to judge :smile:
This change is not calling the parted binary directly but installs the systemd service and enables/starts it to gain persistence....
tested on qcif-node01
Assignee: Chris Hines

Missing quotes.
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/merge_requests/553
Updated: 2022-04-07T12:01:00+10:00
Author: Jay Van Schyndel
Oops, missing quotes.
Assignee: Jay Van Schyndel

Jay
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/merge_requests/552
Updated: 2022-04-07T11:12:21+10:00
Author: Jay Van Schyndel
Added variables for user defined Cuda and libcudnn versions.
Assignee: Jay Van Schyndel

Mpiuplift
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/merge_requests/551
Updated: 2022-04-14T15:37:18+10:00
Author: Andreas Hamacher
Assignee: Simon Michnowicz

Added checks for an NVIDIA GPU, mig compatible GPU.
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/merge_requests/550
Updated: 2022-03-15T13:31:20+11:00
Author: Jay Van Schyndel
Only runs when an NVIDIA GPU is detected; MIG is only set up/configured when that GPU supports MIG.
Assignee: Jay Van Schyndel

remove most nhc checks as its difficult to link the number of CPUs the...
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/merge_requests/549
Updated: 2022-03-02T12:20:35+11:00
Author: Chris Hines
remove most nhc checks as it's difficult to link the number of CPUs the instance will be created with to the nhc file
Assignee: Andreas Hamacher

adding role for x11 forwarding
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/merge_requests/548
Updated: 2022-02-03T19:03:54+11:00
Author: Andreas Hamacher
to be used for monarch_logins
Assignee: Simon Michnowicz

Buildslurmwithpmi
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/merge_requests/547
Updated: 2022-04-27T14:28:55+10:00
Author: Andreas Hamacher
we have nodes without pmi which is impacting some jobs
https://monasheresearch.freshdesk.com/a/tickets/28868
code tested on massive004. It only works on new slurm installs.
to fix the nodes already in the queue run
```
cd /opt/src/slurm-20.02.7/contribs/pmi;sudo make install
cd /opt/src/slurm-20.02.7/contribs/pmi2;sudo make install
```
Assignee: Trung Nguyen

Made /var/lib/sssd 80 M instead of 40M
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/merge_requests/546
Updated: 2022-01-14T19:15:56+11:00
Author: Simon Michnowicz
Assignee: Andreas Hamacher

adding ucx dependency
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/merge_requests/545
Updated: 2022-01-05T12:26:55+11:00
Author: Andreas Hamacher
self merging since this is a very safe "trivial" change and should not rot in MR land
Assignee: Andreas Hamacher

adding munge dependencies
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/merge_requests/544
Updated: 2021-12-14T14:56:09+11:00
Author: Andreas Hamacher
Assignee: Simon Michnowicz

debian lustre moduls have different names
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/merge_requests/543
Updated: 2021-12-09T16:34:07+11:00
Author: Andreas Hamacher
lustre packages need to be removed if the kernel changes.
the kernel modules need to be rebuilt, but AFTER the mellanox module is built, and I don't know how to configure an order in dkms
Assignee: Simon Michnowicz

fixing path for ubuntu
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/merge_requests/542
Updated: 2021-12-09T15:50:34+11:00
Author: Andreas Hamacher
this path exists on centos7 and ubuntu.
ubuntu currently keeps installing the driver every time I run the role
Assignee: Simon Michnowicz

the cronname needs to be unique, fixing path for symlinker script
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/merge_requests/541
Updated: 2021-12-09T11:28:15+11:00
Author: Andreas Hamacher

Update symlinker.sh.j2 to simplify ansible template
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/merge_requests/540
Updated: 2021-12-09T11:14:29+11:00
Author: Kerri Wait
Assignee: Andreas Hamacher

Update symlinker.sh.j2 to fix Kerri's stupid typo
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/merge_requests/539
Updated: 2021-12-08T17:17:28+11:00
Author: Kerri Wait
Assignee: Andreas Hamacher

Cleaning up symlinker.sh with clearer variable names and better find command instead of ls
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/merge_requests/538
Updated: 2021-12-08T17:04:03+11:00
Author: Kerri Wait
Assignee: Andreas Hamacher

Symlink cronjob role and template
https://gitlab.erc.monash.edu.au/hpc-team/HPCasCode/-/merge_requests/537
Updated: 2021-12-08T13:48:06+11:00
Author: Kerri Wait
Example plays to invoke the role for monarch and m3
```yaml
- hosts: monarch-login1
  gather_facts: False
  vars:
    lustre_mount: "/monfs00"
    local_directory_path: "/mnt/lustre"
    lustre_storage_types:
      - projects
      - scratch
  roles:
    - { role: lustre-symlinks }
- hosts: m3-login1
  gather_facts: False
  vars:
    lustre_mount: "/fs02"
    local_directory_path: "/projects"
    lustre_storage_types:
      - projects
  roles:
    - { role: lustre-symlinks }
- hosts: m3-login1
  gather_facts: False
  vars:
    lustre_mount: "/fs03"
    local_directory_path: "/scratch"
    lustre_storage_types:
      - scratch
  roles:
    - { role: lustre-symlinks }
```
Assignee: Andreas Hamacher
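
The plays for merge request 537 pass three variables to the lustre-symlinks role but do not show what the templated script does with them. As a rough illustration only, the sketch below shows one way such a symlinker could work. It assumes (this is a guess, not taken from the repository) that every top-level directory under `<lustre_mount>/<storage type>` is linked directly into `local_directory_path`. The function name `link_storage_dirs` is hypothetical, and this is not the contents of symlinker.sh.j2.

```shell
#!/bin/sh
# Hypothetical sketch only: this is NOT the repository's symlinker.sh.
# Assumed layout: <lustre_mount>/<storage_type>/<dir> is linked to <local_dir>/<dir>.
set -eu

# Usage: link_storage_dirs LUSTRE_MOUNT LOCAL_DIR STORAGE_TYPE...
# LUSTRE_MOUNT should be an absolute path so the created links resolve from anywhere.
link_storage_dirs() {
    lustre_mount=$1
    local_dir=$2
    shift 2
    mkdir -p "$local_dir"
    for storage_type in "$@"; do
        # Use find rather than parsing ls output; -mindepth/-maxdepth restrict
        # the search to the direct child directories of the storage type.
        # ln -sf creates the link inside local_dir and replaces a stale link.
        find "$lustre_mount/$storage_type" -mindepth 1 -maxdepth 1 -type d \
            -exec ln -sf {} "$local_dir/" \;
    done
}
```

With the variables from the second play, a call would look like `link_storage_dirs /fs02 /projects projects`. Running it again simply refreshes the links, which is the property a cron-driven job would need.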