hpc-team / HPCasCode
Commit 420d334e, authored 4 years ago by Andreas Hamacher

trying to calculate nhc.conf

Former-commit-id: 6c4d2f97
Parent: 1e705cc9

Showing 3 changed files with 2 additions and 119 deletions:
CICD/files/.gitignore: 1 addition, 0 deletions
CICD/files/nhc.conf: 0 additions, 119 deletions
CICD/plays/computenodes.yml: 1 addition, 0 deletions

CICD/files/.gitignore (+1 −0)
+nhc.conf
 ssh_known_hosts
 slurm.conf
 slurmdbd.conf
 ...

CICD/files/nhc.conf (deleted, mode 100644 → 0; +0 −119)
#######################################################################
###
### Filesystem checks
###
* || check_fs_used / 90%
# * || check_fs_iused / 100%
# * || check_fs_iused /glusterVolume 100%
#not that useful at this stage as Nagios should be monitoring servers.
# just check the file servers are happy
#* || check_fs_used '/usr/local' 95%
#* || check_fs_used '/home' 95%
#* || check_fs_used '/projects' 95%
#* || check_fs_used '/scratch' 95%
#* || check_fs_used '/' 100%
#
# New syntax: check_fs_mount [ -0 ] [ -r ] [ -s src ] [ -t fstype ] [ -o options ] [ -O mount_options ] [ -e cmd ] [ -E cmd ] -f fs
#
# m3a0[16-20] nodes are currently disabled from mounting /usr/local. The check is disabled on all m3a because I didn't want to be more specific -- Chris 20180212
m3a* || check_fs_mount_rw -f '/usr/local'
m3c* || check_fs_mount_rw -f '/usr/local'
m3d* || check_fs_mount_rw -f '/usr/local'
m3e* || check_fs_mount_rw -f '/usr/local'
m3f* || check_fs_mount_rw -f '/usr/local'
m3g* || check_fs_mount_rw -f '/usr/local'
m3h* || check_fs_mount_rw -f '/usr/local'
m3i* || check_fs_mount_rw -f '/usr/local'
m3m* || check_fs_mount_rw -f '/usr/local'
m3p* || check_fs_mount_rw -f '/usr/local'
dgx* || check_fs_mount_rw -f '/usr/local'
* || check_fs_mount_rw -f '/home'
* || check_fs_mount_rw -f '/projects'
* || check_fs_mount_rw -f '/scratch'
* || check_lustre_health
#check numa config
m3a* || check_numa
m3c* || check_numa
m3d* || check_numa
m3e* || check_numa
m3g* || check_numa
m3h* || check_numa
m3i* || check_numa
m3m* || check_numa
m3p* || check_numa
* || check_SSSD
* || check_user_lookup
#######################################################################
###
### Hardware checks
###
# Don't check_hw_eth eth0 because most of our compute nodes have eth1 not eth0, but I won't guarantee this
# This has to do with renaming eth0 to mlx0 (i.e. the mellanox device) but is senstive to device initialisation order I suspect
# Chris Hines 20160907
# * || check_hw_cpuinfo 1 1 1
# * || check_hw_physmem 4048416kB 4048416kB 3%
* || check_hw_swap 0kB 0kB 3%
* || check_hw_eth lo
!dgx* || check_hw_eth mlx0
dgx* || check_hw_eth bond0.113
dgx* || check_hw_eth bond0.114
* || check_ibv_devinfo
###
### ECC not available on m3f K1
m3c* || check_gpu_ecc
m3e* || check_gpu_ecc
# m3c* have 4 gpus
# m3e* have 8 gpus
# m3f* have 3 gpus
# m3g* have 3 gpus
# m3h* have 2 gpus
# m3p* have 6 gpus
# dgx* have 8 gpus
m3c* || check_num_of_gpu 4
m3e* || check_num_of_gpu 8
m3f* || check_num_of_gpu 1
m3g* || check_num_of_gpu 3
m3h* || check_num_of_gpu 2
m3p* || check_num_of_gpu 6
dgx* || check_num_of_gpu 8
#add more here
m3c* || check_nvidia_device_existance
m3e* || check_nvidia_device_existance
m3f* || check_nvidia_device_existance
m3g* || check_nvidia_device_existance
m3h* || check_nvidia_device_existance
m3p* || check_nvidia_device_existance
dgx* || check_nvidia_device_existance
# Kerri Wait 20170830 Add new check for xorg.conf file to ensure vglrun works
m3c* || check_xorg_conf_file_existance
m3f* || check_xorg_conf_file_existance
m3g* || check_xorg_conf_file_existance
m3h* || check_xorg_conf_file_existance
m3p* || check_xorg_conf_file_existance
#######################################################################
###
### Process checks
###
* || check_ps_service -S -u root sshd
* || check_ps_service -S ntpd
#Check for UVM for m3c and m3h
m3c* || check_nvidia_uvm
m3g* || check_nvidia_uvm
m3h* || check_nvidia_uvm
m3p* || check_nvidia_uvm
#dgx* || check_nvidia_uvm
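
Note on the intent of this commit: the static nhc.conf above is removed and a calculateNhcConfig role is added to the play, but the role's tasks are not part of this diff. The following is only a rough sketch of what "calculating" such a file could look like, assuming the per-node-class lines are rendered from a lookup table. The function names and the table layout are hypothetical; the GPU counts and mount points are copied from the deleted file.

```python
# Hypothetical sketch: render NHC check lines from a small lookup table.
# Nothing here is taken from the calculateNhcConfig role itself.

# GPU counts per node-name prefix, as given by the check_num_of_gpu
# lines in the deleted nhc.conf.
GPU_COUNTS = {"m3c": 4, "m3e": 8, "m3f": 1, "m3g": 3, "m3h": 2, "m3p": 6, "dgx": 8}


def render_gpu_checks(gpu_counts):
    """Return one 'check_num_of_gpu' line per node-name prefix."""
    return [
        f"{prefix}* || check_num_of_gpu {count}"
        for prefix, count in gpu_counts.items()
    ]


def render_nhc_conf(gpu_counts, mounts=("/home", "/projects", "/scratch")):
    """Assemble a minimal nhc.conf body: filesystem checks, then GPU checks."""
    lines = ["* || check_fs_used / 90%"]
    lines += [f"* || check_fs_mount_rw -f '{mount}'" for mount in mounts]
    lines += render_gpu_checks(gpu_counts)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_nhc_conf(GPU_COUNTS), end="")
```

Keeping the per-class values in one table would avoid repeating each node prefix once per check, which is where the deleted file's comment ("m3f* have 3 gpus") and its actual check line (check_num_of_gpu 1) appear to have drifted apart.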

CICD/plays/computenodes.yml (+1 −0)
@@ -72,6 +72,7 @@
   roles:
   - { role: slurm-common, tags: [ slurm, slurmbuild ] }
   - { role: slurm_config, tags: [ slurm_config, slurm ] }
+  - { role: calculateNhcConfig, tags: [ nhc, slurm ] }
   - { role: nhc, tags: [ nhc, slurm ] }
   - { role: slurm-start, start_slurmd: True, tags: [ slurm, slurm-start ] }
   - { role: vncserver, tags: [ other ] }
 ...