Skip to content
Snippets Groups Projects
Commit a9b35f6a authored by Gin Tan's avatar Gin Tan
Browse files

Merge branch 'slurm-trigger' into 'master'

This adds sslurm-trigger events to your current slurm system.

See merge request hpc-team/ansible_cluster_in_a_box!199
parents 52d24c8e 052064ff
No related branches found
No related tags found
1 merge request!199This adds sslurm-trigger events to your current slurm system.
THis role sets up trigger events on your slurm cluster.
What you want the triggers to do is up to you, so you will probably modify the templated shell files.
Copy the role to a local role directory?
Triggers used in this role as it stands
- primary_slurmctld_failure
- primary_slurmctld_resumed_operation.sh
- node down
USAGE:
- hosts: 'ManagementNodes'
tasks:
- include_vars: vars/slurm.yml
- hosts: 'ManagementNodes'
roles:
- slurm_trigger
The role uses several variables that need to be defined:
{{ slurm_dir }} The directory of slurm install. Shell scripts are copied to sbin
{{ admin_email }} Email address (defined in slurm.yml, or defined some other way) to send alerts to
Each trigger has 2 files. One to respond to a trigger. And one to reset the trigger. The role calls the last one to start the process.
---
############################
- name: template primary_slurmctld_failure
template: dest="{{ slurm_dir }}/sbin/primary_slurmctld_failure.sh" src=primary_slurmctld_failure.sh.j2 mode="0755"
become: true
become_user: root
- name: template set primary_slurmctld_failure trigger
template: dest="{{ slurm_dir }}/sbin/set_primary_slurmctld_failure_trigger.sh" src=set_primary_slurmctld_failure_trigger.sh.j2 mode="0755"
become: true
become_user: root
- name: Execute set_primary_slurmctld_failure)trigger
command: "{{ slurm_dir }}/sbin/set_primary_slurmctld_failure_trigger.sh"
become: true
become_user: slurm
run_once: true
- name: template primary_slurmctld_resumed_operation
template: dest="{{ slurm_dir }}/sbin/primary_slurmctld_resumed_operation.sh" src=primary_slurmctld_resumed_operation.sh.j2 mode="0755"
become: true
become_user: root
- name: template set primary_slurmctld_resumed trigger
template: dest="{{ slurm_dir }}/sbin/set_primary_slurmctld_resumed_operation_trigger.sh" src=set_primary_slurmctld_resumed_operation_trigger.sh.j2 mode="0755"
become: true
become_user: root
- name: Execute primary_slurmctld_resumed_operation.sh
command: "{{ slurm_dir }}/sbin/set_primary_slurmctld_resumed_operation_trigger.sh"
become: true
become_user: slurm
run_once: true
- name: template node_down
template: dest="{{ slurm_dir }}/sbin/node_down.sh" src=node_down.sh.j2 mode="0755"
become: true
become_user: root
- name: template node_down trigger command
template: dest="{{ slurm_dir }}/sbin/set_node_trigger.sh" src=set_node_trigger.sh.j2 mode="0755"
become: true
become_user: root
- name: Execute set_node_trigger.sh
command: "{{ slurm_dir }}/sbin/set_node_trigger.sh"
become: true
become_user: slurm
run_once: true
#!/bin/bash
# Notify the administrator of the failure using by e-mail
echo "On `hostname`:`date`:`whoami`: slurm-trigger event for NODE_FAILURE: $*" | `which mail` -s "NODE FAILURE $*" {{ admin_email }}
# Submit trigger for next primary slurmctld failure event
TRIGGER_CMD="{{ slurm_dir }}/sbin/set_node_trigger.sh"
FILE=/tmp/node_down.txt
#COMMAND="su slurm -c $TRIGGER_CMD"
echo "node_down.sh: `date`: `whoami`: $TRIGGER_CMD" >> $FILE
$TRIGGER_CMD >> $FILE 2>&1
#!/bin/bash
# Notify the administrator of the failure using by e-mail
echo "On `hostname`:`date`:`who`: slurm-trigger event for Primary_SLURMCTLD_FAILURE" | `which mail` -s Primary_SLURMCTLD_FAILURE {{ admin_email }}
# Submit trigger for next primary slurmctld failure event
TRIGGER_CMD="{{ slurm_dir }}/sbin/set_primary_slurmctld_failure_trigger.sh"
FILE=/tmp/primary_down.txt
#COMMAND="su slurm -c $TRIGGER_CMD"
echo "primary_slurmctld_failure.sh:`date`:`whoami`: $TRIGGER_CMD" >> $FILE
$TRIGGER_CMD >> $FILE 2>&1
#!/bin/bash
# Notify the administrator of the failure using by e-mail
echo "On `hostname`:`date`:`whoami`: slurm-trigger event for Primary_SLURMCTLD_RESUMED" | `which mail` -s Primary_SLURMCTLD_RESUMED {{ admin_email }}
# Submit trigger for next primary slurmctld failure event
FILE=/tmp/primary_up.txt
#COMMAND="su slurm -c {{ slurm_dir }}/sbin/set_primary_slurmctld_resumed_operation_trigger.sh"
COMMAND="{{ slurm_dir }}/sbin/set_primary_slurmctld_resumed_operation_trigger.sh"
echo "primary_slurmctld_resumed_operation.sh.sh:`date`:`whoami`: $COMMAND" >> $FILE
$COMMAND >> $FILE 2>&1
#!/bin/bash
TRIGGER_CMD="{{ slurm_dir }}/bin/strigger --set --down --program={{ slurm_dir }}/sbin/node_down.sh"
echo "set_node_trigger.sh: `date`: $TRIGGER_CMD"
$TRIGGER_CMD
#!/bin/bash
TRIGGER_CMD="{{ slurm_dir }}/bin/strigger --set --primary_slurmctld_failure --program={{ slurm_dir }}/sbin/primary_slurmctld_failure.sh"
echo "set_primary_slurmctld_failure_trigger.sh: `date`: $TRIGGER_CMD"
$TRIGGER_CMD
#!/bin/bash
TRIGGER_CMD="{{ slurm_dir }}/bin/strigger --set --primary_slurmctld_resumed_operation --program={{ slurm_dir }}/sbin/primary_slurmctld_resumed_operation.sh"
echo "set_primary_slurmctld_resumed_operation_trigger.sh: `date`: $TRIGGER_CMD"
$TRIGGER_CMD
0% Loading or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment