Showing 1497 additions and 292 deletions
nhc_version: 1.4.2
nhc_src_url: https://codeload.github.com/mej/nhc/tar.gz/refs/tags/1.4.2
nhc_src_checksum: "sha1:766762d2c8cd81204b92d4921fb5b66616351412"
nhc_src_dir: /opt/src/nhc-1.4.2
nhc_dir: /opt/nhc-1.4.2
# previous revision:
# slurm_version: 21.08.8
# slurm_src_url: https://download.schedmd.com/slurm/slurm-21.08.8.tar.bz2
# slurm_src_checksum: "sha1:7d37dbef37b25264a1593ef2057bc423e4a89e81"
slurm_version: 22.05.3
slurm_src_url: https://download.schedmd.com/slurm/slurm-22.05.3.tar.bz2
slurm_src_checksum: "sha1:55e9a1a1d2ddb67b119c2900982c908ba2846c1e"
slurm_src_dir: /opt/src/slurm-{{ slurm_version }}
slurm_dir: /opt/slurm-{{ slurm_version }}
ucx_version: 1.8.0
ucx_src_url: https://github.com/openucx/ucx/releases/download/v1.8.0/ucx-1.8.0.tar.gz
ucx_src_checksum: "sha1:96f2fe1918127edadcf5b195b6532da1da3a74fa"
ucx_src_dir: /opt/src/ucx-1.8.0
ucx_dir: /opt/ucx-1.8.0
munge_version: 0.5.14
munge_src_url: https://github.com/dun/munge/archive/refs/tags/munge-0.5.14.tar.gz
munge_src_checksum: "sha1:70f6062b696c6d4f17b1d3bdc47c3f5eca24757c"
munge_dir: /opt/munge-0.5.14
munge_src_dir: /opt/src/munge-munge-0.5.14
nvidia_mig_parted_version: 0.1.3
nvidia_mig_parted_src_url: https://github.com/NVIDIA/mig-parted/archive/refs/tags/v0.1.3.tar.gz
nvidia_mig_parted_src_checksum: "sha1:50597b4a94348c3d52b3234bb22783fa236f1d53"
nvidia_mig_parted_src_dir: /opt/src/mig-parted-0.1.3
nvidia_mig_slurm_discovery_version: master
nvidia_mig_slurm_discovery_src_url: https://gitlab.com/nvidia/hpc/slurm-mig-discovery.git
nvidia_mig_slurm_discovery_src_dir: /opt/src/mig-slurm_discovery
nvidia_cuda_version: cuda
nvidia_libcudnn_version: libcudnn8-dev
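The build roles consume these variables when fetching and unpacking each source tarball. As a rough sketch only (the task below is illustrative and is not copied from the actual roles, which may download and build the sources differently, and the /tmp staging path is an assumption), the Slurm variables could be used along these lines:

- name: fetch the slurm source tarball  # illustrative sketch, not the real role
  get_url:
    url: "{{ slurm_src_url }}"
    dest: "/tmp/slurm-{{ slurm_version }}.tar.bz2"
    checksum: "{{ slurm_src_checksum }}"
- name: unpack it into the expected source directory
  unarchive:
    src: "/tmp/slurm-{{ slurm_version }}.tar.bz2"
    dest: /opt/src
    remote_src: yes
    creates: "{{ slurm_src_dir }}"

The same pattern applies to the nhc, ucx, munge and mig-parted sources above, each verified against its sha1 checksum.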
**HPCasCode** (formerly ansible_cluster_in_a_box)
=================================================
The aim of this repo is to provide a set of Ansible roles that can be used to deploy a cluster.
We are working from
https://docs.google.com/a/monash.edu/spreadsheets/d/1IZNE7vMid_SHYxImGVtQcNUiUIrs_Nu1xqolyblr0AE/edit#gid=0
as our architecture document.
[![pipeline status](https://gitlab.erc.monash.edu.au/hpc-team/ansible_cluster_in_a_box/badges/cicd/pipeline.svg)](https://gitlab.erc.monash.edu.au/hpc-team/ansible_cluster_in_a_box/commits/cicd) [Issues](https://gitlab.erc.monash.edu.au/hpc-team/ansible_cluster_in_a_box/-/issues)
We aim to make these roles as generic as possible. You should be able to start from an inventory file, an SSH key and a git clone of this repository and end up with a working cluster. In the longer term we might branch out to include utilities that build an inventory file from NeCTAR credentials.
1. [ Introduction / Purpose ](#Introduction)
2. [ Getting started ](#gettingstarted)
3. [ Features ](#Features)
4. [ Assumptions ](#Assumptions)
5. [ Configuration ](#Configuration)
6. [ Contribute and Collaborate ](#Contribute)
7. [ Used by ](#partners)
8. [ CICD Coverage ](#Coverage)
9. [ Roadmap ](#Roadmap)
If you need a password, use get_or_make_password.py (delegated to the password server/localhost) to generate a random one that can be shared between nodes.
Here is an example task (taken from setting up karaage):
- name: mysql db
mysql_db: name=karaage login_user=root login_password={{ sqlrootPasswd.stdout }}
<a name="Introduction"></a>
## Introduction / Purpose TODO
- name: karaage sql password
shell: ~/get_or_make_passwd.py karaageSQL
delegate_to: 127.0.0.1
register: karaageSqlPassword
The purpose of this repository is to deploy **H**igh **P**erformance **C**omputing [HPC] systems using Infrastructure-as-Code [IaC] principles, predominantly with Ansible. The aim is to apply the principles of Infrastructure as Code to HPC systems.
- name: mysql user
mysql_user: name='karaage' password={{ item }} priv=karaage.*:ALL state=present login_user=root login_password={{ sqlrootPasswd.stdout }}
with_items: karaageSqlPassword.stdout
By encoding the system state, the following advantages are gained, which also define the values of this project:
- **collaboration** it is simply easier to share code than systems. This repository also aims to serve as a place for backup and discussion, and in the near future even documentation :)
- **redeployability** if a system follows the same build recipes and the deployment is automated, all installations should be similar, including Test and Production. We would also use the term Immutable Infrastructure to describe this state.
- **CICD automation** test and change automation allow us to put safeguards in place, increasing our quality and easing the burden of change management.
- **modular and reusable** for example, we currently support two operating systems as well as bare-metal and OpenStack-based deployments. AWS support is also in scope.
The scope of this repository starts from an Ansible inventory (although OpenStack and later AWS will be supported) and ends with a GPU-powered desktop provisioning system on top of the resource scheduler Slurm. Services such as an identity system or package repository management are in scope but remain future work.
We aim to make these roles run on all common Linux platforms (both RedHat- and Debian-derived), but at the very least they should work on a CentOS 6 install.
<a name="gettingstarted"></a>
## Getting Started TODO
Ok, all caps incoming: TAKE CARE OF CICD/vars/passwords.yml. Best case, encrypt it with ansible-vault (`ansible-vault encrypt CICD/vars/passwords.yml`) as soon as you start modifying that file.
Given the current state of the documentation, please feel free to reach out and ask for clarification. This project is far from done or perfect!
YAML syntax can be checked at http://www.yamllint.com/
<a name="Features"></a>
## Features
- Currently supports CentOS 7, CentOS 8 and Ubuntu 18.04
- CICD tested, including spawning a cluster, MPI and Slurm tests
- Rolling node updates are currently a work in progress
- Coming up: Strudel2 desktop integration
<a name="Assumptions"></a>
## Assumptions
- The Ansible inventory file needs to define the following node types: 1x [SQLNodes], 1x [NFSNodes], 2x [ManagementNodes], >=0 LoginNodes, >=0 ComputeNodes, >=0 VisNodes (GPU nodes). SQL and NFS can be combined on one host. The ManagementNodes manage Slurm access and job submission. Here is an [example](docs/sampleinventory.yml).
- The filesystem layout assumes the following shared drives: /home, /projects for project data, /scratch for fast compute storage, and /usr/local for provided software; see [here](CICD/files/etcExports).
- TODO: rewrite. A populated vars folder, as in the CICD subfolder, is required. See the Configuration section for details, and apologies.
- The software stack is currently provided as a mounted shared drive. Software can be loaded via Environment Modules. Containerisation is also heavily in use. Please reach out for further details and sharing requests.
- Networking is partially defined on this level and partially on the bare-metal level. For now this is a point to reach out and discuss. Current implementations support public-facing LoginNodes with the rest in a private network behind a NAT gateway, or a private InfiniBand network in conjunction with a 1G network.
- A shared SSH key for a management user is required.
<a name="Configuration"></a>
## Configuration
Configuration is defined in the variables (vars) in the vars folder; see CICD/vars. These vars are evolving over time with the necessary refactoring and abstraction. They have definitely grown organically while this set of Ansible roles was developed, first for a single system and then for multiple systems. The best way to understand their usage currently is to `grep VARIABLENAME` in this repository and see how each variable is used. This is not pretty for external use; let's not even pretend it is. A minimal sketch of the vars files follows the list below.
- filesystems.yml covers filesystem types and exports/mounts
- ldapConfig.yml covers identity via LDAP
- names.yml to store the domain name
- passwords.yml containing passwords in a single file to be encrypted via ansible-vault
- slurm.yml containing all slurm variables
- vars.yml contains variables which don't belong anywhere else
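As a rough illustration only (the file names are real, but the keys below are hypothetical placeholders rather than the exact variables the roles expect), a populated vars folder might contain entries such as:

# CICD/vars/names.yml
domain: cluster.example.org

# CICD/vars/passwords.yml (encrypt with ansible-vault before committing, as noted in Getting Started)
mysql_root_password: "CHANGEME"
munge_key_seed: "CHANGEME"

The real variable names can be found by grepping the roles as described above.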
<a name="Contribute"></a>
## How do I contribute or collaborate
- Get in contact, use the issue tracker, and if you want to contribute documentation, code or anything else, we offer hand-holding for the first merge request.
- A great first step is to get in contact, tell us what you want to know and help us improve the documentation.
- Please contact us via andreas.hamacher(at)monash.edu or help(at)massive.org.au
- Contribution guidelines would also be a good contribution :)
<a name="partners"></a>
## Used by:
![Monash University](docs/images/monash-university-logo.png "monash.edu")
![MASSIVE](docs/images/massive-website-banner.png "massive.org.au")
![Australian Research Data Commons](docs/images/ardc.png "ardc.edu.au")
![University of Western Australia](docs/images/university-of-western-australia-logo.png "uwa.edu.au")
<a name="Coverage"></a>
## CI Coverage
- CentOS 7.8, CentOS 8, Ubuntu 18.04
- All node types as outlined in Assumptions
- vars for MASSIVE, MonARCH and a generic cluster (see files in CICD/vars)
- CD in progress using autoupdate.py
<a name="Roadmap"></a>
## Roadmap
- This section will soon be moved into the Milestones on GitLab.
- Desktop integration using Strudel2. Contributors are welcome to integrate OpenOnDemand.
- CVL integration; see github.com/Characterisation-Virtual-Laboratory
- Automated security checks as part of the CICD pipeline
- Integration of a FOSS identity system; currently only a token LDAP is supported
- System status monitoring and alerting
[Nice read titled Infrastructure as Code DevOps principle: meaning, benefits, use cases](https://medium.com/@FedakV/infrastructure-as-code-devops-principle-meaning-benefits-use-cases-a4461a1fef2)
---
- name: "Check client ca certificate"
register: ca_cert
stat: "path={{ x509_cacert_file }}"
- name: "Check certificate and key"
shell: (openssl x509 -noout -modulus -in {{ x509_cert_file }} | openssl md5 ; openssl rsa -noout -modulus -in {{ x509_key_file }} | openssl md5) | uniq | wc -l
register: certcheck
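# certcheck.stdout is '1' when the certificate and key moduli match and '2' when they differ; '2' triggers regeneration below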
- name: "Check certificate"
register: cert
stat: "path={{ x509_cert_file }}"
- name: "Check key"
register: key
stat: "path={{ x509_key_file }}"
sudo: true
- name: "Default: we don't need a new certificate"
set_fact: needcert=False
- name: "Set need cert if key is missing"
set_fact: needcert=True
when: key.stat.exists == false
- name: "set needcert if cert is missing"
set_fact: needcert=True
when: cert.stat.exists == false
- name: "set needcert if cert doesn't match key"
set_fact: needcert=True
when: certcheck.stdout == '2'
- name: "Creating Keypair"
shell: "echo noop when using easy-rsa"
when: needcert
- name: "Creating CSR"
shell: " cd /etc/easy-rsa/2.0; source ./vars; export EASY_RSA=\"${EASY_RSA:-.}\"; \"$EASY_RSA\"/pkitool --csr {{ x509_csr_args }} {{ common_name }}"
when: needcert
sudo: true
- name: "Copy CSR to ansible host"
fetch: "src=/etc/easy-rsa/2.0/keys/{{ common_name }}.csr dest=/tmp/{{ common_name }}/ fail_on_missing=yes validate_md5=yes flat=yes"
sudo: true
when: needcert
- name: "Copy CSR to CA"
delegate_to: "{{ x509_ca_server }}"
copy: "src=/tmp/{{ ansible_fqdn }}/{{ common_name }}.csr dest=/etc/easy-rsa/2.0/keys/{{ common_name }}.csr force=yes"
when: needcert
sudo: true
- name: "Sign Certificate"
delegate_to: "{{ x509_ca_server }}"
shell: "source ./vars; export EASY_RSA=\"${EASY_RSA:-.}\" ;\"$EASY_RSA\"/pkitool --sign {{ common_name }}"
args:
chdir: "/etc/easy-rsa/2.0"
sudo: true
when: needcert
- name: "Copy the Certificate to ansible host"
delegate_to: "{{ x509_ca_server }}"
fetch: "src=/etc/easy-rsa/2.0/keys/{{ common_name }}.crt dest=/tmp/{{ common_name }}/ fail_on_missing=yes validate_md5=yes flat=yes"
sudo: true
when: needcert
- name: "Copy the CA Certificate to the ansible host"
delegate_to: "{{ x509_ca_server }}"
fetch: "src=/etc/easy-rsa/2.0/keys/ca.crt dest=/tmp/ca.crt fail_on_missing=yes validate_md5=yes flat=yes"
sudo: true
when: "ca_cert.stat.exists == false"
- name: "Copy the certificate to the node"
copy: "src=/tmp/{{ common_name }}/{{ common_name }}.crt dest={{ x509_cert_file }} force=yes"
sudo: true
when: needcert
- name: "Copy the CA certificate to the node"
copy: "src=/tmp/ca.crt dest={{ x509_cacert_file }}"
sudo: true
when: "ca_cert.stat.exists == false"
- name: "Copy the key to the correct location"
shell: "mkdir -p `dirname {{ x509_key_file }}` ; chmod 700 `dirname {{ x509_key_file }}` ; cp /etc/easy-rsa/2.0/keys/{{ common_name }}.key {{ x509_key_file }}"
sudo: true
when: needcert
#!/usr/bin/env python
import sys, os, string, subprocess, socket, ansible.runner, re
import copy, shlex,uuid, random, multiprocessing, time, shutil
import novaclient.v1_1.client as nvclient
import novaclient.exceptions as nvexceptions
import glanceclient.v2.client as glclient
import keystoneclient.v2_0.client as ksclient
class Authenticate:
def __init__(self, username, passwd):
self.username=username
self.passwd=passwd
self.tenantName= os.environ['OS_TENANT_NAME']
self.authUrl="https://keystone.rc.nectar.org.au:5000/v2.0"
kc = ksclient.Client( auth_url=self.authUrl,
username=self.username,
password=self.passwd)
self.tenantList=kc.tenants.list()
self.novaSemaphore = multiprocessing.BoundedSemaphore(value=1)
def createNovaObject(self,tenantName):
for tenant in self.tenantList:
if tenant.name == tenantName:
try:
nc = nvclient.Client( auth_url=self.authUrl,
username=self.username,
api_key=self.passwd,
project_id=tenant.name,
tenant_id=tenant.id,
service_type="compute"
)
return nc
except nvexceptions.ClientException:
raise
def gatherInfo(self):
for tenant in self.tenantList: print tenant.name
tenantName = raw_input("Please select a project: (Default MCC-On-R@CMON):")
if not tenantName or tenantName not in [tenant.name for tenant in self.tenantList]:
tenantName = "MCC_On_R@CMON"
print tenantName,"selected\n"
## Fetch the Nova Object
nc = self.createNovaObject(tenantName)
## Get the Flavor
flavorList = nc.flavors.list()
for flavor in flavorList: print flavor.name
flavorName = raw_input("Please select a Flavor Name: (Default m1.xxlarge):")
if not flavorName or flavorName not in [flavor.name for flavor in flavorList]:
flavorName = "m1.xxlarge"
print flavorName,"selected\n"
## Get the Availability Zones
az_p1 = subprocess.Popen(shlex.split\
("nova availability-zone-list"),stdout=subprocess.PIPE)
az_p2 = subprocess.Popen(shlex.split\
("""awk '{if ($2 && $2 != "Name")print $2}'"""),\
stdin=az_p1.stdout,stdout=subprocess.PIPE)
availabilityZonesList = subprocess.Popen(shlex.split\
("sort"),stdin=az_p2.stdout,stdout=subprocess.PIPE).communicate()[0]
print availabilityZonesList
availabilityZone = raw_input("Please select an availability zone: (Default monash-01):")
if not availabilityZone or \
availabilityZone not in [ zone for zone in availabilityZonesList.split()]:
availabilityZone = "monash-01"
print availabilityZone,"selected\n"
## Get the number of instances to spawn
numberOfInstances = raw_input\
("Please specify the number of instances to launch: (Default 1):")
if not numberOfInstances or not numberOfInstances.isdigit():
numberOfInstances = 1
subprocess.call(['clear'])
flavorObj = nc.flavors.find(name=flavorName)
print "Creating",numberOfInstances,\
"instance(s) in",availabilityZone,"zone..."
instanceList = []
for counter in range(0,int(numberOfInstances)):
nodeName = "MCC-Node"+str(random.randrange(1,1000))
try:
novaInstance = nc.servers.create\
(name=nodeName,image="ddc13ccd-483c-4f5d-a5fb-4b968aaf385b",\
flavor=flavorObj,key_name="shahaan",\
availability_zone=availabilityZone)
instanceList.append(novaInstance)
except nvexceptions.ClientException:
raise
continue
while 'BUILD' in [novaInstance.status \
for novaInstance in instanceList]:
for count in range(0,len(instanceList)):
time.sleep(5)
if instanceList[count].status != 'BUILD':
continue
else:
try:
instanceList[count] = nc.servers.get(instanceList[count].id)
except (nvexceptions.ClientException, nvexceptions.ConnectionRefused, nvexceptions.InstanceInErrorState):
raise
del instanceList[count]
continue
activeHostsList = []
SSHports = []
for novaInstance in instanceList:
if novaInstance.status == 'ACTIVE':
hostname = socket.gethostbyaddr(novaInstance.networks.values()[0][0])[0]
activeHostsList.append(hostname)
SSHDict = {}
SSHDict['IP'] = novaInstance.networks.values()[0][0]
SSHDict['status'] = 'CLOSED'
SSHports.append(SSHDict)
print "Scanning if port 22 is open..."
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
while 'CLOSED' in [host['status'] for host in SSHports]:
for instance in range(0,len(SSHports)):
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
if SSHports[instance]['status'] == 'CLOSED' and not sock.connect_ex((SSHports[instance]['IP'], 22)):
SSHports[instance]['status'] = 'OPEN'
print "Port 22, opened for IP:",SSHports[instance]['IP']
else:
time.sleep(5)
sock.close()
fr = open('/etc/ansible/hosts.rpmsave','r+')
fw = open('hosts.temp','w+')
lines = fr.readlines()
for line in lines:
fw.write(line)
if re.search('\[new-servers\]',line):
for host in activeHostsList: fw.write(host+'\n')
fr.close()
fw.close()
shutil.move('hosts.temp','/etc/ansible/hosts')
print "Building the Nodes now..."
subprocess.call(shlex.split("/mnt/nectar-nfs/root/swStack/ansible/bin/ansible-playbook /mnt/nectar-nfs/root/ansible-config-root/mcc-nectar-dev/buildNew.yml -v"))
if __name__ == "__main__":
username = os.environ['OS_USERNAME']
passwd = os.environ['OS_PASSWORD']
choice = raw_input(username + " ? (y/n):")
while choice and choice not in ("n","y"):
print "y or n please"
choice = raw_input()
if choice == "n":
username = raw_input("username :")
passwd = raw_input("password :")
auth = Authenticate(username, passwd)
auth.gatherInfo()
0,0,0,1,1,1,1,1,1,0
0,0,0,0,0,0,1,1,0,0
0,0,0,0,1,1,1,1,0,0
1,0,0,0,0,0,0,1,0,0
1,0,1,0,0,0,0,1,0,0
1,0,1,0,0,0,0,1,0,0
1,1,1,0,0,0,0,1,1,0
1,1,1,1,1,1,1,0,0,0
1,0,0,0,0,0,1,0,0,0
0,0,0,0,0,0,0,0,0,0
\ No newline at end of file
docs/ChordDiagramm/Chord_Diagramm.png (1.87 MiB)
#!/usr/bin/env python3
# script copied from https://github.com/fengwangPhysics/matplotlib-chord-diagram/blob/master/README.md
# source data manually edited via https://docs.google.com/spreadsheets/d/1JN9S_A5ICPQOvgyVbWJSFJiw-5gO2vF-4AeYuWl-lbs/edit#gid=0
# chord diagram
import matplotlib.pyplot as plt
from matplotlib.path import Path
import matplotlib.patches as patches
import numpy as np
LW = 0.3
def polar2xy(r, theta):
return np.array([r*np.cos(theta), r*np.sin(theta)])
def hex2rgb(c):
return tuple(int(c[i:i+2], 16)/256.0 for i in (1, 3 ,5))
def IdeogramArc(start=0, end=60, radius=1.0, width=0.2, ax=None, color=(1,0,0)):
# start, end should be in [0, 360)
if start > end:
start, end = end, start
start *= np.pi/180.
end *= np.pi/180.
# optimal distance to the control points
# https://stackoverflow.com/questions/1734745/how-to-create-circle-with-b%C3%A9zier-curves
opt = 4./3. * np.tan((end-start)/ 4.) * radius
inner = radius*(1-width)
verts = [
polar2xy(radius, start),
polar2xy(radius, start) + polar2xy(opt, start+0.5*np.pi),
polar2xy(radius, end) + polar2xy(opt, end-0.5*np.pi),
polar2xy(radius, end),
polar2xy(inner, end),
polar2xy(inner, end) + polar2xy(opt*(1-width), end-0.5*np.pi),
polar2xy(inner, start) + polar2xy(opt*(1-width), start+0.5*np.pi),
polar2xy(inner, start),
polar2xy(radius, start),
]
codes = [Path.MOVETO,
Path.CURVE4,
Path.CURVE4,
Path.CURVE4,
Path.LINETO,
Path.CURVE4,
Path.CURVE4,
Path.CURVE4,
Path.CLOSEPOLY,
]
if ax == None:
return verts, codes
else:
path = Path(verts, codes)
patch = patches.PathPatch(path, facecolor=color+(0.5,), edgecolor=color+(0.4,), lw=LW)
ax.add_patch(patch)
def ChordArc(start1=0, end1=60, start2=180, end2=240, radius=1.0, chordwidth=0.7, ax=None, color=(1,0,0)):
# start, end should be in [0, 360)
if start1 > end1:
start1, end1 = end1, start1
if start2 > end2:
start2, end2 = end2, start2
start1 *= np.pi/180.
end1 *= np.pi/180.
start2 *= np.pi/180.
end2 *= np.pi/180.
opt1 = 4./3. * np.tan((end1-start1)/ 4.) * radius
opt2 = 4./3. * np.tan((end2-start2)/ 4.) * radius
rchord = radius * (1-chordwidth)
verts = [
polar2xy(radius, start1),
polar2xy(radius, start1) + polar2xy(opt1, start1+0.5*np.pi),
polar2xy(radius, end1) + polar2xy(opt1, end1-0.5*np.pi),
polar2xy(radius, end1),
polar2xy(rchord, end1),
polar2xy(rchord, start2),
polar2xy(radius, start2),
polar2xy(radius, start2) + polar2xy(opt2, start2+0.5*np.pi),
polar2xy(radius, end2) + polar2xy(opt2, end2-0.5*np.pi),
polar2xy(radius, end2),
polar2xy(rchord, end2),
polar2xy(rchord, start1),
polar2xy(radius, start1),
]
codes = [Path.MOVETO,
Path.CURVE4,
Path.CURVE4,
Path.CURVE4,
Path.CURVE4,
Path.CURVE4,
Path.CURVE4,
Path.CURVE4,
Path.CURVE4,
Path.CURVE4,
Path.CURVE4,
Path.CURVE4,
Path.CURVE4,
]
if ax == None:
return verts, codes
else:
path = Path(verts, codes)
patch = patches.PathPatch(path, facecolor=color+(0.5,), edgecolor=color+(0.4,), lw=LW)
ax.add_patch(patch)
def selfChordArc(start=0, end=60, radius=1.0, chordwidth=0.7, ax=None, color=(1,0,0)):
# start, end should be in [0, 360)
if start > end:
start, end = end, start
start *= np.pi/180.
end *= np.pi/180.
opt = 4./3. * np.tan((end-start)/ 4.) * radius
rchord = radius * (1-chordwidth)
verts = [
polar2xy(radius, start),
polar2xy(radius, start) + polar2xy(opt, start+0.5*np.pi),
polar2xy(radius, end) + polar2xy(opt, end-0.5*np.pi),
polar2xy(radius, end),
polar2xy(rchord, end),
polar2xy(rchord, start),
polar2xy(radius, start),
]
codes = [Path.MOVETO,
Path.CURVE4,
Path.CURVE4,
Path.CURVE4,
Path.CURVE4,
Path.CURVE4,
Path.CURVE4,
]
if ax == None:
return verts, codes
else:
path = Path(verts, codes)
patch = patches.PathPatch(path, facecolor=color+(0.5,), edgecolor=color+(0.4,), lw=LW)
ax.add_patch(patch)
def chordDiagram(X, ax, colors=None, width=0.1, pad=2, chordwidth=0.7):
"""Plot a chord diagram
Parameters
----------
X :
flux data, X[i, j] is the flux from i to j
ax :
matplotlib `axes` to show the plot
colors : optional
user defined colors in rgb format. Use function hex2rgb() to convert hex color to rgb color. Default: d3.js category10
width : optional
width/thickness of the ideogram arc
pad : optional
gap pad between two neighboring ideogram arcs, unit: degree, default: 2 degree
chordwidth : optional
position of the control points for the chords, controlling the shape of the chords
"""
# X[i, j]: i -> j
x = X.sum(axis = 1) # sum over rows
ax.set_xlim(-1.1, 1.1)
ax.set_ylim(-1.1, 1.1)
if colors is None:
# use d3.js category10 https://github.com/d3/d3-3.x-api-reference/blob/master/Ordinal-Scales.md#category10
colors = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd',
'#8c564b', '#e377c2', '#7f7f7f', '#bcbd22', '#17becf', '#c49c94']
if len(x) > len(colors):
print('x is too large! Use x smaller than 11')
colors = [hex2rgb(colors[i]) for i in range(len(x))]
# find position for each start and end
y = x/np.sum(x).astype(float) * (360 - pad*len(x))
pos = {}
arc = []
nodePos = []
start = 0
for i in range(len(x)):
end = start + y[i]
arc.append((start, end))
angle = 0.5*(start+end)
#print(start, end, angle)
if -30 <= angle <= 210:
angle -= 90
else:
angle -= 270
nodePos.append(tuple(polar2xy(1.1, 0.5*(start+end)*np.pi/180.)) + (angle,))
z = (X[i, :]/x[i].astype(float)) * (end - start)
ids = np.argsort(z)
z0 = start
for j in ids:
pos[(i, j)] = (z0, z0+z[j])
z0 += z[j]
start = end + pad
for i in range(len(x)):
start, end = arc[i]
IdeogramArc(start=start, end=end, radius=1.0, ax=ax, color=colors[i], width=width)
start, end = pos[(i,i)]
selfChordArc(start, end, radius=1.-width, color=colors[i], chordwidth=chordwidth*0.7, ax=ax)
for j in range(i):
color = colors[i]
if X[i, j] > X[j, i]:
color = colors[j]
start1, end1 = pos[(i,j)]
start2, end2 = pos[(j,i)]
ChordArc(start1, end1, start2, end2,
radius=1.-width, color=color, chordwidth=chordwidth, ax=ax)
#print(nodePos)
return nodePos
##################################
if __name__ == "__main__":
fig = plt.figure(figsize=(6,6))
flux = np.array([
[ 0, 1, 0, 0], #OS Sum:2 ; Centos, Ubuntu
[ 0, 0, 0, 0], #Plays
[ 0, 0, 0, 1], # Cluster: Sum5; Generic, M3, Monarch, SHPC, ACCS
[ 0, 0, 1, 2] #Cloud Sum3: AWS,Nimbus,Nectar
])
from numpy import genfromtxt
flux = genfromtxt('Chord_Diagramm - Sheet1.csv', delimiter=',')
ax = plt.axes([0,0,1,1])
#nodePos = chordDiagram(flux, ax, colors=[hex2rgb(x) for x in ['#666666', '#66ff66', '#ff6666', '#6666ff']])
nodePos = chordDiagram(flux, ax)
ax.axis('off')
prop = dict(fontsize=16*0.8, ha='center', va='center')
nodes = ['OS_Centos76','OS_Centos8','OS_Ubuntu1804','PLY_NFSSQL','PLY_MGMT','PLY_Login','PLY_Compute','C_Generic','C_M3','C_Monarch']
#nodes = ['M3_MONARCH','SHPC','Ubuntu','Centos7','Centos8','Tested','Security','Nectar','?AWS?','DGX@Baremetal','ML@M3','CVL@UWA','CVL_SW','CVL_Desktop','Strudel','/usr/local']
for i in range(len(nodes)):
ax.text(nodePos[i][0], nodePos[i][1], nodes[i], rotation=nodePos[i][2], **prop)
plt.savefig("Chord_Diagramm.png", dpi=600,transparent=False,bbox_inches='tight', pad_inches=0.02)
plt.show()
docs/images/ardc.png (8.87 KiB)
docs/images/massive-website-banner.png (22 KiB)
docs/images/monash-university-logo.png (8.27 KiB)
docs/images/university-of-western-australia-logo.png (22.9 KiB)
[SQLNodes]
sql1 ansible_host=192.168.0.1 ansible_user=ubuntu
[NFSNodes]
nfs11 ansible_host=192.168.0.2 ansible_user=ubuntu
[ManagementNodes]
mgmt1 ansible_host=192.168.0.3 ansible_user=ubuntu
mgmt2 ansible_host=192.168.0.4 ansible_user=ubuntu
[LoginNodes]
login1 ansible_host=192.168.0.5 ansible_user=ubuntu
[ComputeNodes]
compute1 ansible_host=192.168.0.6 ansible_user=ubuntu
\ No newline at end of file
---
-
hosts: openvpn-servers
remote_user: ec2-user
roles:
- easy-rsa-common
- easy-rsa-CA
- easy-rsa-certificate
- OpenVPN-Server
- nfs-server
sudo: true
vars:
x509_ca_server: vm-118-138-240-224.erc.monash.edu.au
-
hosts: openvpn-clients
remote_user: ec2-user
roles:
- easy-rsa-common
- easy-rsa-certificate
- OpenVPN-Client
- syncExports
- nfs-client
sudo: true
vars:
x509_ca_server: vm-118-138-240-224.erc.monash.edu.au
openvpn_servers: ['vm-118-138-240-224.erc.monash.edu.au']
nfs_server: "vm-118-138-240-224.erc.monash.edu.au"
- hosts: 'ComputeNodes,DGXRHELNodes'
gather_facts: false
tasks:
- include_vars: vars/ldapConfig.yml
- include_vars: vars/filesystems.yml
- include_vars: vars/slurm.yml
- include_vars: vars/vars.yml
- { name: set use shared state, set_fact: usesharedstatedir=False }
tags: [ never ]
# these are just templates. Note the tag never! Everything with never is only executed if called explicitly, i.e. ansible-playbook --tags=foo,bar OR --tags=tag_group
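# example invocation (the playbook filename here is an assumption): ansible-playbook -i inventory.yml maintenance.yml --tags=uniquetag_foo --limit=ComputeNodes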
- hosts: 'ComputeNodes,DGXRHELNodes'
gather_facts: false
tasks:
- { name: template_shell, shell: ls, tags: [never,tag_group,uniquetag_foo] }
- { name: template_command, command: uname chdir=/bin, tags: [never,tag_group,uniquetag_bar] }
- hosts: 'ComputeNodes,LoginNodes,DGXRHELNodes'
gather_facts: false
tasks:
- { name: kill user bash shells, shell: 'ps aux | grep -i -e bash -e vscode-server -e zsh -e tmux -e sftp-server -e trungn | grep -v -e "ec2-user" -e ubuntu -e philipc -e smichnow | grep -v "root" | sed "s/\ \ */\ /g" | cut -f 2 -d " " | xargs -I{} kill -09 {}', become: true, become_user: root, tags: [never,kickshells]}
- { name: Disable MonARCH Lustre Cron Check, cron: name="Check dmesg for lustre errors" state=absent,become_user: root,become: True ,tags: [never, monarch_disable] }
- name: Re-enable MonARCH Lustre Cron Check
cron: name="Check dmesg for lustre errors" minute="*/5" job="/usr/local/sbin/check_lustre_dmesg.sh >> /tmp/check_lustre_output.txt 2>&1"
become: true
become_user: root
tags: [never, monarch_enable ]
- hosts: 'ManagementNodes'
gather_facts: false
tasks:
- name: prep a mgmt node for shutdown (DO NOT FORGET TO LIMIT; gluster needs 2 out of 3 nodes to run)
block:
# the failover actually works. but it only takes down the primary. so if this would be called from the backup all of slurm would go down
#- { name: force a failover, shell: /opt/slurm-19.05.4/bin/scontrol takeover }
- { name: stop slurmdbd service, service: name=slurmdbd state=stopped }
- { name: stop slurmctld service, service: name=slurmctld state=stopped }
- { name: stop glusterd service, service: name=glusterd state=stopped }
- { name: stop glusterfsd service, service: name=glusterfsd state=stopped }
become: true
tags: [never,prepmgmtshutdown]
- name: verify a mgmt node came up well
block:
# TODO verify vdb is mounted
- { name: start glusterd service, service: name=glusterd state=started }
- { name: start glusterfsd service, service: name=glusterfsd state=started }
- { name: start slurmctld service, service: name=slurmctld state=started }
- { name: start slurmdbd service, service: name=slurmdbd state=started }
become: true
tags: [never,verifymgmtNode]
- hosts: 'SQLNodes'
gather_facts: false
tasks:
- name: prep a sqlnode node for shutdown
block:
- { name: stop mariadb service, service: name=mariadb state=stopped }
- { name: stop glusterd service, service: name=glusterd state=stopped }
- { name: stop glusterfsd service, service: name=glusterfsd state=stopped }
become: true
tags: [never,prepsqlshutdown]
- name: verify an sql node after a restart
block:
- { name: ensure mariadb service runs, service: name=mariadb state=started }
- { name: ensure glusterd service runs, service: name=glusterd state=started }
- { name: ensure glusterfsd service runs, service: name=glusterfsd state=started }
become: true
tags: [never,sqlverify]
- hosts: 'LoginNodes:!perfsonar01'
gather_facts: false
tasks:
- name: set nologin
block:
- include_vars: vars/slurm.yml
- { name: populate nologin file, shell: 'echo "{{ clustername }} is down for a scheduled maintenance." > /etc/nologin', become: true, become_user: root }
- { name: set attribute immutable so will not be deleted, shell: 'chattr +i /etc/nologin', become: true, become_user: root }
become: true
tags: [never,setnologin]
- name: remove nologin
block:
- { name: unset attribute immutable to allow deletion, shell: 'chattr -i /etc/nologin', become: true, become_user: root }
- { name: remove nologin file, file: path=/etc/nologin state=absent, become: true, become_user: root }
become: true
tags: [never,removenologin]
- name: terminate user ssh processes
block:
- { name: kill shells, shell: 'ps aux | grep -i bash | grep -v "ec2-user" | grep -v "root" | sed "s/\ \ */\ /g" | cut -f 2 -d " " | xargs -I{} kill -09 {}', become: true, become_user: root }
- { name: kill rsync sftp scp, shell: 'ps aux | egrep "sleep|sh|rsync|sftp|scp|sftp-server|sshd" | grep -v "ec2-user" | grep -v "root" | sed "s/\ \ */\ /g" | cut -f 2 -d " " | xargs -I{} kill -09 {}', become: true, become_user: root }
- { name: kill vscode, shell: 'pgrep -f vscode | xargs -I{} kill -09 {}', become: true, become_user: root, ignore_errors: true }
become: true
tags: [never,terminateusersshscprsync]
- hosts: 'LoginNodes,ComputeNodes,DGXRHELNodes,GlobusNodes'
gather_facts: false
tasks:
- name: stop lustre and disable service
block:
- { name: stop and disable lustre service, service: name=lustre-client enabled=False state=stopped }
become: true
tags: [never,stopdisablelustre]
- name: start lustre and enable service
block:
- { name: start and enable lustre service, service: name=lustre-client enabled=True state=started }
become: true
tags: [never,startenablelustre16Aug]
- hosts: 'ComputeNodes,LoginNodes,DGXRHELNodes'
gather_facts: false
tasks:
- { name: disable_lustre_service, service: name=lustre-client enabled=no, tags: [never,disable_lustre_service] }
- hosts: 'ComputeNodes,LoginNodes,DGXRHELNodes,ManagementNodes'
gather_facts: false
tasks:
- { name: umount /home, mount: path=/home state=unmounted, become: true, become_user: root, tags: [never,umount_home] }
#this should not really end up in the main branch, but it does not hurt if it does
- hosts: 'ComputeNodes,LoginNodes,DGXRHELNodes,ManagementNodes'
gather_facts: false
tasks:
- { name: umount local-legacy, mount: path=/usr/local-legacy state=absent, become: true, become_user: root, tags: [never,umount_locallegacy] }
#!/bin/sh
#
#mount | grep gvfs | while read -r line ;
#do
# read -ra line_array <<< $line
# echo "umount ${line_array[2]}"
#done
#un-stuck yum
#mv /var/lib/rpm/__db* /tmp/
#mv /var/lib/rpm/.rpm.lock /tmp/
#mv /var/lib/rpm/.dbenv.lock /tmp
#yum clean all
#- hosts: 'all'
#gather_facts: false # not sure if false is clever here
#tasks:
#- include_vars: vars/ldapConfig.yml
#- include_vars: vars/filesystems.yml
#- include_vars: vars/slurm.yml
#- include_vars: vars/vars.yml
#- { name: set use shared state, set_fact: usesharedstatedir=False }
#tags: [ always ]
# this playbook is roughly sorted by
# - hostgroupstopics like ComputeNodes or ComputeNodes,LoginNodes, last VisNodes
# - "tag_groups" each starting after a #comment see #misc or misc tag
- hosts: 'ComputeNodes'
gather_facts: false
tasks:
# these are just templates.
#Note the tag never! Everything with never is only executed if called explicitly, i.e. ansible-playbook --tags=foo,bar OR --tags=tag_group
- { name: template_shell, shell: ls, tags: [never,tag_group,uniquetag_foo] }
- { name: template_command, command: uname chdir=/bin, tags: [never,tag_group,uniquetag_bar] }
- { name: template_script, script: ./scripts/qa/test.sh, tags: [never,tag_group,uniquetag_script] }
#mpi stuff
- { name: run mpi on one computenode, command: ls, args: {chdir: "/tmp"} , failed_when: "TODO is TRUE", tags: [never,mpi,mpi_local,TODO] }
- { name: run mpi on two computenode, command: ls, args: {chdir: "/tmp"} , failed_when: "TODO is TRUE", tags: [never,mpi,mpi_local_two,TODO] }
#- { name: run mpi via sbatch, command: cmd=ls chdir="/tmp" , failed_when: "TODO is TRUE", tags: [never,mpi,slurm_mpi,TODO] }
#- { name: mpi_pinging, command: cmd=ls chdir="/tmp" , failed_when: "TODO is TRUE", tags: [never,mpi,mpi_ping,TODO] }
#module load openmpi/3.1.6-ucx;mpirun --mca btl self --mca pml ucx -x UCX_TLS=mm -n 24 /projects/pMOSP/mpi/parallel_mandelbrot/parallel/mandelbrot
#module load openmpi/3.1.6-ucx;srun mpirun --mca btl self --mca pml ucx -x UCX_TLS=mm -n 24 /projects/pMOSP/mpi/parallel_mandelbrot/parallel/mandelbrot
#slurm
- { name: slurmd should be running, service: name=slurmd state=started, tags: [never,slurm,slurmd] }
- { name: munged should be running, service: name=munged state=started, tags: [never,slurm,munged] }
- { name: ensure connectivity to the controller, shell: scontrol ping, tags: [never,slurm,scontrol_ping] }
- { name: the most simple srun test, shell: srun --reservation=AWX hostname, tags: [never,slurm,srun_hostname] }
#nhc, manually run nhc because it contains many tests
- { name: run nhc explicitly, command: /opt/nhc-1.4.2/sbin/nhc -c /opt/nhc-1.4.2/etc/nhc/nhc.conf, become: true , tags: [never,slurm,nhc] }
# networking
- { name: ping license server, shell: ls, tags: [never,network,ping_license] }
- { name: ping something outside monash, command: ping -c 1 8.8.8.8, tags: [never,network,ping_external] }
#mounts
- hosts: 'ComputeNodes,LoginNodes'
gather_facts: false
tasks:
- { name: check mount for usr_local, shell: "mount | grep -q local", tags: [never,mountpoints,mountpoints_local] }
- { name: check mount for projects, shell: "lfs df -h", tags: [never,mountpoints_projects] }
- { name: check mount for home, shell: "mount | grep -q home", tags: [never,mountpoints,mountpoints_home] }
- { name: check mount for scratch, shell: "mount | grep -q scratch" , tags: [never,mountpoints_scratch] }
#misc
- { name: check singularity, shell: module load octave && octave --version, tags: [never,misc,singularity3] }
- { name: module test, shell: cmd="module load gcc" executable="/bin/bash", tags: [never,misc,modulecmd] }
- { name: contact ldap, shell: maybe test ldapsearch, failed_when: "TODO is TRUE", tags: [never,misc,ldap,TODO] }
#gpu
- hosts: 'VisNodes'
gather_facts: false
tasks:
- { name: run nvidia-smi to see if a gpu driver is present, command: "/bin/nvidia-smi", tags: [never,gpu,smi] }
- { name: run gpu burn defaults to 30 seconds, command: "/usr/local/gpu_burn/1.0/run_silent.sh", tags: [never,gpu,long,gpuburn] }
# extended time-consuming tests
# relion see https://docs.massive.org.au/communities/cryo-em/tuning/tuning.html
# linpack
#module load openmpi/1.10.7-mlx;ldd /usr/local/openmpi/1.10.7-mlx/bin/* | grep -ic found
#!/usr/bin/python
import subprocess
import sys
def getTime():
print "How long do you think you need this computer for?"
print "If you need the computer for 2 days and 12 hours please enter as 2-12 or 2-12:00:00"
time=sys.stdin.readline().strip()
try:
(days,hours)=time.split('-')
except:
days=0
hours=time
try:
(hours,minues) = time.split(':')
except:
pass
return (days,hours)
def getNCPUs():
print "How many CPUs would you like?"
cpus=None
while cpus==None:
cpustr=sys.stdin.readline().strip()
try:
cpus=int(cpustr)
except:
print "Sorry I can't interpret %s as a number"%cpustr
print "How many CPUs would you like?"
return cpus
def getRAM():
print "How much RAM would you like (press enter for the default)?"
ramstr= sys.stdin.readline().strip()
while ramstr!=None and ramstr!="":
try:
ram=int(ramstr)
return ram
except:
print "Sorry I can't interpret %s as a number"%ramstr
print "How much RAM would you like?"
ramstr= sys.stdin.readline()
return None
def subjob(time,cpus,ram):
if ram==None:
ram=cpus*2000
import subprocess
scriptpath='/home/chines'
p=subprocess.Popen(['sbatch','--time=%s-%s'%(time[0],time[1]),'--nodes=1','--mincpu=%s'%cpus,'--mem=%s'%ram,'%s/mbpjob.sh'%scriptpath],stdout=subprocess.PIPE,stderr=subprocess.PIPE)
(stdout,stderr)=p.communicate()
import re
m=re.match('Submitted batch job (?P<jobid>[0-9]+)',stdout)
if m:
return m.groupdict()['jobid']
def isState(jobid,state='RUNNING'):
import re
p=subprocess.Popen(['scontrol','show','job','-d',jobid],stdout=subprocess.PIPE,stderr=subprocess.PIPE)
(stdout,stderr)=p.communicate()
jobidre=re.compile('JobId=(?P<jobid>[0-9]+)\s')
statere=re.compile('^\s+JobState=(?P<state>\S+)\s')
currentjobid=None
for l in stdout.splitlines():
m=jobidre.match(l)
if m:
currentjobid=m.groupdict()['jobid']
m=statere.match(l)
if m:
if m.groupdict()['state']==state:
if jobid==currentjobid:
return True
else:
if jobid==currentjobid:
return False
return False
def waitjob(jobid):
import time
while True:
if isState(jobid,'RUNNING'):
return
else:
print "job %s not running"%jobid
time.sleep(1)
def listJobs():
import re
r=[]
    user = subprocess.check_output(['whoami']).strip()
    jobs = subprocess.check_output(['squeue','-u',user,'-h','-o','%i %L %j %c'])
    jobre=re.compile("(?P<jobid>[0-9]+) (?P<time>\S+) (?P<jobname>\S+) (?P<cpus>[0-9]+)$")
    for l in jobs.splitlines():
        m=jobre.search(l)
if m:
r.append(m.groupdict())
return r
def getNode(jobid):
import re
stdout=subprocess.check_output(['scontrol','show','job','-d',jobid])
for l in stdout.splitlines():
m=re.search('^\s+Nodes=(?P<nodelist>\S+)\s',l)
if m:
nodes=m.groupdict()['nodelist'].split(',')
return nodes[0]
def createJob(*args,**kwargs):
time=getTime()
#cpus=getNCPUs()
cpus=1
#ram=getRAM()
ram=None
subjob(time,cpus,ram)
def selectJob(jobidlist):
if len(jobidlist)==1:
return jobidlist[0]['jobid']
else:
print "Please select a job (or press enter to cancel)"
i=1
print "\tJob name\tNum CPUs\tRemaining Time"
        for j in jobidlist:
            print "%s\t%s\t%s\t%s"%(i,j['jobname'],j['cpus'],j['time'])
            i=i+1
try:
jobnum=int(sys.stdin.readline().strip())
            if (jobnum>0 and jobnum<=len(jobidlist)):
return jobidlist[jobnum-1]['jobid']
except:
pass
return None
def connect(*args,**kwargs):
jobidlist=listJobs()
jobid=selectJob(jobidlist)
if jobid!=None:
waitjob(jobid)
node=getNode(jobid)
print node
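# stopjob is called by stop() below but is not defined anywhere in this script;
# a minimal sketch, assuming cancelling the allocation via scancel is the intended behaviour
def stopjob(jobid):
    subprocess.call(['scancel', jobid])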
def stop(*args,**kwargs):
jobidlist=listJobs()
jobid=selectJob(jobidlist)
if jobid!=None:
stopjob(jobid)
def main():
import argparse
parser = argparse.ArgumentParser()
subparser = parser.add_subparsers()
    start = subparser.add_parser('start', help='allocate a node to the user')
start.set_defaults(func=createJob)
    connect_parser = subparser.add_parser('connect')
    connect_parser.set_defaults(func=connect)
    stop_parser = subparser.add_parser('stop')
    stop_parser.set_defaults(func=stop)
args = parser.parse_args()
args.func(args)
try:
jobidlist=listJobs()
if len(jobidlist)>1:
print "cancel all jobs here"
jobidlist=listJobs()
if len(jobidlist)==0:
time=getTime()
#cpus=getNCPUs()
cpus=1
#ram=getRAM()
ram=None
subjob(time,cpus,ram)
jobidlist=listJobs()
if len(jobidlist)==1:
jobid=jobidlist[0]['jobid']
waitjob(jobid)
node=getNode(jobid)
print node
sys.exit(0)
except Exception as e:
print e
import traceback
print traceback.format_exc()
sys.exit(1)
main()
#!/bin/bash
mbpctrl='/home/hines/mbp_script/get_node.py'
node=$( $mbpctrl $1 )
if [[ $node ]]; then
ssh -t $node tmux attach-session
fi
---
- name: make sure /usr/local/bin exists
file: path=/usr/local/bin state=directory mode=755 owner=root
become: true
- name: install get_node.py
copy: src=get_node.py dest=/usr/local/bin/get_node.py mode=755 owner=root
become: true
- name: install mbp_node
copy: src=mbp_node dest=/usr/local/bin/mbp_node mode=755 owner=root
become: true
---
# This role fixes a misconfiguration of some OpenStack base images at Monash University:
# /dev/vdb is mounted in the image's fstab, but the OpenStack flavour does not provide a second disk.
- name: unmount vdb if absent
mount:
path: "/mnt"
src: "/dev/vdb"
state: absent
become: true
when: 'hostvars[inventory_hostname]["ansible_devices"]["vdb"] is not defined'
- name: keep mnt present
file:
path: "/mnt"
owner: root
group: root
mode: "u=rwx,g=rx,o=rx"
state: directory
become: true
when: 'hostvars[inventory_hostname]["ansible_devices"]["vdb"] is not defined'