Showing 1497 additions and 292 deletions
nhc_version: 1.4.2
nhc_src_url: https://codeload.github.com/mej/nhc/tar.gz/refs/tags/1.4.2
nhc_src_checksum: "sha1:766762d2c8cd81204b92d4921fb5b66616351412"
nhc_src_dir: /opt/src/nhc-1.4.2
nhc_dir: /opt/nhc-1.4.2
# previous revision:
# slurm_version: 21.08.8
# slurm_src_url: https://download.schedmd.com/slurm/slurm-21.08.8.tar.bz2
# slurm_src_checksum: "sha1:7d37dbef37b25264a1593ef2057bc423e4a89e81"
slurm_version: 22.05.3
slurm_src_url: https://download.schedmd.com/slurm/slurm-22.05.3.tar.bz2
slurm_src_checksum: "sha1:55e9a1a1d2ddb67b119c2900982c908ba2846c1e"
slurm_src_dir: /opt/src/slurm-{{ slurm_version }}
slurm_dir: /opt/slurm-{{ slurm_version }}
ucx_version: 1.8.0
ucx_src_url: https://github.com/openucx/ucx/releases/download/v1.8.0/ucx-1.8.0.tar.gz
ucx_src_checksum: "sha1:96f2fe1918127edadcf5b195b6532da1da3a74fa"
ucx_src_dir: /opt/src/ucx-1.8.0
ucx_dir: /opt/ucx-1.8.0
munge_version: 0.5.14
munge_src_url: https://github.com/dun/munge/archive/refs/tags/munge-0.5.14.tar.gz
munge_src_checksum: "sha1:70f6062b696c6d4f17b1d3bdc47c3f5eca24757c"
munge_dir: /opt/munge-0.5.14
munge_src_dir: /opt/src/munge-munge-0.5.14
nvidia_mig_parted_version: 0.1.3
nvidia_mig_parted_src_url: https://github.com/NVIDIA/mig-parted/archive/refs/tags/v0.1.3.tar.gz
nvidia_mig_parted_src_checksum: "sha1:50597b4a94348c3d52b3234bb22783fa236f1d53"
nvidia_mig_parted_src_dir: /opt/src/mig-parted-0.1.3
nvidia_mig_slurm_discovery_version: master
nvidia_mig_slurm_discovery_src_url: https://gitlab.com/nvidia/hpc/slurm-mig-discovery.git
nvidia_mig_slurm_discovery_src_dir: /opt/src/mig-slurm_discovery
nvidia_cuda_version: cuda
nvidia_libcudnn_version: libcudnn8-dev
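The build roles consume these variables when fetching and unpacking each source tarball. As a rough sketch only (the task below is illustrative and is not copied from the actual roles, which may download and build the sources differently, and the /tmp staging path is an assumption), the Slurm variables could be used along these lines:

- name: fetch the slurm source tarball  # illustrative sketch, not the real role
  get_url:
    url: "{{ slurm_src_url }}"
    dest: "/tmp/slurm-{{ slurm_version }}.tar.bz2"
    checksum: "{{ slurm_src_checksum }}"
- name: unpack it into the expected source directory
  unarchive:
    src: "/tmp/slurm-{{ slurm_version }}.tar.bz2"
    dest: /opt/src
    remote_src: yes
    creates: "{{ slurm_src_dir }}"

The same pattern applies to the nhc, ucx, munge and mig-parted sources above, each verified against its sha1 checksum.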
**HPCasCode** (formerly ansible_cluster_in_a_box)
=================================================
The aim of this repo is to provide a set of Ansible roles that can be used to deploy a cluster.
We are working from
https://docs.google.com/a/monash.edu/spreadsheets/d/1IZNE7vMid_SHYxImGVtQcNUiUIrs_Nu1xqolyblr0AE/edit#gid=0
as our architecture document.
[![pipeline status](https://gitlab.erc.monash.edu.au/hpc-team/ansible_cluster_in_a_box/badges/cicd/pipeline.svg)](https://gitlab.erc.monash.edu.au/hpc-team/ansible_cluster_in_a_box/commits/cicd) [Issues](https://gitlab.erc.monash.edu.au/hpc-team/ansible_cluster_in_a_box/-/issues)
We aim to make these roles as generic as possible. You should be able to start from an inventory file, an SSH key and a git clone of this repository and end up with a working cluster. In the longer term we might branch out to include utilities that build an inventory file from NeCTAR credentials.
1. [ Introduction / Purpose ](#Introduction)
2. [ Getting started ](#gettingstarted)
3. [ Features ](#Features)
4. [ Assumptions ](#Assumptions)
5. [ Configuration ](#Configuration)
6. [ Contribute and Collaborate ](#Contribute)
7. [ Used by ](#partners)
8. [ CICD Coverage ](#Coverage)
9. [ Roadmap ](#Roadmap)
If you need a password, use get_or_make_password.py (delegated to the password server/localhost) to generate a random one that can be shared between nodes.
Here is an example task (taken from setting up karaage):
- name: mysql db
mysql_db: name=karaage login_user=root login_password={{ sqlrootPasswd.stdout }}
<a name="Introduction"></a>
## Introduction / Purpose TODO
- name: karaage sql password
shell: ~/get_or_make_passwd.py karaageSQL
delegate_to: 127.0.0.1
register: karaageSqlPassword
The purpose of this repository is to deploy **H**igh **P**erformance **C**omputing [HPC] systems using Infrastructure-as-Code [IaC] principles, predominantly with Ansible. The aim is to apply the principles of Infrastructure as Code to HPC systems.
- name: mysql user
mysql_user: name='karaage' password={{ item }} priv=karaage.*:ALL state=present login_user=root login_password={{ sqlrootPasswd.stdout }}
with_items: karaageSqlPassword.stdout
By encoding the system state, the following advantages are gained, which also define the values of this project:
- **collaboration** it is simply easier to share code than systems. This repository also aims to serve as a place for backup and discussion, and in the near future even documentation :)
- **redeployability** if a system follows the same build recipes and the deployment is automated, all installations should be similar, including Test and Production. We would also use the term Immutable Infrastructure to describe this state.
- **CICD automation** test and change automation allow us to put safeguards in place, increasing our quality and easing the burden of change management.
- **modular and reusable** for example, we currently support two operating systems as well as bare-metal and OpenStack-based deployments. AWS support is also in scope.
The scope of this repository starts from an Ansible inventory (although OpenStack and later AWS will be supported) and ends with a GPU-powered desktop provisioning system on top of the resource scheduler Slurm. Services such as an identity system or package repository management are in scope but remain future work.
We aim to make these roles run on all common Linux platforms (both RedHat- and Debian-derived), but at the very least they should work on a CentOS 6 install.
<a name="gettingstarted"></a>
## Getting Started TODO
Ok, all caps incoming: TAKE CARE OF CICD/vars/passwords.yml. Best case, encrypt it with ansible-vault (`ansible-vault encrypt CICD/vars/passwords.yml`) as soon as you start modifying that file.
Given the current state of the documentation, please feel free to reach out and ask for clarification. This project is far from done or perfect!
YAML syntax can be checked at http://www.yamllint.com/
<a name="Features"></a>
## Features
- Currently supports CentOS 7, CentOS 8 and Ubuntu 18.04
- CICD tested, including spawning a cluster, MPI and Slurm tests
- Rolling node updates are currently a work in progress
- Coming up: Strudel2 desktop integration
<a name="Assumptions"></a>
## Assumptions
- The Ansible inventory file needs to define the following node types: 1x [SQLNodes], 1x [NFSNodes], 2x [ManagementNodes], >=0 LoginNodes, >=0 ComputeNodes, >=0 VisNodes (GPU nodes). SQL and NFS can be combined on one host. The ManagementNodes manage Slurm access and job submission. Here is an [example](docs/sampleinventory.yml).
- The filesystem layout assumes the following shared drives: /home, /projects for project data, /scratch for fast compute storage, and /usr/local for provided software; see [here](CICD/files/etcExports).
- TODO: rewrite. A populated vars folder, as in the CICD subfolder, is required. See the Configuration section for details, and apologies.
- The software stack is currently provided as a mounted shared drive. Software can be loaded via Environment Modules. Containerisation is also heavily in use. Please reach out for further details and sharing requests.
- Networking is partially defined on this level and partially on the bare-metal level. For now this is a point to reach out and discuss. Current implementations support public-facing LoginNodes with the rest in a private network behind a NAT gateway, or a private InfiniBand network in conjunction with a 1G network.
- A shared SSH key for a management user is required.
<a name="Configuration"></a>
## Configuration
Configuration is defined in the variables (vars) in the vars folder; see CICD/vars. These vars are evolving over time with the necessary refactoring and abstraction. They have definitely grown organically while this set of Ansible roles was developed, first for a single system and then for multiple systems. The best way to understand their usage currently is to `grep VARIABLENAME` in this repository and see how each variable is used. This is not pretty for external use; let's not even pretend it is. A minimal sketch of the vars files follows the list below.
- filesystems.yml covers filesystem types and exports/mounts
- ldapConfig.yml covers identity via LDAP
- names.yml to store the domain name
- passwords.yml containing passwords in a single file to be encrypted via ansible-vault
- slurm.yml containing all slurm variables
- vars.yml contains variables which don't belong anywhere else
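As a rough illustration only (the file names are real, but the keys below are hypothetical placeholders rather than the exact variables the roles expect), a populated vars folder might contain entries such as:

# CICD/vars/names.yml
domain: cluster.example.org

# CICD/vars/passwords.yml (encrypt with ansible-vault before committing, as noted in Getting Started)
mysql_root_password: "CHANGEME"
munge_key_seed: "CHANGEME"

The real variable names can be found by grepping the roles as described above.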
<a name="Contribute"></a>
## How do I contribute or collaborate
- Get in contact, use the issue tracker, and if you want to contribute documentation, code or anything else, we offer hand-holding for the first merge request.
- A great first step is to get in contact, tell us what you want to know and help us improve the documentation.
- Please contact us via andreas.hamacher(at)monash.edu or help(at)massive.org.au
- Contribution guidelines would also be a good contribution :)
<a name="partners"></a>
## Used by:
![Monash University](docs/images/monash-university-logo.png "monash.edu")
![MASSIVE](docs/images/massive-website-banner.png "massive.org.au")
![Australian Research Data Commons](docs/images/ardc.png "ardc.edu.au")
![University of Western Australia](docs/images/university-of-western-australia-logo.png "uwa.edu.au")
<a name="Coverage"></a>
## CI Coverage
- CentOS 7.8, CentOS 8, Ubuntu 18.04
- All node types as outlined in Assumptions
- vars for MASSIVE, MonARCH and a generic cluster (see files in CICD/vars)
- CD in progress using autoupdate.py
<a name="Roadmap"></a>
## Roadmap
- This section will soon be moved into the Milestones on GitLab.
- Desktop integration using Strudel2. Contributors are welcome to integrate OpenOnDemand.
- CVL integration; see github.com/Characterisation-Virtual-Laboratory
- Automated security checks as part of the CICD pipeline
- Integration of a FOSS identity system; currently only a token LDAP is supported
- System status monitoring and alerting
[Nice read titled Infrastructure as Code DevOps principle: meaning, benefits, use cases](https://medium.com/@FedakV/infrastructure-as-code-devops-principle-meaning-benefits-use-cases-a4461a1fef2)
---
- name: "Check client ca certificate"
register: ca_cert
stat: "path={{ x509_cacert_file }}"
- name: "Check certificate and key"
shell: (openssl x509 -noout -modulus -in {{ x509_cert_file }} | openssl md5 ; openssl rsa -noout -modulus -in {{ x509_key_file }} | openssl md5) | uniq | wc -l
register: certcheck
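# certcheck.stdout is '1' when the certificate and key moduli match and '2' when they differ; '2' triggers regeneration below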
- name: "Check certificate"
register: cert
stat: "path={{ x509_cert_file }}"
- name: "Check key"
register: key
stat: "path={{ x509_key_file }}"
sudo: true
- name: "Default: we don't need a new certificate"
set_fact: needcert=False
- name: "Set need cert if key is missing"
set_fact: needcert=True
when: key.stat.exists == false
- name: "set needcert if cert is missing"
set_fact: needcert=True
when: cert.stat.exists == false
- name: "set needcert if cert doesn't match key"
set_fact: needcert=True
when: certcheck.stdout == '2'
- name: "Creating Keypair"
shell: "echo noop when using easy-rsa"
when: needcert
- name: "Creating CSR"
shell: " cd /etc/easy-rsa/2.0; source ./vars; export EASY_RSA=\"${EASY_RSA:-.}\"; \"$EASY_RSA\"/pkitool --csr {{ x509_csr_args }} {{ common_name }}"
when: needcert
sudo: true
- name: "Copy CSR to ansible host"
fetch: "src=/etc/easy-rsa/2.0/keys/{{ common_name }}.csr dest=/tmp/{{ common_name }}/ fail_on_missing=yes validate_md5=yes flat=yes"
sudo: true
when: needcert
- name: "Copy CSR to CA"
delegate_to: "{{ x509_ca_server }}"
copy: "src=/tmp/{{ ansible_fqdn }}/{{ common_name }}.csr dest=/etc/easy-rsa/2.0/keys/{{ common_name }}.csr force=yes"
when: needcert
sudo: true
- name: "Sign Certificate"
delegate_to: "{{ x509_ca_server }}"
shell: "source ./vars; export EASY_RSA=\"${EASY_RSA:-.}\" ;\"$EASY_RSA\"/pkitool --sign {{ common_name }}"
args:
chdir: "/etc/easy-rsa/2.0"
sudo: true
when: needcert
- name: "Copy the Certificate to ansible host"
delegate_to: "{{ x509_ca_server }}"
fetch: "src=/etc/easy-rsa/2.0/keys/{{ common_name }}.crt dest=/tmp/{{ common_name }}/ fail_on_missing=yes validate_md5=yes flat=yes"
sudo: true
when: needcert
- name: "Copy the CA Certificate to the ansible host"
delegate_to: "{{ x509_ca_server }}"
fetch: "src=/etc/easy-rsa/2.0/keys/ca.crt dest=/tmp/ca.crt fail_on_missing=yes validate_md5=yes flat=yes"
sudo: true
when: "ca_cert.stat.exists == false"
- name: "Copy the certificate to the node"
copy: "src=/tmp/{{ common_name }}/{{ common_name }}.crt dest={{ x509_cert_file }} force=yes"
sudo: true
when: needcert
- name: "Copy the CA certificate to the node"
copy: "src=/tmp/ca.crt dest={{ x509_cacert_file }}"
sudo: true
when: "ca_cert.stat.exists == false"
- name: "Copy the key to the correct location"
shell: "mkdir -p `dirname {{ x509_key_file }}` ; chmod 700 `dirname {{ x509_key_file }}` ; cp /etc/easy-rsa/2.0/keys/{{ common_name }}.key {{ x509_key_file }}"
sudo: true
when: needcert
#!/usr/bin/env python
import sys, os, string, subprocess, socket, ansible.runner, re
import copy, shlex,uuid, random, multiprocessing, time, shutil
import novaclient.v1_1.client as nvclient
import novaclient.exceptions as nvexceptions
import glanceclient.v2.client as glclient
import keystoneclient.v2_0.client as ksclient
class Authenticate:
def __init__(self, username, passwd):
self.username=username
self.passwd=passwd
self.tenantName= os.environ['OS_TENANT_NAME']
self.authUrl="https://keystone.rc.nectar.org.au:5000/v2.0"
kc = ksclient.Client( auth_url=self.authUrl,
username=self.username,
password=self.passwd)
self.tenantList=kc.tenants.list()
self.novaSemaphore = multiprocessing.BoundedSemaphore(value=1)
def createNovaObject(self,tenantName):
for tenant in self.tenantList:
if tenant.name == tenantName:
try:
nc = nvclient.Client( auth_url=self.authUrl,
username=self.username,
api_key=self.passwd,
project_id=tenant.name,
tenant_id=tenant.id,
service_type="compute"
)
return nc
except nvexceptions.ClientException:
raise
def gatherInfo(self):
for tenant in self.tenantList: print tenant.name
tenantName = raw_input("Please select a project: (Default MCC-On-R@CMON):")
if not tenantName or tenantName not in [tenant.name for tenant in self.tenantList]:
tenantName = "MCC_On_R@CMON"
print tenantName,"selected\n"
## Fetch the Nova Object
nc = self.createNovaObject(tenantName)
## Get the Flavor
flavorList = nc.flavors.list()
for flavor in flavorList: print flavor.name
flavorName = raw_input("Please select a Flavor Name: (Default m1.xxlarge):")
if not flavorName or flavorName not in [flavor.name for flavor in flavorList]:
flavorName = "m1.xxlarge"
print flavorName,"selected\n"
## Get the Availability Zones
az_p1 = subprocess.Popen(shlex.split\
("nova availability-zone-list"),stdout=subprocess.PIPE)
az_p2 = subprocess.Popen(shlex.split\
("""awk '{if ($2 && $2 != "Name")print $2}'"""),\
stdin=az_p1.stdout,stdout=subprocess.PIPE)
availabilityZonesList = subprocess.Popen(shlex.split\
("sort"),stdin=az_p2.stdout,stdout=subprocess.PIPE).communicate()[0]
print availabilityZonesList
availabilityZone = raw_input("Please select an availability zone: (Default monash-01):")
if not availabilityZone or \
availabilityZone not in [ zone for zone in availabilityZonesList.split()]:
availabilityZone = "monash-01"
print availabilityZone,"selected\n"
## Get the number of instances to spawn
numberOfInstances = raw_input\
("Please specify the number of instances to launch: (Default 1):")
if not numberOfInstances or not numberOfInstances.isdigit():
numberOfInstances = 1
subprocess.call(['clear'])
flavorObj = nc.flavors.find(name=flavorName)
print "Creating",numberOfInstances,\
"instance(s) in",availabilityZone,"zone..."
instanceList = []
for counter in range(0,int(numberOfInstances)):
nodeName = "MCC-Node"+str(random.randrange(1,1000))
try:
novaInstance = nc.servers.create\
(name=nodeName,image="ddc13ccd-483c-4f5d-a5fb-4b968aaf385b",\
flavor=flavorObj,key_name="shahaan",\
availability_zone=availabilityZone)
instanceList.append(novaInstance)
except nvexceptions.ClientException:
raise
continue
while 'BUILD' in [novaInstance.status \
for novaInstance in instanceList]:
for count in range(0,len(instanceList)):
time.sleep(5)
if instanceList[count].status != 'BUILD':
continue
else:
try:
instanceList[count] = nc.servers.get(instanceList[count].id)
except (nvexceptions.ClientException, nvexceptions.ConnectionRefused, nvexceptions.InstanceInErrorState):
raise
del instanceList[count]
continue
activeHostsList = []
SSHports = []
for novaInstance in instanceList:
if novaInstance.status == 'ACTIVE':
hostname = socket.gethostbyaddr(novaInstance.networks.values()[0][0])[0]
activeHostsList.append(hostname)
SSHDict = {}
SSHDict['IP'] = novaInstance.networks.values()[0][0]
SSHDict['status'] = 'CLOSED'
SSHports.append(SSHDict)
print "Scanning if port 22 is open..."
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
while 'CLOSED' in [host['status'] for host in SSHports]:
for instance in range(0,len(SSHports)):
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
if SSHports[instance]['status'] == 'CLOSED' and not sock.connect_ex((SSHports[instance]['IP'], 22)):
SSHports[instance]['status'] = 'OPEN'
print "Port 22, opened for IP:",SSHports[instance]['IP']
else:
time.sleep(5)
sock.close()
fr = open('/etc/ansible/hosts.rpmsave','r+')
fw = open('hosts.temp','w+')
lines = fr.readlines()
for line in lines:
fw.write(line)
if re.search('\[new-servers\]',line):
for host in activeHostsList: fw.write(host+'\n')
fr.close()
fw.close()
shutil.move('hosts.temp','/etc/ansible/hosts')
print "Building the Nodes now..."
subprocess.call(shlex.split("/mnt/nectar-nfs/root/swStack/ansible/bin/ansible-playbook /mnt/nectar-nfs/root/ansible-config-root/mcc-nectar-dev/buildNew.yml -v"))
if __name__ == "__main__":
username = os.environ['OS_USERNAME']
passwd = os.environ['OS_PASSWORD']
choice = raw_input(username + " ? (y/n):")
while choice and choice not in ("n","y"):
print "y or n please"
choice = raw_input()
if choice == "n":
username = raw_input("username :")
passwd = raw_input("password :")
auth = Authenticate(username, passwd)
auth.gatherInfo()
0,0,0,1,1,1,1,1,1,0
0,0,0,0,0,0,1,1,0,0
0,0,0,0,1,1,1,1,0,0
1,0,0,0,0,0,0,1,0,0
1,0,1,0,0,0,0,1,0,0
1,0,1,0,0,0,0,1,0,0
1,1,1,0,0,0,0,1,1,0
1,1,1,1,1,1,1,0,0,0
1,0,0,0,0,0,1,0,0,0
0,0,0,0,0,0,0,0,0,0
\ No newline at end of file
docs/ChordDiagramm/Chord_Diagramm.png (1.87 MiB)
#!/usr/bin/env python3
# script copied from https://github.com/fengwangPhysics/matplotlib-chord-diagram/blob/master/README.md
# source data manually edited via https://docs.google.com/spreadsheets/d/1JN9S_A5ICPQOvgyVbWJSFJiw-5gO2vF-4AeYuWl-lbs/edit#gid=0
# chord diagram
import matplotlib.pyplot as plt
from matplotlib.path import Path
import matplotlib.patches as patches
import numpy as np
LW = 0.3
def polar2xy(r, theta):
return np.array([r*np.cos(theta), r*np.sin(theta)])
def hex2rgb(c):
return tuple(int(c[i:i+2], 16)/256.0 for i in (1, 3 ,5))
def IdeogramArc(start=0, end=60, radius=1.0, width=0.2, ax=None, color=(1,0,0)):
# start, end should be in [0, 360)
if start > end:
start, end = end, start
start *= np.pi/180.
end *= np.pi/180.
# optimal distance to the control points
# https://stackoverflow.com/questions/1734745/how-to-create-circle-with-b%C3%A9zier-curves
opt = 4./3. * np.tan((end-start)/ 4.) * radius
inner = radius*(1-width)
verts = [
polar2xy(radius, start),
polar2xy(radius, start) + polar2xy(opt, start+0.5*np.pi),
polar2xy(radius, end) + polar2xy(opt, end-0.5*np.pi),
polar2xy(radius, end),
polar2xy(inner, end),
polar2xy(inner, end) + polar2xy(opt*(1-width), end-0.5*np.pi),
polar2xy(inner, start) + polar2xy(opt*(1-width), start+0.5*np.pi),
polar2xy(inner, start),
polar2xy(radius, start),
]
codes = [Path.MOVETO,
Path.CURVE4,
Path.CURVE4,
Path.CURVE4,
Path.LINETO,
Path.CURVE4,
Path.CURVE4,
Path.CURVE4,
Path.CLOSEPOLY,
]
if ax == None:
return verts, codes
else:
path = Path(verts, codes)
patch = patches.PathPatch(path, facecolor=color+(0.5,), edgecolor=color+(0.4,), lw=LW)
ax.add_patch(patch)
def ChordArc(start1=0, end1=60, start2=180, end2=240, radius=1.0, chordwidth=0.7, ax=None, color=(1,0,0)):
# start, end should be in [0, 360)
if start1 > end1:
start1, end1 = end1, start1
if start2 > end2:
start2, end2 = end2, start2
start1 *= np.pi/180.
end1 *= np.pi/180.
start2 *= np.pi/180.
end2 *= np.pi/180.
opt1 = 4./3. * np.tan((end1-start1)/ 4.) * radius
opt2 = 4./3. * np.tan((end2-start2)/ 4.) * radius
rchord = radius * (1-chordwidth)
verts = [
polar2xy(radius, start1),
polar2xy(radius, start1) + polar2xy(opt1, start1+0.5*np.pi),
polar2xy(radius, end1) + polar2xy(opt1, end1-0.5*np.pi),
polar2xy(radius, end1),
polar2xy(rchord, end1),
polar2xy(rchord, start2),
polar2xy(radius, start2),
polar2xy(radius, start2) + polar2xy(opt2, start2+0.5*np.pi),
polar2xy(radius, end2) + polar2xy(opt2, end2-0.5*np.pi),
polar2xy(radius, end2),
polar2xy(rchord, end2),
polar2xy(rchord, start1),
polar2xy(radius, start1),
]
codes = [Path.MOVETO,
Path.CURVE4,
Path.CURVE4,
Path.CURVE4,
Path.CURVE4,
Path.CURVE4,
Path.CURVE4,
Path.CURVE4,
Path.CURVE4,
Path.CURVE4,
Path.CURVE4,
Path.CURVE4,
Path.CURVE4,
]
if ax == None:
return verts, codes
else:
path = Path(verts, codes)
patch = patches.PathPatch(path, facecolor=color+(0.5,), edgecolor=color+(0.4,), lw=LW)
ax.add_patch(patch)
def selfChordArc(start=0, end=60, radius=1.0, chordwidth=0.7, ax=None, color=(1,0,0)):
# start, end should be in [0, 360)
if start > end:
start, end = end, start
start *= np.pi/180.
end *= np.pi/180.
opt = 4./3. * np.tan((end-start)/ 4.) * radius
rchord = radius * (1-chordwidth)
verts = [
polar2xy(radius, start),
polar2xy(radius, start) + polar2xy(opt, start+0.5*np.pi),
polar2xy(radius, end) + polar2xy(opt, end-0.5*np.pi),
polar2xy(radius, end),
polar2xy(rchord, end),
polar2xy(rchord, start),
polar2xy(radius, start),
]
codes = [Path.MOVETO,
Path.CURVE4,
Path.CURVE4,
Path.CURVE4,
Path.CURVE4,
Path.CURVE4,
Path.CURVE4,
]
if ax == None:
return verts, codes
else:
path = Path(verts, codes)
patch = patches.PathPatch(path, facecolor=color+(0.5,), edgecolor=color+(0.4,), lw=LW)
ax.add_patch(patch)
def chordDiagram(X, ax, colors=None, width=0.1, pad=2, chordwidth=0.7):
"""Plot a chord diagram
Parameters
----------
X :
flux data, X[i, j] is the flux from i to j
ax :
matplotlib `axes` to show the plot
colors : optional
user defined colors in rgb format. Use function hex2rgb() to convert hex color to rgb color. Default: d3.js category10
width : optional
width/thickness of the ideogram arc
pad : optional
gap pad between two neighboring ideogram arcs, unit: degree, default: 2 degree
chordwidth : optional
position of the control points for the chords, controlling the shape of the chords
"""
# X[i, j]: i -> j
x = X.sum(axis = 1) # sum over rows
ax.set_xlim(-1.1, 1.1)
ax.set_ylim(-1.1, 1.1)
if colors is None:
# use d3.js category10 https://github.com/d3/d3-3.x-api-reference/blob/master/Ordinal-Scales.md#category10
colors = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd',
'#8c564b', '#e377c2', '#7f7f7f', '#bcbd22', '#17becf', '#c49c94']
if len(x) > len(colors):
print('x is too large! Use x smaller than 11')
colors = [hex2rgb(colors[i]) for i in range(len(x))]
# find position for each start and end
y = x/np.sum(x).astype(float) * (360 - pad*len(x))
pos = {}
arc = []
nodePos = []
start = 0
for i in range(len(x)):
end = start + y[i]
arc.append((start, end))
angle = 0.5*(start+end)
#print(start, end, angle)
if -30 <= angle <= 210:
angle -= 90
else:
angle -= 270
nodePos.append(tuple(polar2xy(1.1, 0.5*(start+end)*np.pi/180.)) + (angle,))
z = (X[i, :]/x[i].astype(float)) * (end - start)
ids = np.argsort(z)
z0 = start
for j in ids:
pos[(i, j)] = (z0, z0+z[j])
z0 += z[j]
start = end + pad
for i in range(len(x)):
start, end = arc[i]
IdeogramArc(start=start, end=end, radius=1.0, ax=ax, color=colors[i], width=width)
start, end = pos[(i,i)]
selfChordArc(start, end, radius=1.-width, color=colors[i], chordwidth=chordwidth*0.7, ax=ax)
for j in range(i):
color = colors[i]
if X[i, j] > X[j, i]:
color = colors[j]
start1, end1 = pos[(i,j)]
start2, end2 = pos[(j,i)]
ChordArc(start1, end1, start2, end2,
radius=1.-width, color=color, chordwidth=chordwidth, ax=ax)
#print(nodePos)
return nodePos
##################################
if __name__ == "__main__":
fig = plt.figure(figsize=(6,6))
flux = np.array([
[ 0, 1, 0, 0], #OS Sum:2 ; Centos, Ubuntu
[ 0, 0, 0, 0], #Plays
[ 0, 0, 0, 1], # Cluster: Sum5; Generic, M3, Monarch, SHPC, ACCS
[ 0, 0, 1, 2] #Cloud Sum3: AWS,Nimbus,Nectar
])
from numpy import genfromtxt
flux = genfromtxt('Chord_Diagramm - Sheet1.csv', delimiter=',')
ax = plt.axes([0,0,1,1])
#nodePos = chordDiagram(flux, ax, colors=[hex2rgb(x) for x in ['#666666', '#66ff66', '#ff6666', '#6666ff']])
nodePos = chordDiagram(flux, ax)
ax.axis('off')
prop = dict(fontsize=16*0.8, ha='center', va='center')
nodes = ['OS_Centos76','OS_Centos8','OS_Ubuntu1804','PLY_NFSSQL','PLY_MGMT','PLY_Login','PLY_Compute','C_Generic','C_M3','C_Monarch']
#nodes = ['M3_MONARCH','SHPC','Ubuntu','Centos7','Centos8','Tested','Security','Nectar','?AWS?','DGX@Baremetal','ML@M3','CVL@UWA','CVL_SW','CVL_Desktop','Strudel','/usr/local']
for i in range(len(nodes)):
ax.text(nodePos[i][0], nodePos[i][1], nodes[i], rotation=nodePos[i][2], **prop)
plt.savefig("Chord_Diagramm.png", dpi=600,transparent=False,bbox_inches='tight', pad_inches=0.02)
plt.show()
docs/images/ardc.png (8.87 KiB)
docs/images/massive-website-banner.png (22 KiB)
docs/images/monash-university-logo.png (8.27 KiB)
docs/images/university-of-western-australia-logo.png (22.9 KiB)
[SQLNodes]
sql1 ansible_host=192.168.0.1 ansible_user=ubuntu
[NFSNodes]
nfs11 ansible_host=192.168.0.2 ansible_user=ubuntu
[ManagementNodes]
mgmt1 ansible_host=192.168.0.3 ansible_user=ubuntu
mgmt2 ansible_host=192.168.0.4 ansible_user=ubuntu
[LoginNodes]
login1 ansible_host=192.168.0.5 ansible_user=ubuntu
[ComputeNodes]
compute1 ansible_host=192.168.0.6 ansible_user=ubuntu
\ No newline at end of file
---
-
hosts: openvpn-servers
remote_user: ec2-user
roles:
- easy-rsa-common
- easy-rsa-CA
- easy-rsa-certificate
- OpenVPN-Server
- nfs-server
sudo: true
vars:
x509_ca_server: vm-118-138-240-224.erc.monash.edu.au
-
hosts: openvpn-clients
remote_user: ec2-user
roles:
- easy-rsa-common
- easy-rsa-certificate
- OpenVPN-Client
- syncExports
- nfs-client
sudo: true
vars:
x509_ca_server: vm-118-138-240-224.erc.monash.edu.au
openvpn_servers: ['vm-118-138-240-224.erc.monash.edu.au']
nfs_server: "vm-118-138-240-224.erc.monash.edu.au"
- hosts: 'ComputeNodes,DGXRHELNodes'
gather_facts: false
tasks:
- include_vars: vars/ldapConfig.yml
- include_vars: vars/filesystems.yml
- include_vars: vars/slurm.yml
- include_vars: vars/vars.yml
- { name: set use shared state, set_fact: usesharedstatedir=False }
tags: [ never ]
# these are just templates. Note the tag never! Everything with never is only executed if called explicitly, i.e. ansible-playbook --tags=foo,bar OR --tags=tag_group
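# example invocation (the playbook filename here is an assumption): ansible-playbook -i inventory.yml maintenance.yml --tags=uniquetag_foo --limit=ComputeNodes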
- hosts: 'ComputeNodes,DGXRHELNodes'
gather_facts: false
tasks:
- { name: template_shell, shell: ls, tags: [never,tag_group,uniquetag_foo] }
- { name: template_command, command: uname chdir=/bin, tags: [never,tag_group,uniquetag_bar] }
- hosts: 'ComputeNodes,LoginNodes,DGXRHELNodes'
gather_facts: false
tasks:
- { name: kill user bash shells, shell: 'ps aux | grep -i -e bash -e vscode-server -e zsh -e tmux -e sftp-server -e trungn | grep -v -e "ec2-user" -e ubuntu -e philipc -e smichnow | grep -v "root" | sed "s/\ \ */\ /g" | cut -f 2 -d " " | xargs -I{} kill -09 {}', become: true, become_user: root, tags: [never,kickshells]}
- { name: Disable MonARCH Lustre Cron Check, cron: name="Check dmesg for lustre errors" state=absent,become_user: root,become: True ,tags: [never, monarch_disable] }
- name: Re-enable MonARCH Lustre Cron Check
cron: name="Check dmesg for lustre errors" minute="*/5" job="/usr/local/sbin/check_lustre_dmesg.sh >> /tmp/check_lustre_output.txt 2>&1"
become: true
become_user: root
tags: [never, monarch_enable ]
- hosts: 'ManagementNodes'
gather_facts: false
tasks:
- name: prep a mgmt node for shutdown (DO NOT FORGET TO LIMIT; gluster needs 2 out of 3 nodes to run)
block:
# the failover actually works. but it only takes down the primary. so if this would be called from the backup all of slurm would go down
#- { name: force a failover, shell: /opt/slurm-19.05.4/bin/scontrol takeover }
- { name: stop slurmdbd service, service: name=slurmdbd state=stopped }
- { name: stop slurmctld service, service: name=slurmctld state=stopped }
- { name: stop glusterd service, service: name=glusterd state=stopped }
- { name: stop glusterfsd service, service: name=glusterfsd state=stopped }
become: true
tags: [never,prepmgmtshutdown]
- name: verify a mgmt node came up well
block:
# TODO verify vdb is mounted
- { name: start glusterd service, service: name=glusterd state=started }
- { name: start glusterfsd service, service: name=glusterfsd state=started }
- { name: start slurmctld service, service: name=slurmctld state=started }
- { name: start slurmdbd service, service: name=slurmdbd state=started }
become: true
tags: [never,verifymgmtNode]
- hosts: 'SQLNodes'
gather_facts: false
tasks:
- name: prep a sqlnode node for shutdown
block:
- { name: stop mariadb service, service: name=mariadb state=stopped }
- { name: stop glusterd service, service: name=glusterd state=stopped }
- { name: stop glusterfsd service, service: name=glusterfsd state=stopped }
become: true
tags: [never,prepsqlshutdown]
- name: verify an sql node after a restart
block:
- { name: ensure mariadb service runs, service: name=mariadb state=started }
- { name: ensure glusterd service runs, service: name=glusterd state=started }
- { name: ensure glusterfsd service runs, service: name=glusterfsd state=started }
become: true
tags: [never,sqlverify]
- hosts: 'LoginNodes:!perfsonar01'
gather_facts: false
tasks:
- name: set nologin
block:
- include_vars: vars/slurm.yml
- { name: populate nologin file, shell: 'echo "{{ clustername }} is down for a scheduled maintenance." > /etc/nologin', become: true, become_user: root }
- { name: set attribute immutable so will not be deleted, shell: 'chattr +i /etc/nologin', become: true, become_user: root }
become: true
tags: [never,setnologin]
- name: remove nologin
block:
- { name: unset attribute immutable to allow deletion, shell: 'chattr -i /etc/nologin', become: true, become_user: root }
- { name: remove nologin file, file: path=/etc/nologin state=absent, become: true, become_user: root }
become: true
tags: [never,removenologin]
- name: terminate user ssh processes
block:
- { name: kill shells, shell: 'ps aux | grep -i bash | grep -v "ec2-user" | grep -v "root" | sed "s/\ \ */\ /g" | cut -f 2 -d " " | xargs -I{} kill -09 {}', become: true, become_user: root }
- { name: kill rsync sftp scp, shell: 'ps aux | egrep "sleep|sh|rsync|sftp|scp|sftp-server|sshd" | grep -v "ec2-user" | grep -v "root" | sed "s/\ \ */\ /g" | cut -f 2 -d " " | xargs -I{} kill -09 {}', become: true, become_user: root }
- { name: kill vscode, shell: 'pgrep -f vscode | xargs -I{} kill -09 {}', become: true, become_user: root, ignore_errors: true }
become: true
tags: [never,terminateusersshscprsync]
- hosts: 'LoginNodes,ComputeNodes,DGXRHELNodes,GlobusNodes'
gather_facts: false
tasks:
- name: stop lustre and disable service
block:
- { name: stop and disable lustre service, service: name=lustre-client enabled=False state=stopped }
become: true
tags: [never,stopdisablelustre]
- name: start lustre and enable service
block:
- { name: start and enable lustre service, service: name=lustre-client enabled=True state=started }
become: true
tags: [never,startenablelustre16Aug]
- hosts: 'ComputeNodes,LoginNodes,DGXRHELNodes'
gather_facts: false
tasks:
- { name: disable_lustre_service, service: name=lustre-client enabled=no, tags: [never,disable_lustre_service] }
- hosts: 'ComputeNodes,LoginNodes,DGXRHELNodes,ManagementNodes'
gather_facts: false
tasks:
- { name: umount /home, mount: path=/home state=unmounted, become: true, become_user: root, tags: [never,umount_home] }
#this should not really end up in the main branch, but it does not hurt if it does
- hosts: 'ComputeNodes,LoginNodes,DGXRHELNodes,ManagementNodes'
gather_facts: false
tasks:
- { name: umount local-legacy, mount: path=/usr/local-legacy state=absent, become: true, become_user: root, tags: [never,umount_locallegacy] }
#!/bin/sh
#
#mount | grep gvfs | while read -r line ;
#do
# read -ra line_array <<< $line
# echo "umount ${line_array[2]}"
#done
#un-stuck yum
#mv /var/lib/rpm/__db* /tmp/
#mv /var/lib/rpm/.rpm.lock /tmp/
#mv /var/lib/rpm/.dbenv.lock /tmp
#yum clean all
#- hosts: 'all'
#gather_facts: false # not sure if false is clever here
#tasks:
#- include_vars: vars/ldapConfig.yml
#- include_vars: vars/filesystems.yml
#- include_vars: vars/slurm.yml
#- include_vars: vars/vars.yml
#- { name: set use shared state, set_fact: usesharedstatedir=False }
#tags: [ always ]
# this playbook is roughly sorted by
# - hostgroupstopics like ComputeNodes or ComputeNodes,LoginNodes, last VisNodes
# - "tag_groups" each starting after a #comment see #misc or misc tag
- hosts: 'ComputeNodes'
gather_facts: false
tasks:
# these are just templates.
#Note the tag never! Everything with never is only executed if called explicitly, i.e. ansible-playbook --tags=foo,bar OR --tags=tag_group
- { name: template_shell, shell: ls, tags: [never,tag_group,uniquetag_foo] }
- { name: template_command, command: uname chdir=/bin, tags: [never,tag_group,uniquetag_bar] }
- { name: template_script, script: ./scripts/qa/test.sh, tags: [never,tag_group,uniquetag_script] }
#mpi stuff
- { name: run mpi on one computenode, command: ls, args: {chdir: "/tmp"} , failed_when: "TODO is TRUE", tags: [never,mpi,mpi_local,TODO] }
- { name: run mpi on two computenode, command: ls, args: {chdir: "/tmp"} , failed_when: "TODO is TRUE", tags: [never,mpi,mpi_local_two,TODO] }
#- { name: run mpi via sbatch, command: cmd=ls chdir="/tmp" , failed_when: "TODO is TRUE", tags: [never,mpi,slurm_mpi,TODO] }
#- { name: mpi_pinging, command: cmd=ls chdir="/tmp" , failed_when: "TODO is TRUE", tags: [never,mpi,mpi_ping,TODO] }
#module load openmpi/3.1.6-ucx;mpirun --mca btl self --mca pml ucx -x UCX_TLS=mm -n 24 /projects/pMOSP/mpi/parallel_mandelbrot/parallel/mandelbrot
#module load openmpi/3.1.6-ucx;srun mpirun --mca btl self --mca pml ucx -x UCX_TLS=mm -n 24 /projects/pMOSP/mpi/parallel_mandelbrot/parallel/mandelbrot
#slurm
- { name: slurmd should be running, service: name=slurmd state=started, tags: [never,slurm,slurmd] }
- { name: munged should be running, service: name=munged state=started, tags: [never,slurm,munged] }
- { name: ensure connectivity to the controller, shell: scontrol ping, tags: [never,slurm,scontrol_ping] }
- { name: the most simple srun test, shell: srun --reservation=AWX hostname, tags: [never,slurm,srun_hostname] }
#nhc, manually run nhc because it contains many tests
- { name: run nhc explicitly, command: /opt/nhc-1.4.2/sbin/nhc -c /opt/nhc-1.4.2/etc/nhc/nhc.conf, become: true , tags: [never,slurm,nhc] }
# networking
- { name: ping license server, shell: ls, tags: [never,network,ping_license] }
- { name: ping something outside monash, command: ping -c 1 8.8.8.8, tags: [never,network,ping_external] }
#mounts
- hosts: 'ComputeNodes,LoginNodes'
gather_facts: false
tasks:
- { name: check mount for usr_local, shell: "mount | grep -q local", tags: [never,mountpoints,mountpoints_local] }
- { name: check mount for projects, shell: "lfs df -h", tags: [never,mountpoints_projects] }
- { name: check mount for home, shell: "mount | grep -q home", tags: [never,mountpoints,mountpoints_home] }
- { name: check mount for scratch, shell: "mount | grep -q scratch" , tags: [never,mountpoints_scratch] }
#misc
- { name: check singularity, shell: module load octave && octave --version, tags: [never,misc,singularity3] }
- { name: module test, shell: cmd="module load gcc" executable="/bin/bash", tags: [never,misc,modulecmd] }
- { name: contact ldap, shell: maybe test ldapsearch, failed_when: "TODO is TRUE", tags: [never,misc,ldap,TODO] }
#gpu
- hosts: 'VisNodes'
gather_facts: false
tasks:
- { name: run nvidia-smi to see if a gpu driver is present, command: "/bin/nvidia-smi", tags: [never,gpu,smi] }
- { name: run gpu burn defaults to 30 seconds, command: "/usr/local/gpu_burn/1.0/run_silent.sh", tags: [never,gpu,long,gpuburn] }
# extended time-consuming tests
# relion see https://docs.massive.org.au/communities/cryo-em/tuning/tuning.html
# linpack
#module load openmpi/1.10.7-mlx;ldd /usr/local/openmpi/1.10.7-mlx/bin/* | grep -ic found
#!/usr/bin/python
import subprocess
import sys
def getTime():
print "How long do you think you need this computer for?"
print "If you need the computer for 2 days and 12 hours please enter as 2-12 or 2-12:00:00"
time=sys.stdin.readline().strip()
try:
(days,hours)=time.split('-')
except:
days=0
hours=time
try:
(hours,minues) = time.split(':')
except:
pass
return (days,hours)
def getNCPUs():
print "How many CPUs would you like?"
cpus=None
while cpus==None:
cpustr=sys.stdin.readline().strip()
try:
cpus=int(cpustr)
except:
print "Sorry I can't interpret %s as a number"%cpustr
print "How many CPUs would you like?"
return cpus
def getRAM():
print "How much RAM would you like (press enter for the default)?"
ramstr= sys.stdin.readline().strip()
while ramstr!=None and ramstr!="":
try:
ram=int(ramstr)
return ram
except:
print "Sorry I can't interpret %s as a number"%ramstr
print "How much RAM would you like?"
ramstr= sys.stdin.readline()
return None
def subjob(time,cpus,ram):
if ram==None:
ram=cpus*2000
import subprocess
scriptpath='/home/chines'
p=subprocess.Popen(['sbatch','--time=%s-%s'%(time[0],time[1]),'--nodes=1','--mincpu=%s'%cpus,'--mem=%s'%ram,'%s/mbpjob.sh'%scriptpath],stdout=subprocess.PIPE,stderr=subprocess.PIPE)
(stdout,stderr)=p.communicate()
import re
m=re.match('Submitted batch job (?P<jobid>[0-9]+)',stdout)
if m:
return m.groupdict()['jobid']
def isState(jobid,state='RUNNING'):
import re
p=subprocess.Popen(['scontrol','show','job','-d',jobid],stdout=subprocess.PIPE,stderr=subprocess.PIPE)
(stdout,stderr)=p.communicate()
jobidre=re.compile('JobId=(?P<jobid>[0-9]+)\s')
statere=re.compile('^\s+JobState=(?P<state>\S+)\s')
currentjobid=None
for l in stdout.splitlines():
m=jobidre.match(l)
if m:
currentjobid=m.groupdict()['jobid']
m=statere.match(l)
if m:
if m.groupdict()['state']==state:
if jobid==currentjobid:
return True
else:
if jobid==currentjobid:
return False
return False
def waitjob(jobid):
import time
while True:
if isState(jobid,'RUNNING'):
return
else:
print "job %s not running"%jobid
time.sleep(1)
def listJobs():
import re
r=[]
    user = subprocess.check_output(['whoami']).strip()
    jobs = subprocess.check_output(['squeue','-u',user,'-h','-o','%i %L %j %c'])
    jobre=re.compile("(?P<jobid>[0-9]+) (?P<time>\S+) (?P<jobname>\S+) (?P<cpus>[0-9]+)$")
    for l in jobs.splitlines():
        m=jobre.search(l)
if m:
r.append(m.groupdict())
return r
def getNode(jobid):
import re
stdout=subprocess.check_output(['scontrol','show','job','-d',jobid])
for l in stdout.splitlines():
m=re.search('^\s+Nodes=(?P<nodelist>\S+)\s',l)
if m:
nodes=m.groupdict()['nodelist'].split(',')
return nodes[0]
def createJob(*args,**kwargs):
time=getTime()
#cpus=getNCPUs()
cpus=1
#ram=getRAM()
ram=None
subjob(time,cpus,ram)
def selectJob(jobidlist):
if len(jobidlist)==1:
return jobidlist[0]['jobid']
else:
print "Please select a job (or press enter to cancel)"
i=1
print "\tJob name\tNum CPUs\tRemaining Time"
        for j in jobidlist:
            print "%s\t%s\t%s\t%s"%(i,j['jobname'],j['cpus'],j['time'])
            i=i+1
try:
jobnum=int(sys.stdin.readline().strip())
            if (jobnum>0 and jobnum<=len(jobidlist)):
return jobidlist[jobnum-1]['jobid']
except:
pass
return None
def connect(*args,**kwargs):
jobidlist=listJobs()
jobid=selectJob(jobidlist)
if jobid!=None:
waitjob(jobid)
node=getNode(jobid)
print node
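# stopjob is called by stop() below but is not defined anywhere in this script;
# a minimal sketch, assuming cancelling the allocation via scancel is the intended behaviour
def stopjob(jobid):
    subprocess.call(['scancel', jobid])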
def stop(*args,**kwargs):
jobidlist=listJobs()
jobid=selectJob(jobidlist)
if jobid!=None:
stopjob(jobid)
def main():
import argparse
parser = argparse.ArgumentParser()
subparser = parser.add_subparsers()
    start = subparser.add_parser('start', help='allocate a node to the user')
start.set_defaults(func=createJob)
    connect_parser = subparser.add_parser('connect')
    connect_parser.set_defaults(func=connect)
    stop_parser = subparser.add_parser('stop')
    stop_parser.set_defaults(func=stop)
args = parser.parse_args()
args.func(args)
try:
jobidlist=listJobs()
if len(jobidlist)>1:
print "cancel all jobs here"
jobidlist=listJobs()
if len(jobidlist)==0:
time=getTime()
#cpus=getNCPUs()
cpus=1
#ram=getRAM()
ram=None
subjob(time,cpus,ram)
jobidlist=listJobs()
if len(jobidlist)==1:
jobid=jobidlist[0]['jobid']
waitjob(jobid)
node=getNode(jobid)
print node
sys.exit(0)
except Exception as e:
print e
import traceback
print traceback.format_exc()
sys.exit(1)
main()
#!/bin/bash
mbpctrl='/home/hines/mbp_script/get_node.py'
node=$( $mbpctrl $1 )
if [[ $node ]]; then
ssh -t $node tmux attach-session
fi
---
- name: make sure /usr/local/bin exists
file: path=/usr/local/bin state=directory mode=755 owner=root
become: true
- name: install get_node.py
copy: src=get_node.py dest=/usr/local/bin/get_node.py mode=755 owner=root
become: true
- name: install mbp_node
copy: src=mbp_node dest=/usr/local/bin/mbp_node mode=755 owner=root
become: true
---
# This role fixes a misconfiguration of some OpenStack base images at Monash University:
# /dev/vdb is mounted in the image's fstab, but the OpenStack flavour does not provide a second disk.
- name: unmount vdb if absent
mount:
path: "/mnt"
src: "/dev/vdb"
state: absent
become: true
when: 'hostvars[inventory_hostname]["ansible_devices"]["vdb"] is not defined'
- name: keep mnt present
file:
path: "/mnt"
owner: root
group: root
mode: "u=rwx,g=rx,o=rx"
state: directory
become: true
when: 'hostvars[inventory_hostname]["ansible_devices"]["vdb"] is not defined'