Simon Michnowicz authored
d36681e0

HPCasCode


  1. Introduction / Purpose
  2. Getting started
  3. Features
  4. Assumptions
  5. Configuration
  6. Contribute and Collaborate
  7. Used by
  8. CICD Coverage
  9. Roadmap

Introduction / Purpose TODO

The purpose of this repository is to deploy High Performance Computing [HPC] systems using Infrastructure-as-Code [IaC] principles, predominantly with Ansible. The aim is to follow IaC principles and adapt them to HPC systems.

Encoding the system state as code brings the following advantages, which also define the values of this project:

  • collaboration: it is simply easier to share code than systems. This repository also aims to serve as a place for backup and discussions, and in the near future even documentation :)
  • redeployability: if a system follows the same build recipes and the deployment is automated, all installations should be similar, even Test and Production. You will also hear the term immutable infrastructure for this.
  • CICD automation: test and change automation allow us to put safeguards in place, increasing our quality and easing the burden of change management.
  • modular and reusable: for example, we currently support two operating systems as well as bare-metal and OpenStack-based deployments. AWS support is also in scope.

The scope of this repository starts with an Ansible inventory (although OpenStack and later AWS will be supported) and ends with a GPU-powered desktop provisioning system on top of the resource scheduler Slurm. Services such as an identity system or package repository management are in scope but are future work.

Getting Started TODO

OK, all caps incoming: TAKE CARE OF CICD/vars/passwords.yml. Ideally, encrypt it with ansible-vault as soon as you start modifying that file. Given the current state of the documentation, please feel free to reach out and ask for clarification. This project is far from finished, let alone perfect!
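As a small safety net for the warning above, the following is a minimal sketch of a check that a passwords file has been encrypted before it is committed. It relies only on the fact that files encrypted by ansible-vault start with a `$ANSIBLE_VAULT;` header line. The function name `is_vault_encrypted` is hypothetical and not part of this repository.

```python
# Header prefix that ansible-vault writes on the first line of an encrypted file.
VAULT_MARKER = "$ANSIBLE_VAULT;"


def is_vault_encrypted(path):
    """Return True if the file at `path` starts with the ansible-vault header.

    Hypothetical helper for illustration; it only inspects the first line and
    does not verify that the payload can actually be decrypted.
    """
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        return handle.readline().startswith(VAULT_MARKER)
```

A check like `is_vault_encrypted("CICD/vars/passwords.yml")` could be run from a pre-commit hook or a CI job, failing the run when it returns False.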

Features

  • Currently supports CentOS 7, CentOS 8 and Ubuntu 18.04
  • CICD tested, including spawning a cluster and running MPI and Slurm tests
  • Rolling node updates are currently a work in progress.
  • Coming up: strudel2 desktop integration

Assumptions

  • The Ansible inventory file needs to define the following node types: 1x [SQLNodes], 1x [NFSNodes], 2x [ManagementNodes], >=0 LoginNodes, >=0 ComputeNodes, >=0 VisNodes (GPU nodes). SQL and NFS can be combined. The ManagementNodes manage Slurm access and job submission. Here is an example
  • The filesystem layout assumes the following shared drives: /home, /projects for project data, /scratch for fast compute, and /usr/local for provided software; see here
  • A populated vars folder, as in the CICD subfolder (TODO: rewrite). See Configuration for details, and apologies
  • The software stack is currently provided as a mounted shared drive. Software can be loaded via Environment-Modules. Containerisation is also heavily in use. Please reach out for further details and sharing requests.
  • Networking is partially defined at this level and partially at the bare-metal level. For now, this is a point to reach out and discuss. Current implementations support either public-facing LoginNodes with the rest in a private network behind a NAT gateway, or a private InfiniBand network in conjunction with a 1G network.
  • A shared SSH key for a management user
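The inventory assumption above can be checked mechanically. The following is a minimal sketch, assuming an INI-style inventory with one host per line and assuming that "1x" and "2x" mean exactly one and exactly two hosts. The names `EXPECTED_GROUPS`, `count_hosts` and `check_inventory` are hypothetical and not part of this repository; host range patterns are counted as a single line.

```python
import re

# Minimum and maximum host counts per inventory group, taken from the
# Assumptions list (None means "no upper bound"). Treating "1x" and "2x"
# as exact counts is an assumption of this sketch.
EXPECTED_GROUPS = {
    "SQLNodes": (1, 1),
    "NFSNodes": (1, 1),
    "ManagementNodes": (2, 2),
    "LoginNodes": (0, None),
    "ComputeNodes": (0, None),
    "VisNodes": (0, None),
}


def count_hosts(inventory_text):
    """Count host lines per group in an INI-style inventory string."""
    counts = {}
    current = None
    for raw in inventory_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        header = re.fullmatch(r"\[([^\]:]+)(:\w+)?\]", line)
        if header:
            # Sections such as [group:vars] or [group:children] hold no hosts.
            current = None if header.group(2) else header.group(1)
            if current is not None:
                counts.setdefault(current, 0)
            continue
        if current is not None:
            counts[current] += 1
    return counts


def check_inventory(inventory_text):
    """Return a list of human-readable problems; an empty list means OK."""
    counts = count_hosts(inventory_text)
    problems = []
    for group, (low, high) in EXPECTED_GROUPS.items():
        found = counts.get(group, 0)
        if found < low or (high is not None and found > high):
            wanted = f"exactly {low}" if low == high else f"at least {low}"
            problems.append(f"{group}: found {found} host(s), expected {wanted}")
    return problems
```

Because SQL and NFS can be combined, the same host may appear under both [SQLNodes] and [NFSNodes]; the sketch counts hosts per group, so a combined node still satisfies both entries.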

Configuration

Configuration is defined in the variables (vars) in the vars folder. See CICD/vars. These vars have grown as this set of Ansible roles was developed, first for a single system and then for multiple systems, and they continue to evolve with the necessary refactoring and abstraction. Currently, the best way to understand a variable is to search (or grep) this repository for VARIABLENAME and see how it is used. This is not pretty for external use. Let's not even pretend it is.

  • filesystems.yml covers filesystem types and exports/mounts
  • ldapConfig.yml covers identity via LDAP
  • names.yml stores the domain name
  • passwords.yml contains passwords in a single file, to be encrypted via ansible-vault
  • slurm.yml contains all Slurm variables
  • vars.yml contains variables which don't belong somewhere else
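The "search or grep" advice above can be sketched in a few lines of Python for readers who prefer a script over the command line. The function name `find_variable_usage` and the default list of file suffixes are assumptions made for illustration, not something this repository provides.

```python
import re
from pathlib import Path


def find_variable_usage(root, variable_name, suffixes=(".yml", ".yaml", ".j2")):
    """List (path, line number, line) for every whole-word match of a variable.

    Hypothetical helper: it walks `root` recursively and only reads files
    whose suffix is in `suffixes` (an assumed set of Ansible-related types).
    """
    pattern = re.compile(r"\b" + re.escape(variable_name) + r"\b")
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if pattern.search(line):
                hits.append((str(path), number, line.strip()))
    return hits
```

Calling `find_variable_usage(".", "VARIABLENAME")` from the repository root returns every place the variable is defined or referenced, which is usually enough to infer its purpose.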

How do I contribute or collaborate

  • Get in contact and use the issue tracker. If you want to contribute documentation, code or anything else, we offer handholding for the first merge request.
  • A great first step is to get in contact, tell us what you want to know and help us improve the documentation
  • Please contact us via andreas.hamacher(at)monash.edu or help(at)massive.org.au
  • Contribution guidelines would also be a good contribution :)

Used by:

Monash University, MASSIVE, Australian Research Data Commons, University of Western Australia

CI Coverage

  • CentOS 7.8, CentOS 8, Ubuntu 18.04
  • All node types as outlined in Assumptions
  • vars for Massive, Monarch and a Generic Cluster ( see files in CICD/vars )
  • CD in progress using autoupdate.py

Roadmap

  • This section will soon be moved into the Milestones on GitLab.
  • Desktop integration using Strudel2. Contributors are welcome to integrate OpenOnDemand
  • CVL integration, see github.com/Characterisation-Virtual-Laboratory
  • Automated Security Checks as part of the CICD pipeline
  • Integration of a FOSS identity system; currently only a token LDAP is supported
  • System status monitoring and alerting

A nice read: "Infrastructure as Code DevOps principle: meaning, benefits, use cases"