Closed
Created Apr 03, 2020 by Andreas Hamacher (@handreas, Owner)

ansible restart improvements

Andreas H 12:32 PM is there a really good example of a node reboot somewhere in our ansible work?

lancew:koala: 12:37 PM I'd like to see one too. If there is anything wrong with lustre or its modules the reboot fails and requires a hard reboot from openstack.

chines 3:23 PM @Andreas H... not really. The old pattern looks like this

    - name: Restart host to remove nouveau module
      shell: "sleep 2 && shutdown -r now &"
      async: 1
      poll: 1
      become: true
      ignore_errors: true
      when: modules_result.stdout.find('nouveau') != -1

    - name: Wait for host to reboot
      local_action: wait_for host="{{ inventory_hostname }}" search_regex=OpenSSH port=22 delay=60 timeout=900

but I think the new pattern looks like

    - name: restart machine
      shell: "sleep 5; sudo shutdown -r now"
      async: 2
      poll: 1
      ignore_errors: true
      become: true
      become_user: root
      when: reboot_now

    - name: waiting for server to come back
      wait_for_connection: sleep=60 timeout=600 delay=60
      when: reboot_now

3:24 the async directive allows it to disconnect before the process completes. wait_for_connection is better than local_action wait_for

3:24 but as lance mentions the restart (shutdown -r now) will fail under some conditions
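For comparison, Ansible 2.7 and later ship a built-in `reboot` module that issues the shutdown and then waits for the host to come back, so the two-task pattern can be collapsed into one task. The following is only a sketch, not something taken from this repository: the `reboot_now` condition is borrowed from the pattern above, and the timeout values are illustrative assumptions.

```yaml
# Sketch only: values are assumptions, not tested against these nodes.
- name: Reboot the machine and wait for it to come back
  ansible.builtin.reboot:
    reboot_timeout: 600      # give up if the host is not back within 10 minutes
    post_reboot_delay: 60    # wait after the host is reachable before continuing
    test_command: whoami     # command that must succeed to count the host as up
  become: true
  when: reboot_now           # variable taken from the existing playbook pattern
```

Like `shutdown -r now`, this still relies on the node shutting down cleanly, so it does not by itself address the failure cases raised in this thread.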

lancew:koala: 3:45 PM Oh also forgot any hanging NFS file open/writes, which we have significant numbers of. I haven't been checking recently but I was tracking on the order of hundreds per day in the home folders
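One possible way to handle the case where a soft reboot hangs (Lustre modules, stuck NFS writes) is to wrap the reboot in a `block` with a `rescue` that performs the hard reboot from OpenStack automatically. This is a rough sketch under several assumptions that are not confirmed anywhere in this thread: the control node has the `openstack` CLI installed with working credentials, the OpenStack server name matches `inventory_hostname`, and a failed or hung shutdown surfaces as a timeout of the `reboot` task.

```yaml
# Sketch only: assumes the openstack CLI and credentials on the control node,
# and that the server name in OpenStack equals inventory_hostname.
- name: Reboot, falling back to a hard reboot from OpenStack
  when: reboot_now
  block:
    - name: Try a normal reboot first
      ansible.builtin.reboot:
        reboot_timeout: 600
      become: true
  rescue:
    - name: Hard reboot the instance from the control node
      ansible.builtin.command: openstack server reboot --hard {{ inventory_hostname }}
      delegate_to: localhost
      become: false

    - name: Wait for the host to come back after the hard reboot
      ansible.builtin.wait_for_connection:
        delay: 60
        timeout: 600
```

A hard reboot discards any in-flight writes, so whether this is acceptable for nodes with hanging NFS writes in the home folders is a judgment call for the team.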
