adding a script to catch a generic nvidia-smi error
this should catch the issue seen on the m3a nodes, and similar issues in the future
tested on m3p006 (good case) and m3a101 (bad case)