We recommend reviewing two documents published by Slurm, as they are very useful when running Posit Workbench's Job Launcher service with the Slurm integration:
Question: How do I verify the Slurm cluster functionality?
Answer: To verify that your Slurm cluster is functional and accepting/running jobs, you can perform the pre-flight configuration checks documented in the steps for Configuring RStudio Workbench with Launcher and Slurm.
Question: How do I verify Posit Workbench (previously RStudio Workbench/RStudio Server Pro) with Launcher and Slurm?
Answer: Run the following commands to test the installation and configuration of Posit Workbench with Launcher and Slurm:
sudo rstudio-server stop
sudo rstudio-server verify-installation --verify-user=<USER>
sudo rstudio-server start
Replace <USER> with the username of a valid user who is set up to run Posit Workbench in your installation.
Refer to the Troubleshooting section in the Posit Workbench Administration Guide for more information on using the Launcher verification tool.
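The three commands above can be wrapped so that the server is started again even when verification fails. This is a minimal sketch, not part of Posit's tooling: the function name verify_workbench and the RSTUDIO_SERVER override variable are assumptions introduced here for illustration, and the sketch assumes verify-installation exits non-zero on failure.

```shell
# Hypothetical helper (not shipped with Posit Workbench).
# RSTUDIO_SERVER is an assumed override so the command can be swapped out;
# it is left unquoted below on purpose so "sudo rstudio-server" splits into words.
RSTUDIO_SERVER="${RSTUDIO_SERVER:-sudo rstudio-server}"

verify_workbench() {
  vw_user="$1"
  vw_rc=0
  if [ -z "$vw_user" ]; then
    echo "usage: verify_workbench <USER>" >&2
    return 2
  fi
  $RSTUDIO_SERVER stop || return 1
  # Remember the verification result instead of aborting here.
  $RSTUDIO_SERVER verify-installation --verify-user="$vw_user" || vw_rc=$?
  # Start the server again whether or not verification succeeded.
  $RSTUDIO_SERVER start || return 1
  return "$vw_rc"
}
```

Calling verify_workbench <USER> then runs the same stop, verify, start sequence and returns the verification exit status.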
Question: Where are the logs stored for Posit Workbench and Launcher?
Answer: Depending on your installation, the logs for Posit Workbench and Launcher can be found at:
Posit Workbench 2021.09+
For earlier versions of Posit/RStudio Workbench/RStudio Server Pro:
You can inspect these logs for errors after attempting to launch a session or job on Slurm.
Question: How does Posit Workbench use Slurm?
Answer: The Posit Workbench Slurm Launcher Plugin uses the Slurm command line tools to control the Slurm cluster. Commands are run as either:
- the user starting or viewing the job
- slurm-service-user (configured in launcher.slurm.conf)
slurm-service-user should be a user that has administrative access to the Slurm cluster - they should be able to see job details for all users.
Question: What does the Posit Workbench Launcher Host require?
Answer: The Posit Workbench Launcher Host requires:
- the same slurm.conf file as the desired Slurm cluster
- network access to all of the Slurm compute nodes and the control node
- the Slurm command line tools installed (please see below)
- file-sharing configured with compute nodes
- It is not necessary to have slurmctld (the Slurm Control Daemon) or slurmd (the Slurm Compute Daemon) running on the Posit Workbench Launcher Host to make the configuration work. It is necessary to have slurmctld running on the Slurm Control Node, and slurmd running on at least one Slurm Compute Node, for everything to work. Note: if you are using an authentication plugin (an add-on to Slurm that manages user authentication across the cluster), that plugin does have to be installed and running on the Launcher Host. The Slurm Quick Start Administrator Guide refers to MUNGE, which is the recommended authentication plugin.
- To get an idea of Slurm's architecture, please see the diagram in Slurm's Quick Start User Guide.
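One of the requirements above, having the Slurm command line tools installed, can be checked quickly from a shell on the Launcher Host. The sketch below is illustrative only: missing_tools is a hypothetical helper name, and it checks only whether each tool is on the PATH, not whether it can reach the cluster.

```shell
# Hypothetical helper: print every named tool that is NOT on PATH.
# Returns non-zero if at least one tool is missing.
missing_tools() {
  mt_missing=0
  for mt_tool in "$@"; do
    if ! command -v "$mt_tool" >/dev/null 2>&1; then
      echo "$mt_tool"
      mt_missing=1
    fi
  done
  return "$mt_missing"
}

# Example call on the Launcher Host (tool names taken from this article):
#   missing_tools sinfo sbatch scontrol squeue sstat
```

No output means every listed tool was found.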
Question: What are some of the common Slurm command line tools mentioned above and where do I run these?
Answer: The common commands are listed below. Run them from the Posit Workbench Launcher Host. Do not run them as root: run them either as the user experiencing issues or as the slurm-service-user. Unless otherwise specified, run them as the user experiencing issues.
- sinfo - used to list queues/partitions for the New Session and Run Script dialogs. Good for checking general connectivity.
- sinfo --format=%R --noheader - used to check the current queues/partitions
- sbatch - used to submit jobs
- scontrol show job - used to view the configuration and state of jobs
- scontrol show job [job id] - used to view the configuration and state of a specific job
- squeue - used to get job status updates (note that this command is always run by the slurm-service-user)
- sstat - used to get resource utilization metrics (note that this command is always run by the slurm-service-user). This command requires that Job Account Gathering is enabled in slurm.conf
- tail -f - used to stream job output data
- For more information on the Slurm command line tools, please view Slurm's cheat sheet.
Question: Do you have general guidance on troubleshooting issues with the Slurm Launcher Plugin?
Answer: Start by looking for errors in the output of sudo rstudio-launcher status. If there are no errors, or if they are vague, enable debug logging and check /var/lib/rstudio-launcher/Slurm/rstudio-slurm-launcher.log. This documentation and this FAQ also have useful information.
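To scan the plugin log for problems, a simple filter is often enough. The following is a rough sketch: recent_errors and LAUNCHER_LOG are hypothetical names, and matching on the word "error" is an assumption about the log's wording rather than a documented log format.

```shell
# Log path as given in this article; override LAUNCHER_LOG if yours differs.
LAUNCHER_LOG="${LAUNCHER_LOG:-/var/lib/rstudio-launcher/Slurm/rstudio-slurm-launcher.log}"

# Hypothetical helper: print the last N lines (default 20) that mention
# "error" in any letter case. Assumes errors are logged with that word.
#   $1 = number of lines to show, $2 = log file (defaults to LAUNCHER_LOG)
recent_errors() {
  grep -i 'error' "${2:-$LAUNCHER_LOG}" | tail -n "${1:-20}"
}
```

For example, recent_errors 50 shows up to the 50 most recent matching lines. Reading the log may require elevated permissions, depending on your installation.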
Question: How do I troubleshoot a version warning?
Answer: You may see version warnings if you are not using the only supported Slurm version (20.02.x). We recommend checking rstudio-slurm-launcher.log for errors parsing Slurm commands.
Question: How do I troubleshoot startup failures? The Slurm Launcher Plugin does not seem to be working.
- Is the Slurm cluster running?
- If no, start the Slurm cluster and try again. If the Slurm cluster is still not running, we recommend checking the SlurmctldLogFile and SlurmdLogFile (both configured in slurm.conf) for errors.
- Are the Slurm command line tools installed on the Posit Workbench Launcher Host?
- If no, please install the Slurm command line tools by following the Slurm Quick Start Administrator Guide.
- If the Slurm cluster is running and the Slurm command line tools are installed, is the output of running sinfo from the Posit Workbench Launcher Host correct?
- It is recommended to double-check that the slurm.conf on the Posit Workbench Launcher Host is the same as the slurm.conf on the desired Slurm Cluster. If it is not, update the Posit Workbench Launcher slurm.conf and then have the user try again.
- Can the DNS names and/or IP addresses of the Slurm nodes be resolved? Try running ping <slurm control node hostname> from the Posit Workbench Launcher Host. If this fails, we suggest updating your /etc/hosts as necessary.
- If yes and you continue to have problems, we'd recommend contacting Posit Support.
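The slurm.conf comparison in the checklist above can be done byte for byte once a copy of the cluster's file is available on the Launcher Host. This is a minimal sketch: same_slurm_conf is a hypothetical helper, and the file locations in the usage comment are assumptions, since the slurm.conf path depends on how Slurm was installed.

```shell
# Hypothetical helper: compare two slurm.conf files byte for byte.
# Prints "identical" or "different"; returns non-zero when they differ.
same_slurm_conf() {
  if cmp -s "$1" "$2"; then
    echo "identical"
  else
    echo "different"
    return 1
  fi
}

# Example (paths are assumptions; adjust to your installation):
#   scp <slurm control node hostname>:/etc/slurm/slurm.conf /tmp/cluster-slurm.conf
#   same_slurm_conf /etc/slurm/slurm.conf /tmp/cluster-slurm.conf
```

If the files differ, running diff on the same two paths shows which settings diverge.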
Question: How do I troubleshoot missing queues/partitions?
Answer: If queues/partitions are missing from any of the job launcher dialogs, check the output of sinfo --format=%R --noheader, run as any user experiencing the problem. If the list is wrong or not what you expect, investigate the Slurm configuration. If the list is correct, please contact Posit Support.
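To check for one specific partition rather than reading the whole list, the sinfo output can be piped through an exact-match filter. This is a small sketch: has_partition is a hypothetical helper name, and the partition name gpu in the usage comment is only an example.

```shell
# Hypothetical helper: read partition names (one per line) on stdin and
# succeed only if one of them exactly equals the name given as $1.
has_partition() {
  grep -Fxq -- "$1"
}

# Example, run as the user experiencing the problem ("gpu" is a made-up name):
#   sinfo --format=%R --noheader | has_partition gpu && echo "partition visible"
```

The -F and -x options make grep match the whole line literally, so a partition named gpu does not match gpu-long.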
Question: What should I do if I have Job or Session failures?
- To the Slurm Launcher Plugin, a session is just a job
- When you run scontrol show job, does the job appear?
- No - the errors should be in the Slurm Launcher Plugin log file
- Yes - the errors should be in the job error output (see below)
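When the job does appear, its state can be pulled out of the scontrol output to see how far it got. The following is a sketch under the assumption that the output contains a JobState=<STATE> field, as current Slurm releases print; job_state is a hypothetical helper name.

```shell
# Hypothetical helper: read `scontrol show job <job id>` output on stdin
# and print the value of the first JobState= field found.
# Prints nothing if no such field is present.
job_state() {
  sed -n 's/.*JobState=\([A-Z_]*\).*/\1/p' | head -n 1
}

# Example:
#   scontrol show job <job id> | job_state
```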
Question: How do I troubleshoot a job status that is not updating?
- Not all Slurm job states are reflected as separate RStudio Job Statuses
- Has the job status actually changed? Check the output of squeue --states=all --Format=jobid:10,name:75,username,state, run as the slurm-service-user. If the answer to this is yes, please contact Posit Support.
- Below is an outline of the mapping Posit put in place between RStudio Job Status and Slurm Job State
Question: There is no job output, how do I fix this?
- Can the job output file be reached? Try running
ls -l <StdOut or StdErr path>
- If yes, what does the output of cat <StdOut or StdErr path> look like?
- If both of those look normal, we'd recommend contacting Posit Support.
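The first two checks above can be combined into one quick classification of the output file. This is an illustrative sketch: output_file_status is a hypothetical helper, and the path you pass in is whatever StdOut or StdErr path Slurm reports for the job.

```shell
# Hypothetical helper: classify a job output file as
# missing, unreadable, empty, or ok.
output_file_status() {
  if [ ! -e "$1" ]; then
    echo "missing"
  elif [ ! -r "$1" ]; then
    echo "unreadable"
  elif [ ! -s "$1" ]; then
    echo "empty"
  else
    echo "ok"
  fi
}

# Example:
#   output_file_status <StdOut or StdErr path>
```

A result of missing or unreadable on the Launcher Host usually points at file sharing or permissions rather than at the job itself.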
Question: I can't enter a Session, how do I fix this?
Answer: The steps below assume the session status is shown as idle on the Posit Workbench Home Page.
- Can the session job output be read? Try checking the job details page.
- Can all Slurm compute nodes be reached by the Posit Workbench Host?
- Is there a firewall preventing a connection over the Session Port (a random port from the ephemeral port range)?
- If the answers to the above are Yes, Yes, and No, the next step is to diagnose the session issues as you would without the Launcher.
Question: Why am I not seeing any resource metrics?
- Is Slurm's Job Account Gathering feature enabled? If not, please view Slurm's jobacct_gather plugin configuration to get it configured.
- Is the resource metric data printed when sstat --format=AveCPU,AveVMSize,AveRSS is run as the slurm-service-user?
Question: What do these log entries mean?