We recommend reviewing two documents published by Slurm, as they are very useful for successfully running Posit Workbench's Job Launcher service with the Slurm integration: the Slurm Quick Start Administrator Guide and the Slurm Quick Start User Guide.
Question: How do I verify the Slurm cluster functionality?
Answer: To verify that your Slurm cluster is functional and accepting/running jobs, you can perform the pre-flight configuration checks documented in the steps for Configuring Posit Workbench with Launcher and Slurm.
Question: How do I verify Posit Workbench (previously RStudio Workbench/RStudio Server Pro) with Launcher and Slurm?
Answer: Run the following commands to test the installation and configuration of Posit Workbench with Launcher and Slurm:
sudo rstudio-server stop
sudo rstudio-server verify-installation --verify-user=<USER>
sudo rstudio-server start
Replace <USER> with the username of a user who is set up to run Posit Workbench in your installation.
Refer to the Troubleshooting section in the Posit Workbench Administration Guide for more information on using the Launcher verification tool.
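The three commands above can be wrapped so that the service is started again even when verification fails. The following is a minimal sketch, assuming bash; the function name verify_workbench is hypothetical, and only the three rstudio-server commands come from this article.

```bash
# Hypothetical helper: run the Launcher verification for one user and
# always restart Posit Workbench afterwards, even if verification fails.
verify_workbench() {
    local user="$1"
    if [ -z "$user" ]; then
        echo "usage: verify_workbench <USER>" >&2
        return 2
    fi

    sudo rstudio-server stop || return 1

    # Record the verification result without returning early, so the
    # service is started again in every case.
    local rc=0
    sudo rstudio-server verify-installation --verify-user="$user" || rc=$?

    sudo rstudio-server start || return 1
    return "$rc"
}
```

Calling verify_workbench with a valid username returns the exit status of the verification step, so it can be used in a larger maintenance script.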
Question: Where are the logs stored for Posit Workbench and Launcher?
Answer: Depending on your installation, the logs for Posit Workbench and Launcher can be found at:
For Posit Workbench 2021.09 and later:
/var/log/rstudio/rstudio-server/
/var/log/rstudio/launcher/
/var/log/rstudio/launcher/Slurm/
For earlier versions of Posit/RStudio Workbench/RStudio Server Pro:
/var/lib/rstudio-server/monitor/log/rstudio-server.log
/var/lib/rstudio-launcher/rstudio-launcher.log
/var/lib/rstudio-launcher/Slurm/rstudio-slurm-launcher.log
You can inspect these logs for errors after attempting to launch a session or job on Slurm.
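To scan these locations in one pass after a failed launch, something like the following could be used. This is only a sketch: the function name scan_launcher_logs is hypothetical, and the search pattern (error or fail, case-insensitive) is an assumption, since the log format is not described here.

```bash
# Hypothetical helper: print the last 50 lines that look like errors from
# the files under each given log directory. The "error|fail" pattern is an
# assumption about the log wording, not a documented format.
scan_launcher_logs() {
    local dir
    for dir in "$@"; do
        if [ ! -d "$dir" ]; then
            echo "skipping $dir (not found)" >&2
            continue
        fi
        # -r: recurse, -i: ignore case, -E: extended regex,
        # -H: always prefix each match with its file name.
        grep -r -i -E -H 'error|fail' "$dir" | tail -n 50
    done
    return 0
}

# Example with the Posit Workbench 2021.09+ locations listed above:
# scan_launcher_logs /var/log/rstudio/rstudio-server /var/log/rstudio/launcher
```

Reading these directories may require elevated permissions, depending on how the logs are owned on your system.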
Question: How does Posit Workbench use Slurm?
Answer: The Posit Workbench Slurm Launcher Plugin uses the Slurm command line tools to control the Slurm cluster. Commands are run as either:
- the user starting or viewing the job
- the slurm-service-user (configured in launcher.slurm.conf)
Note: The slurm-service-user should be a user with administrative access to the Slurm cluster; it should be able to see job details for all users.
Question: What does the Posit Workbench Launcher Host require?
Answer: The Posit Workbench Launcher Host requires:
- the same slurm.conf file as the desired Slurm cluster
- network access to all of the Slurm compute nodes and the control node
- the Slurm command line tools installed (please see below)
- file-sharing configured with compute nodes
- It is not necessary to have slurmctld (the Slurm Control Daemon) or slurmd (the Slurm Compute Daemon) running on the Posit Launcher Host to make the configuration work. It is necessary to have slurmctld running on the Slurm Control Node, and slurmd running on at least one Slurm Compute Node, for everything to work. Note: If you are using an authentication plugin (an add-on to Slurm that manages user authentication across the cluster), that plugin does have to be installed and running. The Slurm Quick Start Administrator Guide refers to MUNGE, which is the recommended authentication plugin.
- To get an idea of Slurm's architecture, please see the diagram in Slurm's Quick Start User Guide.
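A quick way to confirm the command line tools requirement is to check that each tool resolves on the PATH of the Launcher Host. The following is a minimal sketch; missing_tools is a hypothetical name, and the tool list in the example is taken from the commands named in this article.

```bash
# Hypothetical helper: print the name of every given command that cannot
# be found on PATH. Returns 0 when all are present, 1 otherwise.
missing_tools() {
    local tool rc=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            rc=1
        fi
    done
    return "$rc"
}

# Example, using the Slurm tools this article refers to:
# missing_tools sinfo sbatch scontrol squeue sstat
```

Run it as the user experiencing issues, since PATH can differ between accounts.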
Question: What are some of the common Slurm command line tools mentioned above and where do I run these?
Answer: See below for the common commands. Run them from the Posit Launcher Host machine. Do not run them as root; run them either as the user experiencing issues or as the slurm-service-user. Unless otherwise specified, run them as the user experiencing issues.
- sinfo - used to list queues/partitions for the New Session and Run Script dialogs. Good for checking general connectivity.
- sinfo --format=%R --noheader - used to check the current queues/partitions
- sbatch - used to submit jobs
- scontrol show job - used to view and modify configuration and state
- scontrol show job [job id] - used to view and modify the configuration and state of a specific job
- squeue - used to get job status updates (note that this command is always run by the slurm-service-user)
- sstat - used to get resource utilization metrics (note that this command is always run by the slurm-service-user). This command requires that Job Account Gathering is enabled in slurm.conf
- tail -f - used to stream job output data
For more information on the Slurm command line tools, please view Slurm's cheat sheet.
Question: Do you have general guidance on troubleshooting issues with the Slurm Launcher Plugin?
Answer: Start by looking for errors in the output of sudo rstudio-launcher status. If there are no errors, or if they are vague, enable debug logging and check /var/lib/rstudio-launcher/Slurm/rstudio-slurm-launcher.log. This documentation and this FAQ also have useful information.
Question: How do I troubleshoot a version warning?
Answer: You may see version warnings if you are not using one of our supported Slurm versions. See the Slurm documentation for a complete list. We recommend checking rstudio-slurm-launcher.log for errors parsing the output of Slurm commands.
Question: How do I troubleshoot startup failures? The Slurm Launcher Plugin does not seem to be working.
Answer:
- Is the Slurm cluster running?
- If no, start the Slurm cluster and try again. If the Slurm cluster is still not running, we recommend checking the SlurmctldLogFile and SlurmdLogFile (both configured in slurm.conf) for errors.
- Are the Slurm command line tools installed on the Posit Workbench Launcher Host?
- If no, please install the Slurm command line tools using the Slurm Quick Start Administrator Guide.
- If the Slurm cluster is running and the Slurm command line tools are installed, is the output of running sinfo from the Posit Workbench Launcher Host correct?
- We recommend double-checking that the slurm.conf on the Posit Workbench Launcher Host is the same as the slurm.conf on the desired Slurm cluster. If it is not, update the slurm.conf on the Posit Workbench Launcher Host and then have the user try again.
- Can the DNS names and/or IP addresses of the Slurm nodes be resolved? Try running ping <slurm control node hostname> from the Posit Workbench Launcher Host. If this fails, we suggest updating your /etc/hosts as necessary.
- If yes and you continue to have problems, we recommend contacting Posit Support.
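The slurm.conf comparison recommended above can be done byte-for-byte once a copy of the cluster's file is available on the Launcher Host. The following is a sketch, assuming you have already copied the control node's file locally; same_slurm_conf is a hypothetical name, and both paths in the example are placeholders.

```bash
# Hypothetical helper: succeed only if the two slurm.conf files are
# byte-for-byte identical; otherwise show how they differ.
same_slurm_conf() {
    local local_conf="$1" cluster_conf="$2"
    # cmp -s is silent and exits non-zero if the files differ or
    # either one cannot be read.
    if cmp -s "$local_conf" "$cluster_conf"; then
        echo "slurm.conf files match"
        return 0
    fi
    echo "slurm.conf files differ (or one could not be read):"
    diff "$local_conf" "$cluster_conf"
    return 1
}

# Example (placeholder paths):
# same_slurm_conf /etc/slurm/slurm.conf /tmp/slurm.conf.from-control-node
```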
Question: How do I troubleshoot missing queues/partitions?
Answer: If queues/partitions are missing from any of the job launcher dialogs, check the output of sinfo --format=%R --noheader, run as any user experiencing the problem. If the list is wrong or unexpected, investigate the Slurm configuration. If the list is correct, please contact Posit Support.
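To compare the partitions Slurm reports against the ones you expect to see in the dialogs, the sinfo output can be piped into a small filter. The following is a sketch; missing_partitions is a hypothetical name, and it assumes the command prints one partition name per line.

```bash
# Hypothetical helper: read partition names (one per line) on stdin and
# print every expected partition, given as arguments, that is absent.
# Returns 0 when none are missing, 1 otherwise.
missing_partitions() {
    local reported expected rc=0
    # Strip trailing whitespace so padded output still matches.
    reported=$(sed 's/[[:space:]]*$//')
    for expected in "$@"; do
        # -F: fixed string, -x: whole-line match, -q: quiet.
        if ! printf '%s\n' "$reported" | grep -F -x -q -- "$expected"; then
            echo "$expected"
            rc=1
        fi
    done
    return "$rc"
}

# Example (partition names are placeholders):
# sinfo --format=%R --noheader | missing_partitions general gpu
```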
Question: What should I do if I have Job or Session failures?
Answer:
- To the Slurm Launcher Plugin, a session is just a job
- Run scontrol show job. Does the job appear?
- No - the errors should be in the Slurm Launcher Plugin log file
- Yes - the errors should be in the job error output (see below)
Question: How do I troubleshoot a job status that is not updating?
Answer:
- Not all Slurm job states are reflected as separate RStudio Job Statuses
- Has the job status actually changed? Check the output of squeue --state=all --Format=jobid:10,name:75,username,state. Run this as the slurm-service-user. If the answer to this is yes, please contact Posit Support.
- Below is the idea of the mapping Posit put in place between RStudio Job Status and Slurm Job State
Question: There is no job output, how do I fix this?
Answer:
- Can the job output file be reached? Try running ls -l <StdOut or StdErr path>
- If yes, what does cat <StdOut or StdErr path> look like?
- If both of those look normal, we recommend contacting Posit Support.
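The two checks above can be combined into one helper that reports whether the output file exists, is readable by the current user, and has any content. The following is a sketch; check_job_output is a hypothetical name, and the argument is the StdOut or StdErr path reported for the job.

```bash
# Hypothetical helper: report why a job output file cannot be used, or
# show its listing and last 20 lines when it looks normal.
check_job_output() {
    local path="$1"
    if [ ! -e "$path" ]; then
        echo "missing: $path"
        return 1
    fi
    if [ ! -r "$path" ]; then
        echo "not readable by $(id -un): $path"
        return 1
    fi
    if [ ! -s "$path" ]; then
        echo "empty: $path"
        return 1
    fi
    ls -l "$path"
    tail -n 20 "$path"
}

# Example:
# check_job_output "<StdOut or StdErr path>"
```

A "missing" result on the Launcher Host for a file that exists on a compute node usually points back to the file-sharing requirement listed earlier.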
Question: I can't enter a Session, how do I fix this?
Answer: The steps below assume the session status shown on the Posit Workbench Home Page is idle.
- Can the session job output be read? Try checking the job details page.
- Can all Slurm compute nodes be reached by the Posit Workbench Host?
- Is there a firewall preventing a connection over the Session Port (a random port from the ephemeral port range)?
- If the answers to the above are Yes, Yes, and No, the next step is to diagnose the session issue as you would without the Launcher.
Question: Why am I not seeing any resource metrics?
Answer:
- Is Slurm's Job Account Gathering feature enabled? If not, please see Slurm's jobacct_gather plugin configuration to enable it.
- Is the resource metric data printed when sstat --format=AveCPU,AveVMSize,AveRSS is run as the slurm-service-user?
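Whether Job Account Gathering is configured can be read from slurm.conf. The following is a sketch, assuming the feature is controlled by the JobAcctGatherType parameter and that an unset value or jobacct_gather/none means gathering is off; jobacct_enabled is a hypothetical name, and the path in the example is an assumed location.

```bash
# Hypothetical helper: inspect a slurm.conf file and report whether a
# job accounting gather plugin other than "none" is configured.
jobacct_enabled() {
    local conf="$1" value
    # Take the last uncommented JobAcctGatherType line, then strip the
    # key, any trailing comment, and trailing whitespace.
    value=$(grep -i -E '^[[:space:]]*JobAcctGatherType[[:space:]]*=' "$conf" \
        | tail -n 1 \
        | sed -e 's/^[^=]*=[[:space:]]*//' -e 's/#.*$//' -e 's/[[:space:]]*$//')
    case "$value" in
        ""|jobacct_gather/none)
            echo "Job Account Gathering is not enabled"
            return 1
            ;;
        *)
            echo "Job Account Gathering uses $value"
            return 0
            ;;
    esac
}

# Example (assumed path; use the slurm.conf your cluster actually reads):
# jobacct_enabled /etc/slurm/slurm.conf
```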
Question: What do these log entries mean?