Scaling R and RStudio


The following document presents some FAQs for scaling R and the RStudio IDE.

Q: I want to develop a platform to scale R for my organization. Can RStudio Workbench or RStudio Server help?

The first step is to determine the type of scale you are hoping to achieve. The following overview presents the three most common cases.

Use Case: Scaling for Many R Users
Problem: Regular R workflows for a team. Includes loading data subsets from files or warehouses.
Solutions: Create a platform to support large-scale individual interactive R session(s) and jobs.
Technology: RStudio Server, RStudio Workbench + Load Balancer, RStudio Workbench + Launcher

Use Case: Scaling for HPC
Problem: Embarrassingly parallel tasks like bootstrapping, cross validation, scoring, and model fitting on independent groups.
Solutions: Develop code in an interactive R session in RStudio. Submit code in batch jobs on compute R processes. R must be installed on all compute nodes.
Technology: Local: parallel, Rmpi, snow, RcppParallel; Cluster: RStudio Workbench + Launcher, Kubernetes, Slurm, LSF, Torque, Docker; Recommendation: batchtools and clustermq packages

Use Case: Scaling for Big Data
Problem: Big data, black box routines that require fitting a model against an entire domain space. Data can’t fit on one machine.
Solutions: R is an orchestration engine. Heavy lifting is done by a different compute engine on the cluster. R syntax is used to construct pipelines, and R is used to analyze results.
Technology: Hadoop, Spark, TensorFlow, Oracle BDA, Microsoft R Server, Aster

RStudio Workbench (previously RStudio Server Pro) is designed to help your organization scale for a team of R users. The tool includes features for project sharing, collaborative editing, session management, and IT administration tools like authentication, audit logs, and server performance metrics.

Q: What is the difference between RStudio Workbench's Load Balancer and Launcher?

Both RStudio Workbench's Load Balancer and Launcher are designed to support larger teams of data scientists. The Load Balancer allows you to configure two or more servers with RStudio Workbench and balance R sessions and jobs between servers. Launcher allows you to configure RStudio Workbench with an external resource manager such as Kubernetes or Slurm.

Shared workloads and resource management - R sessions and jobs running on a load-balanced cluster are not aware of workloads running on an external resource manager such as Kubernetes or Slurm, and could oversubscribe the system resources. In contrast, R sessions and jobs spawned via Launcher are submitted alongside other jobs on the external resource manager.

Scaling out horizontally - When expanding a load-balanced cluster, you will need to provision additional servers with RStudio Workbench and the required R and system dependencies. When expanding a cluster that is configured with Launcher, you can provision additional worker nodes (Kubernetes) or additional compute nodes (Slurm) separate from the base installation of RStudio Workbench.

Refer to the overview on RStudio Workbench with Launcher and FAQ for RStudio Workbench with Launcher and Kubernetes for more information on scaling your R workloads with external resource managers.

Q: Will RStudio Workbench’s load balancer run my R job across a cluster?

No. RStudio Workbench’s load balancer balances R sessions across the cluster. Each individual R session remains on a single server. Any parallelization across the cores on the server or across the cluster requires the R analyst to write or submit parallel code. (See scaling for HPC.)

A load-balanced RStudio Workbench cluster is designed to support larger teams of data scientists. The load balancer ensures that a new R session goes to the machine with the most availability, and that features like the admin dashboard and project sharing scale as you add RStudio nodes.
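As an illustration of the kind of parallel code the analyst would write, the following is a minimal sketch that spreads a bootstrap across cores on a single server with base R's parallel package. The data, the number of workers (2), the number of replicates (200), and the helper name boot_mean are all assumptions made for the example, not part of any RStudio product.

```r
library(parallel)

# Hypothetical helper: one bootstrap replicate of the mean.
# Each replicate is independent, so the task is embarrassingly parallel.
boot_mean <- function(i, x) {
  mean(sample(x, size = length(x), replace = TRUE))
}

set.seed(42)
x <- rnorm(1000, mean = 5)  # illustrative data only

# Start worker processes on this one server (2 is an arbitrary choice).
cl <- makeCluster(2)
clusterSetRNGStream(cl, iseed = 123)  # reproducible random numbers on workers

# Distribute 200 replicates across the workers and collect the results.
boot_means <- unlist(parLapply(cl, seq_len(200), boot_mean, x = x))

stopCluster(cl)

# Percentile bootstrap interval for the mean
ci <- quantile(boot_means, c(0.025, 0.975))
```

The R session itself still runs on one server; only the worker processes it starts use the additional cores.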

Q: Will RStudio Workbench’s Launcher run my R jobs in parallel across a cluster?

No. RStudio Workbench’s Launcher feature spawns R sessions and jobs on external resource managers such as Kubernetes and Slurm. Each individual R session or job remains in a single Docker container (Kubernetes) or on a single compute node (Slurm). Any parallelization across the cores on that node or across the cluster requires the R analyst to write or submit parallel code. (See scaling for HPC.)

An RStudio Workbench instance configured with Launcher is designed to support larger teams of data scientists by making use of scalable resource managers for workloads. Launcher ensures that a new R session or job will be submitted to your external resource manager, and that you can add more compute resources for a growing user base by adding more worker nodes (Kubernetes) or compute nodes (Slurm).

Q: I have an HPC cluster (LSF, Slurm, Torque). Do I need RStudio on each node?

Typically no. RStudio is used by analysts who are running R interactively. If you need to support many R users, it may make sense to install RStudio Workbench on a number of nodes and load balance between them.

Usually HPC systems are designed for batch job submission. In R, this is usually done by submitting R scripts that each run a small, independent part of a bigger problem. (Or, a single R script may be submitted many times.) Alternatively, a single R script that includes explicit code to parallelize across multiple cores or a cluster could be submitted. Either way, these scripts are usually written and tested interactively, but then submitted in batch for a full run. You could install RStudio Workbench on one of the HPC nodes to aid in developing, testing, and debugging these R scripts, but the actual job that requires the cluster will be executed in batch. This batch submission requires R, but not RStudio, to be installed on every node.

In a setup with RStudio Workbench and Launcher with Slurm, you can install RStudio Workbench on one node in the Slurm cluster and R on all of the compute nodes to spawn R jobs. To enable interactive sessions across the cluster, each compute node will require RStudio Workbench session components.
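The "single R script submitted many times" pattern described above can be sketched as follows, assuming a Slurm array job. Slurm sets the SLURM_ARRAY_TASK_ID environment variable for each array task; the per-group model, the output file naming, and the use of a temporary directory are assumptions for illustration (a real job would normally write to shared storage).

```r
# Read the task index supplied by the scheduler. Defaults to 1 so the
# script can be written and tested interactively before batch submission.
task_id <- as.integer(Sys.getenv("SLURM_ARRAY_TASK_ID", unset = "1"))

# Hypothetical problem: fit one model per group; each task handles one group.
groups <- split(mtcars, mtcars$cyl)
group_name <- names(groups)[task_id]
fit <- lm(mpg ~ wt, data = groups[[task_id]])

# Save this task's result so the pieces can be combined after all tasks finish.
# tempdir() is used only to keep the sketch self-contained.
out_file <- file.path(tempdir(), sprintf("fit_cyl_%s.rds", group_name))
saveRDS(coef(fit), out_file)
```

If this were saved as a script (say, the hypothetical fit_group.R), it could be submitted once per group with something like sbatch --array=1-3 --wrap "Rscript fit_group.R". Only R needs to be present on the compute nodes for this to run.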

Q: I have a Hadoop cluster. Where should I install RStudio?

There are many ways to interact with Hadoop from R. One of the most popular solutions is to use R in combination with Spark. In this workflow, R is an orchestrator. The analyst writes R code, and the R code in turn directs the heavy lifting to a separate computational engine (Spark). As an orchestrator, R communicates extensively with the cluster, and small, aggregated results are often brought back into R for further analysis. For those reasons, it is recommended to run R and RStudio on an edge node of the cluster.

A few solutions that follow this workflow include sparklyr and Microsoft R Server.

Q: I have a data appliance that supports R (Oracle BDA, Teradata, SAP HANA, Microsoft SQL Server). How can I use RStudio?

There are usually two types of integration between these tools and R. (Some of the tools support both types.)

Type 1: The appliance calls R, which returns its results to the appliance.

Many appliances define their own processing step that reaches out to R. For example, an analyst can write a SQL statement that includes a calculated column, where the calculation is an R function call.

RStudio cannot be directly used in this case. However, the R analyst creating the function can develop and test the code in RStudio.

Type 2: R calls the appliance, which returns its results to R.

In this use case, the appliance is treated as a data source or external computation engine for R. For example, an analyst might write a query that returns a subset of the data into R, or push the computation of a supported model into the data warehouse. Usually the integration between R and the appliance is provided by a specialized R package.

For Type 2, the RStudio IDE is used. The R package abstracts the details of communicating and accessing the appliance.
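A minimal sketch of the Type 2 pattern, using the DBI package with an in-memory SQLite database as a stand-in for the appliance. The SQLite backend (RSQLite), the table name, and the query are assumptions for illustration; a real appliance would use its own driver and connection details, which are not shown here.

```r
library(DBI)

# Stand-in for an appliance connection: an in-memory SQLite database.
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Load illustrative data into the stand-in database.
dbWriteTable(con, "cars", mtcars)

# The aggregation runs in the database; only the small result returns to R.
res <- dbGetQuery(
  con,
  "SELECT cyl, AVG(mpg) AS avg_mpg, COUNT(*) AS n
   FROM cars
   GROUP BY cyl
   ORDER BY cyl"
)

dbDisconnect(con)
```

The same division of labor applies at larger scale: the data stays in the appliance, and R receives only the subset or summary needed for further analysis.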