Where RStudio Workbench and RStudio Server store data

Infrastructure

The core infrastructure of a simple enterprise analytic workbench will probably consist of: (1) an enterprise data warehouse; (2) analytic engines; and (3) file shares. Each of these tools is used to store and process data.

Enterprise data warehouse

Enterprise data warehouses (EDW) contain live data for storage and analysis. They are built with a scalable architecture and typically come in two flavors: unstructured (e.g. Hadoop) or structured (SQL). The EDW is seen as the primary repository for pre-processed and post-processed data as well as the essential views, aggregates, and lookups that make the data meaningful. Analysts either share the EDW with other business groups or have some part of it dedicated to them to support analytic processes. In some cases, analysts have their own dedicated EDW.

File shares

The file share provides a common storage platform for the entire enterprise and will probably connect to all major systems. Like the EDW, it is built on a scalable architecture, but its primary purpose is to store files, not process them. The file share is a great place to store shared project information such as presentations, spreadsheets, timelines, and notes, and it is used for ongoing storage of any file format, including raw data stored as flat files. These drives are typically backed up automatically. The file share is typically also the location of the user home directories, which contain configuration files and other important files related to the user.

Analytic engines

Analytic engines are the tools analysts use to understand, target, measure, and optimize their data. They contain specialized analytic methods not found in other systems and can also be used for building analytic products and applications. The analytic engines are typically installed on dedicated servers designed to scale according to analytic needs. These servers have internal or attached hard drives that can be used for temporary storage. They are connected to the data warehouse and file share servers and function as part of analytic workflows that ingest, process and deliver information.

Analytic workflow needs

Analytic workflows use the infrastructure above to store four types of files. The first type is source data, which is the input to the analysis and is often stored in the data warehouse. The second type is temporary files, which are written to disk for use only during the analysis, typically in scratch space allocated to the analytic servers. The third type is project files, which hold all the code, documents, and data snapshots necessary for the project and can be stored on file share directories allocated to the project teams. The fourth type is user specific data, which holds configuration files and other user specific information.

Where RStudio Server stores data

R holds the data it is working on in memory, but it can write to data warehouses, scratch space, project directories, and home directories. RStudio Server itself writes specific information to the project directories and the home directories.

Storage

Analytic File Types    | Location                          | Example
1. Data tables         | Data warehouse                    | Tables (create table dat as)
2. Temporary files     | Analytic server scratch space     | Working files (write dat.csv)
3. Project files       | File share project directories    | Code, documents, data snapshots (save dat.RData)
4. User specific files | File share home directories       | User configuration and session data (.rstudio)
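
The temporary-file and project-file rows can be illustrated with a minimal base R sketch. The data frame `dat` is made-up example data, and the "project directory" here is a hypothetical stand-in created under tempdir() so that the sketch does not touch a real file share.

```r
# Hypothetical example data; `dat` mirrors the name used in the table.
dat <- data.frame(id = 1:3, value = c(10.5, 20.1, 30.7))

# Temporary file: tempdir() is R's per-session scratch directory.
scratch_file <- file.path(tempdir(), "dat.csv")
write.csv(dat, scratch_file, row.names = FALSE)

# Project file: a stand-in project directory (assumed path, not a real share).
project_dir <- file.path(tempdir(), "my-project")
dir.create(project_dir, showWarnings = FALSE)
snapshot_file <- file.path(project_dir, "dat.RData")
save(dat, file = snapshot_file)

# Both files now exist on disk, independent of the in-memory copy of `dat`.
file.exists(scratch_file)
file.exists(snapshot_file)
```

On a real deployment the scratch location and the project directory would be whatever paths your administrators have allocated on the analytic server and the file share.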

Project files

When you create a new project in RStudio Server the system creates a small configuration file with the .Rproj extension in the project directory. RStudio Server looks for this file so it can identify the directory as a project. It also creates a hidden directory called .Rproj.user that contains other settings about that project, such as project sharing in RStudio Workbench (previously RStudio Server Pro). You might also keep the code in the project directory under a version control system such as Git or SVN.

When you close a project down, you will be asked if you want to save your workspace image to an .RData file. If you select yes, all of the data stored in memory will be written to your project directory. This feature is part of R and has always been included as a way to restore your workspace at some later date.
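
As a rough sketch of what saving and restoring a workspace image involves, the base R code below writes a set of objects to an .RData file and loads them back. To avoid disturbing a real session it uses a separate environment as a stand-in for the workspace and a throwaway directory as a stand-in for the project directory; both, along with the object names, are assumptions made for illustration. In an actual session, save.image() performs the same kind of save for the global environment.

```r
# Throwaway directory standing in for the project directory.
project_dir <- file.path(tempdir(), "workspace-demo")
dir.create(project_dir, showWarnings = FALSE)
image_file <- file.path(project_dir, ".RData")

# A separate environment plays the role of the workspace.
workspace <- new.env()
assign("dat", data.frame(x = 1:5), envir = workspace)
assign("model_note", "fit pending", envir = workspace)

# Write every object in the stand-in workspace to the image file.
save(list = ls(workspace), envir = workspace, file = image_file)

# Later: restore the objects into a fresh environment.
restored <- new.env()
loaded_names <- load(image_file, envir = restored)
sort(loaded_names)
```

load() returns the names of the objects it restored, which is a convenient way to check what an .RData file contained.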

User specific files

The first time a user logs into RStudio Server, a directory called .rstudio is created in that user's home directory. The .rstudio directory contains information such as user history logs, dictionaries, add-ins, client state, R version settings, and much more. This information is necessary to support a good user experience.

RStudio Server contains a feature that will allow it to automatically save suspended sessions to file. If a session has been idle for too long, RStudio Server will suspend the session and save the image under ~/.rstudio/sessions which is located on the home drive. This feature keeps the shared server environment clean while allowing analysts to resume their work at any time.

Notice that when users manually close a session they can save their workspace image into .RData under the project directory, while RStudio Server will save suspended sessions into .rstudio under the home directory. If you want to turn off the suspended session feature you can set the following option in /etc/rstudio/rsession.conf:

session-timeout-minutes=0

As of RStudio version 1.0 the session timeout can optionally be configured per user or for specific groups by adding session-timeout-minutes in the /etc/rstudio/profiles file, for example: 

[@powerusers]
session-timeout-minutes=0
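
To check what a configuration file currently sets, an administrator could read the option back with a few lines of base R. This is a minimal sketch only: `read_conf_option` is a hypothetical helper, not part of RStudio, and it assumes plain key=value lines with no spaces around the equals sign, `#` for comments, and that the last occurrence of a repeated key is the one of interest. It runs against a temporary file rather than the real /etc/rstudio/rsession.conf.

```r
# Hypothetical helper: read one key=value option from a conf-style file.
read_conf_option <- function(path, key) {
  lines <- trimws(readLines(path, warn = FALSE))
  # Drop blank lines and comment lines (assumed to start with "#").
  lines <- lines[nzchar(lines) & !startsWith(lines, "#")]
  # Keep only lines of the form key=value for the requested key.
  hits <- lines[startsWith(lines, paste0(key, "="))]
  if (length(hits) == 0) return(NA_character_)
  # This sketch takes the last occurrence if the key is repeated.
  sub("^[^=]*=", "", hits[length(hits)])
}

# Demo against a temporary file; on a server the path would be
# /etc/rstudio/rsession.conf.
conf_file <- tempfile(fileext = ".conf")
writeLines(c("# session settings", "session-timeout-minutes=0"), conf_file)

timeout <- read_conf_option(conf_file, "session-timeout-minutes")
suspension_disabled <- identical(timeout, "0")
```

The helper returns the value as a string, or NA when the key is absent, so a missing option can be told apart from an explicit 0.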

A typical analytic workflow in R will read information into memory, perform some analytic operation on the data, and write the results back out to disk. If the user closes the session down manually, the workspace image will be written to the project directory. If RStudio Server suspends the session automatically, the session will be written to the home directory.

 
