Infrastructure
The core infrastructure of a simple enterprise analytic workbench will probably consist of: (1) an enterprise data warehouse; (2) analytic engines; and (3) file shares. Each of these tools is used to store and process data.
Enterprise data warehouse
Enterprise data warehouses (EDWs) contain live data for storage and analysis. They are built with a scalable architecture and typically come in two flavors: unstructured (e.g. Hadoop) or structured (SQL). The EDW is seen as the primary repository for pre-processed and post-processed data as well as the essential views, aggregates, and lookups that make the data meaningful. Analysts either share the EDW with other business groups or have some part of it dedicated to them to support analytic processes. In some cases, analysts have their own dedicated EDW.
File shares
The file share provides a common storage platform for the entire enterprise and will probably connect to all major systems. It is also built on a scalable architecture, but unlike the EDW its primary purpose is to store files, not process them. The file share is a great place to store shared project information such as presentations, spreadsheets, timelines, and notes. It is used for ongoing storage of any file format, including raw data stored as flat files. These drives are typically backed up automatically. The file share is typically the location of the user home directories as well. The home directories contain configuration files and other important files related to the user.
Analytic engines
Analytic engines are the tools analysts use to understand, target, measure, and optimize their data. They contain specialized analytic methods not found in other systems and can also be used for building analytic products and applications. The analytic engines are typically installed on dedicated servers designed to scale according to analytic needs. These servers have internal or attached hard drives that can be used for temporary storage. They are connected to the data warehouse and file share servers and function as part of analytic workflows that ingest, process and deliver information.
Analytic workflow needs
Analytic workflows use the infrastructure above to store four types of files. The first type is source data, which is used to carry out the analysis and is often stored in the data warehouse. The second type is temporary files, which are created and used only during the analysis; they are typically written to scratch space allocated to the analytic servers. The third type is project files, which hold all the code, documents, and data snapshots necessary for the project; they can be stored on file share directories allocated to the project teams. The fourth type is user-specific data, such as configuration files and other user-specific information, which is typically stored in the home directories on the file share.
Where RStudio Server stores data
R holds its working data in memory, but it can write to data warehouses, scratch space, project directories, and home directories. RStudio Server itself writes specific information to the project directories and the home directories.
| Analytic file type | Location | Example |
|---|---|---|
| 1. Data tables | Data warehouse | Tables (create table dat as) |
| 2. Temporary files | Analytic server scratch space | Working files (write dat.csv) |
| 3. Project files | File share project directories | Code, documents, data snapshots (save dat.RData) |
| 4. User-specific files | File share home directories | User configuration and session data (.rstudio) |
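As a rough illustration of rows 2 and 3 of the table, the sketch below writes a working file to scratch space and a data snapshot to a project directory using base R only. The data frame `dat` and the `project_dir` location are hypothetical, and `tempdir()` merely stands in for the scratch space and file share paths that a real server would provide.

```r
# Hypothetical example data; in practice this would come from the data warehouse.
dat <- data.frame(id = 1:3, value = c(10.5, 20.1, 30.7))

# Temporary file: tempdir() stands in for the analytic server's scratch space.
scratch_file <- file.path(tempdir(), "dat.csv")
write.csv(dat, scratch_file, row.names = FALSE)

# Project file: a hypothetical project directory standing in for a file share path.
project_dir <- file.path(tempdir(), "my_project")
dir.create(project_dir, showWarnings = FALSE)
snapshot_file <- file.path(project_dir, "dat.RData")
save(dat, file = snapshot_file)

# User-specific files live under the home directory, which R can locate like this.
home_dir <- path.expand("~")
```

On a real deployment the two `file.path()` calls would point at the scratch volume and the project share chosen by the administrators rather than at `tempdir()`.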
Project files
When you create a new project in RStudio Server, the system creates a small configuration file with the .Rproj extension. RStudio Server looks for this file so it can identify the directory as a project. It also creates a hidden directory called .Rproj.user that contains other settings about that project, such as project sharing in RStudio Workbench (previously RStudio Server Pro). You might also store your code in the project directory using a version control system such as Git or SVN.
When you close a project, you will be asked if you want to save your workspace image to an .RData file. If you select yes, all of the data stored in memory will be written to your project directory. This feature is part of R and has always been included as a way to restore your workspace at some later date.
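The save-and-restore behaviour described above can be approximated by hand in base R. This is a minimal sketch, assuming a hypothetical project directory and two made-up objects. It uses `save()` on the current environment, which is what R's `save.image()` shortcut does for the global environment, and then restores the image into a separate environment so the result can be inspected.

```r
# Hypothetical objects standing in for an analyst's workspace.
dat <- data.frame(id = 1:3, value = c(10.5, 20.1, 30.7))
model_note <- "baseline"

# Hypothetical project directory (a temporary folder for this sketch).
project_dir <- file.path(tempdir(), "demo_project")
dir.create(project_dir, showWarnings = FALSE)
image_file <- file.path(project_dir, ".RData")

# Write every object in the current environment to the workspace image.
# At the top level, save.image(file = image_file) is the equivalent shortcut.
save(list = ls(all.names = TRUE), file = image_file)

# Later: restore the image into a fresh environment instead of overwriting
# whatever is currently in memory.
restored <- new.env()
load(image_file, envir = restored)
restored_names <- ls(restored)
```

Loading into a separate environment is a design choice for the example only; when R restores a workspace at startup it loads the objects straight into the global environment.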
User specific files
The first time a user logs into RStudio Server, a directory called .rstudio is created in that user's home directory. The .rstudio directory contains information such as user history logs, dictionaries, add-ins, client state, R version settings, and much more. This information is necessary to support a good user experience.
RStudio Server contains a feature that allows it to automatically save suspended sessions to disk. If a session has been idle for too long, RStudio Server will suspend the session and save the image under ~/.rstudio/sessions, which is located on the home drive. This feature keeps the shared server environment clean while allowing analysts to resume their work at any time.
Notice that when users manually close a session they can save their workspace image into .RData under the project directory, while RStudio Server will save suspended sessions into .rstudio under the home directory. If you want to turn off the suspended session feature, you can set the following option in /etc/rstudio/rsession.conf:
session-timeout-minutes=0
As of RStudio version 1.0, the session timeout can optionally be configured per user or for specific groups by adding session-timeout-minutes to the /etc/rstudio/profiles file, for example:
[@powerusers]
session-timeout-minutes=0
A typical analytic workflow in R will read information into memory, perform some analytic operation on the data, and write information back out to disk. If the user closes the session manually, the workspace image will be written to the project directory. If RStudio Server suspends the session automatically, the session will be written to the home directory.
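The read, process, and write cycle can be sketched in a few lines of base R. The file names, the `sales` data, and the choice of a per-region total are all hypothetical; the source file is generated inside the example only so that it runs on its own.

```r
# Hypothetical source data, written to a temporary CSV so the sketch is self-contained.
source_file <- file.path(tempdir(), "sales.csv")
write.csv(
  data.frame(region = c("east", "east", "west"), amount = c(100, 50, 75)),
  source_file,
  row.names = FALSE
)

# 1. Read information into memory.
sales <- read.csv(source_file)

# 2. Perform an analytic operation: total amount per region.
totals <- aggregate(amount ~ region, data = sales, FUN = sum)

# 3. Write the result back out to disk.
result_file <- file.path(tempdir(), "totals.csv")
write.csv(totals, result_file, row.names = FALSE)
```

Until step 3 runs, `sales` and `totals` exist only in the session's memory, which is why the workspace image or suspended session is what preserves them if the session ends first.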