If your primary objective is to query your data in Hadoop to browse, manipulate, and extract it into R, then you probably want to use SQL. You can write SQL code explicitly to interact with Hadoop, or you can write SQL code implicitly with dplyr. The dplyr package has a generalized backend for data sources that translates your R code into SQL. You can use RStudio and dplyr to work with several of the most popular software packages in the Hadoop ecosystem, including Hive, Impala, HBase, and Spark.
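To see this translation at work before touching a cluster, the dbplyr package provides simulated connections that preview the SQL a pipeline would generate. A minimal sketch using lazy_frame() and simulate_hive() from dbplyr; the column names are made up for illustration:

library(dplyr)
library(dbplyr)

# A simulated Hive connection: dbplyr previews the SQL it would emit
# without needing a live cluster
hive_tbl <- lazy_frame(year = 1L, region = "a", con = simulate_hive())

hive_tbl %>%
  filter(year == 2017) %>%
  count(region) %>%   # translates to GROUP BY / COUNT(*)
  show_query()        # prints the Hive SQL that dplyr would run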
There are two methods for accessing data in Hadoop using dplyr and SQL.
ODBC
You can connect R and RStudio to Hadoop with an ODBC connection. This effectively treats Hadoop like any other data source (i.e., as if Hadoop were a relational database). You will need a data-source-specific driver (e.g., Hive, Impala, or HBase) installed on your desktop or your server. You will also need a few R packages. We recommend using these R packages: DBI, dplyr, and odbc. Note that the dplyr package may also reference the dbplyr package to help translate R into specific variants of SQL. You can use the odbc package to create a connection with Hadoop and run queries:
library(DBI)
library(dplyr)
library(odbc)

# Connect to Hadoop through the ODBC driver for your data source
con <- dbConnect(odbc::odbc(),
                 driver = <driver>,
                 host = <host>,
                 dbname = <dbname>,
                 user = <user>,
                 password = <password>,
                 port = 10000)

tbl(con, "mytable")                       # dplyr
dbGetQuery(con, "SELECT * FROM mytable")  # SQL
dbDisconnect(con)
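Note that the dplyr table reference is lazy: verbs build up a query that only runs on Hadoop when you ask for results. A minimal sketch, assuming the con and "mytable" from the snippet above; the columns year and region are hypothetical:

library(dplyr)

mytable <- tbl(con, "mytable")  # lazy reference, no data pulled yet

result <- mytable %>%
  filter(year == 2017) %>%      # hypothetical column
  count(region) %>%             # hypothetical column
  collect()                     # executes on Hadoop, returns a tibble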
Spark
If you are running Spark on Hadoop, you may also elect to use the sparklyr package to access your data in HDFS. Spark is a general engine for large-scale data processing, and it supports SQL. The sparklyr package communicates with the Spark API to run SQL queries, and it also has a dplyr backend. You can use sparklyr to create a connection with Spark and run queries:
library(sparklyr)
library(DBI)
library(dplyr)

# Connect to Spark running on YARN, then query the same table both ways
con <- spark_connect(master = "yarn-client")

tbl(con, "mytable")                       # dplyr
dbGetQuery(con, "SELECT * FROM mytable")  # SQL
spark_disconnect(con)
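Because Spark reads directly from HDFS, you can also load files into Spark DataFrames and then query them with dplyr or SQL. A minimal sketch, assuming an active connection con as above; the HDFS path and table name are hypothetical:

library(sparklyr)

# Read CSV files from HDFS into a Spark DataFrame registered as "logs";
# the path is hypothetical and depends on your cluster layout
logs <- spark_read_csv(con, name = "logs",
                       path = "hdfs:///user/analyst/logs/*.csv")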
A Spark driver program is typically installed on a Hadoop "edge" node. If you are using Spark in cluster mode with YARN, you should connect to Spark via this driver node. If you are using RStudio Server or RStudio Workbench (previously RStudio Server Pro), it should also be installed on a driver node: running RStudio and sparklyr on the driver node ensures the best experience with Spark and R.