Sunday, April 21, 2013

R Hadoop example on EC2

 Note: you can also use the following site (https://github.com/hipic/r-hadoop) as an alternative source for the code and configuration files.

1. Download Jeffrey’s “R Hadoop Whirr” tutorial using git:

$ git clone git://github.com/jeffreybreen/tutorial-201209-TDWI-big-data.git



2. Copy Jeffrey’s Hadoop property file into Whirr’s recipes directory:
$ cp ~/tutorial-201209-TDWI-big-data/config/whirr-ec2/hadoop-ec2.properties ./recipes/


3. Launch a Cloudera Hadoop (CDH) cluster using Whirr with Jeffrey’s hadoop-ec2.properties:
$ bin/whirr launch-cluster --config recipes/hadoop-ec2.properties  --private-key-file ~/.ssh/id_rsa_whirr
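
For orientation, a Whirr recipe like hadoop-ec2.properties typically carries settings along the following lines. The instance counts, hardware type, and key paths below are illustrative placeholders, not Jeffrey’s exact values:

whirr.cluster-name=hadoop
whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker,2 hadoop-datanode+hadoop-tasktracker
whirr.provider=aws-ec2
whirr.identity=${env:AWS_ACCESS_KEY_ID}
whirr.credential=${env:AWS_SECRET_ACCESS_KEY}
whirr.hardware-id=m1.large
whirr.private-key-file=${sys:user.home}/.ssh/id_rsa_whirr
whirr.public-key-file=${whirr.private-key-file}.pub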


4. Open Jeffrey’s script file ‘install-r-CDH4.sh’ and make the following replacements:
#git clone https://github.com/RevolutionAnalytics/RHadoop.git
# Replaced by Jongwook Woo
git clone git://github.com/RevolutionAnalytics/RHadoop.git
...
# add 'plyr'
install.packages( c('Rcpp','RJSONIO', 'digest', 'functional', 'stringr', 'plyr'), repos="http://cran.revolutionanalytics.com", INSTALL_opts=c('--byte-compile') )

5. Run Jeffrey’s script file using Whirr to install R and the required packages on all EC2 nodes:
jongwook@ubuntu:~/tutorial-201209-TDWI-big-data/config$ ~/apache/whirr-0.8.1/bin/whirr run-script --script install-r-CDH4.sh --config ~/apache/whirr-0.8.1/recipes/hadoop-ec2.properties  --private-key-file ~/.ssh/id_rsa_whirr


If successful, you will see output like the following, and rmr2 will be installed at /usr/lib64/R/library/:
** building package indices
** testing if installed package can be loaded


* DONE (rmr2)
Making packages.html  ... done



6. On the EC2 master node, test R with Hadoop:
$ mkdir rtest
$ cd rtest/
$ git clone git://github.com/jeffreybreen/hadoop-R.git


or
$ git init
Initialized empty Git repository in /home/users/jongwook/rtest/.git/
$ git pull git://github.com/jeffreybreen/hadoop-R.git



$ git clone git://github.com/ekohlwey/rhadoop-examples.git


7. Download the airline on-time data, keeping only the first 1,000 lines of each file:
~/rtest$ curl http://stat-computing.org/dataexpo/2009/2008.csv.bz2 | bzcat | head -1000 > air-2008.csv


~/rtest$ curl http://stat-computing.org/dataexpo/2009/2004.csv.bz2 | bzcat | head -1000 > 2004-1000.csv
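
To sanity-check a download before loading it into HDFS, you can peek at the first rows in R; the dataset’s header line survives the head -1000 cut, so read.csv picks up the column names:

flights <- read.csv("air-2008.csv", nrows = 5)
flights[, c("Year", "Month", "UniqueCarrier", "DepDelay")]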


8. Make HDFS directories and upload the data:
~/rtest$ hadoop fs -mkdir airline
~/rtest$ hadoop fs -ls /user
Found 2 items
drwxr-xr-x   - hdfs     supergroup          0 2013-04-20 20:32 /user/hive
drwxr-xr-x   - jongwook supergroup          0 2013-04-20 21:30 /user/jongwook
~/rtest$ hadoop fs -ls /user/jongwook
Found 1 items
drwxr-xr-x   - jongwook supergroup          0 2013-04-20 21:30 /user/jongwook/airline
~/rtest$ hadoop fs -mkdir airline/data
~/rtest$ hadoop fs -ls airline
Found 1 items
drwxr-xr-x   - jongwook supergroup          0 2013-04-20 21:30 airline/data
~/rtest$ hadoop fs -mkdir airline/out
~/rtest$ hadoop fs -put  air-2008.csv airline/data/
~/rtest$ hadoop fs -mkdir airline/data04
~/rtest$ hadoop fs -put  2004-1000.csv airline/data04/


9. Run Jeffrey’s R code using the Hadoop rmr2 package:
$ export LD_LIBRARY_PATH=/usr/lib/hadoop-0.20-mapreduce/lib/native/Linux-amd64-64/
~/rtest$ cd hadoop-R/airline/src/deptdelay_by_month/R/rmr
$ cp deptdelay-rmr.R dd.R
$ vi dd.R
Replace the following:


#library(rmr)
library(rmr2)


...
#textinputformat = csvtextinputformat,
           input.format = "csv", # csvtextinputformat,
...
#from.dfs(deptdelay("/data/airline/1987.csv", "/dept-delay-month"))
from.dfs(deptdelay("airline/data04/2004-1000.csv", "dept-delay-month-orig"))
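
Putting the pieces together, the edited dd.R amounts to something like the sketch below. Treat it as an illustration of the rmr2 mapreduce() API rather than Jeffrey’s verbatim code; in the airline data, Year, Month, UniqueCarrier, and DepDelay are columns 1, 2, 9, and 16:

#!/usr/bin/env Rscript
library(rmr2)

deptdelay <- function(input, output) {
  mapreduce(
    input = input,
    output = output,
    input.format = "csv",   # comma-separated airline records
    map = function(k, v) {
      # v is a data frame of CSV columns; header rows parse to NA and are dropped
      delay <- suppressWarnings(as.numeric(as.character(v[[16]])))
      keep <- !is.na(delay)
      # key on year|month|carrier; value is the departure delay in minutes
      keyval(paste(v[[1]][keep], v[[2]][keep], v[[9]][keep], sep = "|"),
             delay[keep])
    },
    reduce = function(key, delays) {
      # average departure delay per (year, month, carrier)
      keyval(key, mean(delays))
    })
}

from.dfs(deptdelay("airline/data04/2004-1000.csv", "dept-delay-month-orig"))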


10. Run:
$ ./dd.R
$ hadoop fs -cat dept-delay-month-orig/part-00001 | tail
SEQ/org.apache.hadoop.typedbytes.TypedBytesWritable/org.apache.hadoop.typedbytes.TypedBytesWritable�'W5�l
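
The gibberish after SEQ/... is expected: rmr2 stores its results as binary TypedBytes sequence files, so hadoop fs -cat just dumps raw bytes. It is easier to read the results back from R, for example:

library(rmr2)
out <- from.dfs("dept-delay-month-orig")
# keys()/values() unpack the key-value pairs returned by from.dfs()
head(data.frame(key = keys(out), avg.delay = values(out)))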


== The following seems to work with rmr 1.3.1 but not with rmr2; you may still try it ==
9. Run Jeffrey’s R code using Hadoop streaming
~/rtest$ cd hadoop-R
~/rtest/hadoop-R$ cd airline/src/deptdelay_by_month/R/streaming/
~/rtest/hadoop-R/airline/src/deptdelay_by_month/R/streaming$ /usr/lib/hadoop-0.20-mapreduce/bin/hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.2.0.jar -input airline/data -output airline/out/dept-delay-month -mapper map.R -reducer reduce.R -file map.R -file reduce.R


$ export HADOOP_HOME=/usr/lib/hadoop-0.20-mapreduce/
~/rtest$ $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.2.0.jar -input airline/data04 -output airline/out/dept-delay-month -mapper map.R -reducer reduce.R -file map.R -file reduce.R
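
For orientation, the streaming version is plain R: the mapper reads flight records from stdin and writes tab-separated key/value lines, and Hadoop sorts by key before the reducer aggregates. A minimal mapper in that style might look like this (an illustrative sketch, not Jeffrey’s exact map.R):

#!/usr/bin/env Rscript
# Emit "year|month|carrier <TAB> DepDelay" for every flight record
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1)) > 0) {
  f <- unlist(strsplit(line, ","))
  # skip the header row and flights with no recorded departure delay
  if (f[1] != "Year" && f[16] != "NA") {
    cat(paste(f[1], f[2], f[9], sep = "|"), "\t", f[16], "\n", sep = "")
  }
}
close(con)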


10. When the streaming job completes successfully, its results are written to airline/out/dept-delay-month.


11. Read the data:
$ hadoop fs -cat airline/out/dept-delay-month/part-00000



