1. Download Jeffrey’s “R Hadoop Whirr” tutorial using git:
$ git clone git://github.com/jeffreybreen/tutorial-201209-TDWI-big-data.git
2. Copy Jeffrey’s Hadoop properties file into Whirr’s recipes directory:
$ cp ~/tutorial-201209-TDWI-big-data/config/whirr-ec2/hadoop-ec2.properties ./recipes/
3. Launch a Cloudera Hadoop (CDH) cluster on EC2 using Whirr with Jeffrey’s hadoop-ec2.properties:
$ bin/whirr launch-cluster --config recipes/hadoop-ec2.properties --private-key-file ~/.ssh/id_rsa_whirr
4. Open Jeffrey’s script file ‘install-r-CDH4.sh’ and make the following replacements:
#git clone https://github.com/RevolutionAnalytics/RHadoop.git
# Replaced by Jongwook Woo
git clone git://github.com/RevolutionAnalytics/RHadoop.git
...
# add 'plyr'
install.packages( c('Rcpp','RJSONIO', 'digest', 'functional','stringr', 'plyr'), repos="http://cran.revolutionanalytics.com", INSTALL_opts=c('--byte-compile') )
5. Run Jeffrey’s script file using Whirr to install R and the RHadoop packages on all EC2 nodes:
jongwook@ubuntu:~/tutorial-201209-TDWI-big-data/config$ ~/apache/whirr-0.8.1/bin/whirr run-script --script install-r-CDH4.sh --config ~/apache/whirr-0.8.1/recipes/hadoop-ec2.properties --private-key-file ~/.ssh/id_rsa_whirr
If successful, you will see output like the following, and rmr2 will be installed in /usr/lib64/R/library/:
** building package indices
** testing if installed package can be loaded
* DONE (rmr2)
Making packages.html ... done
6. On the EC2 master node, test R with Hadoop:
$ mkdir rtest
$ cd rtest/
$ git clone git://github.com/jeffreybreen/hadoop-R.git
or, alternatively, initialize an empty repository and pull:
$ git init
Initialized empty Git repository in /home/users/jongwook/rtest/.git/
$ git pull git://github.com/jeffreybreen/hadoop-R.git
Another example is available at https://github.com/ekohlwey/rhadoop-examples:
$ git clone git://github.com/ekohlwey/rhadoop-examples.git
7. Download the airline on-time schedule data, keeping only the first 1,000 rows of each file for a quick test:
~/rtest$ curl http://stat-computing.org/dataexpo/2009/2008.csv.bz2 | bzcat | head -1000 > air-2008.csv
~/rtest$ curl http://stat-computing.org/dataexpo/2009/2004.csv.bz2 | bzcat | head -1000 > 2004-1000.csv
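Since only the first 1,000 rows of each file are kept, a quick local sanity check (no Hadoop required; the file names below are made up for illustration) confirms how the head -1000 truncation behaves:

```shell
# Emulate the curl ... | bzcat | head -1000 pipeline above with local data:
# generate a throwaway 5,000-row CSV, then keep only the first 1,000 rows.
seq 1 5000 | awk '{print $1",DUMMY,2008"}' > sample.csv
head -1000 sample.csv > sample-1000.csv
wc -l < sample-1000.csv   # should report 1000
```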
8. Make HDFS directories:
~/rtest$ hadoop fs -mkdir airline
~/rtest$ hadoop fs -ls /user
Found 2 items
drwxr-xr-x - hdfs supergroup 0 2013-04-20 20:32 /user/hive
drwxr-xr-x - jongwook supergroup 0 2013-04-20 21:30 /user/jongwook
~/rtest$ hadoop fs -ls /user/jongwook
Found 1 items
drwxr-xr-x - jongwook supergroup 0 2013-04-20 21:30 /user/jongwook/airline
~/rtest$ hadoop fs -mkdir airline/data
~/rtest$ hadoop fs -ls airline
Found 1 items
drwxr-xr-x - jongwook supergroup 0 2013-04-20 21:30 airline/data
~/rtest$ hadoop fs -mkdir airline/out
~/rtest$ hadoop fs -put air-2008.csv airline/data/
~/rtest$ hadoop fs -mkdir airline/data04
~/rtest$ hadoop fs -put 2004-1000.csv airline/data04/
9. Run Jeffrey’s R code using the Hadoop rmr2 package
$ export LD_LIBRARY_PATH=/usr/lib/hadoop-0.20-mapreduce/lib/native/Linux-amd64-64/
~/rtest$ cd hadoop-R/airline/src/deptdelay_by_month/R/rmr
$ cp deptdelay-rmr.R dd.R
$ vi dd.R
Replace the following:
#library(rmr)
library(rmr2)
...
#textinputformat = csvtextinputformat,
input.format = "csv", # csvtextinputformat,
...
#from.dfs(deptdelay("/data/airline/1987.csv", "/dept-delay-month"))
from.dfs(deptdelay("airline/data04/2004-1000.csv", "dept-delay-month-orig"))
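Alternatively, the three edits above can be scripted with sed rather than made by hand in vi. This is a sketch that assumes deptdelay-rmr.R still contains Jeffrey’s original lines verbatim (GNU sed’s -i flag edits the file in place):

```shell
cp deptdelay-rmr.R dd.R
# load the rmr2 package instead of the old rmr
sed -i 's/^library(rmr)$/library(rmr2)/' dd.R
# rmr2 replaced the textinputformat argument with input.format
sed -i 's/textinputformat = csvtextinputformat,/input.format = "csv", # csvtextinputformat,/' dd.R
# point the job at the 1,000-row test file uploaded in step 8
sed -i 's|"/data/airline/1987.csv", "/dept-delay-month"|"airline/data04/2004-1000.csv", "dept-delay-month-orig"|' dd.R
```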
10. Run:
$ ./dd.R
$ hadoop fs -cat dept-delay-month-orig/part-00001 | tail
SEQ/org.apache.hadoop.typedbytes.TypedBytesWritable/org.apache.hadoop.typedbytes.TypedBytesWritable...
Note that the output is a binary Hadoop sequence file, so hadoop fs -cat prints mostly unreadable bytes; the human-readable results are the data returned by the from.dfs() call inside dd.R.
== The following worked with rmr 1.3.1 but does not seem to work with rmr2; you may still try it ==
9. Run Jeffrey’s R code using Hadoop streaming
~/rtest$ cd hadoop-R
~/rtest/hadoop-R$ cd airline/src/deptdelay_by_month/R/streaming/
~/rtest/hadoop-R/airline/src/deptdelay_by_month/R/streaming$ /usr/lib/hadoop-0.20-mapreduce/bin/hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.2.0.jar -input airline/data -output airline/out/dept-delay-month -mapper map.R -reducer reduce.R -file map.R -file reduce.R
$ export HADOOP_HOME=/usr/lib/hadoop-0.20-mapreduce/
If the output directory already exists from the previous run, remove it first with ‘hadoop fs -rm -r airline/out/dept-delay-month’, since Hadoop refuses to overwrite an existing output directory. Then:
~/rtest$ $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.2.0.jar -input airline/data04 -output airline/out/dept-delay-month -mapper map.R -reducer reduce.R -file map.R -file reduce.R
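Because Hadoop streaming simply pipes records through the mapper and reducer via stdin/stdout, the same dataflow can be smoke-tested locally without a cluster. The awk mapper/reducer below are toy stand-ins for illustration, not Jeffrey’s map.R/reduce.R:

```shell
# Hadoop streaming is conceptually: cat input | mapper | sort | reducer.
# Toy mapper: emit "key<TAB>1" per input line; toy reducer: sum counts per key.
printf 'a\nb\na\na\n' \
  | awk '{print $1"\t1"}' \
  | sort \
  | awk -F'\t' '{sum[$1]+=$2} END {for (k in sum) print k"\t"sum[k]}' \
  | sort
# prints:
# a	3
# b	1
```

The real job replaces the toy scripts with -mapper map.R -reducer reduce.R and reads its input from HDFS instead of stdin.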
10. If the job completes successfully, Hadoop prints the job progress and counters to the console.
11. Read the output data:
$ hadoop fs -cat airline/out/dept-delay-month/part-00000
Reference
4. http://www.slideshare.net/jseidman/distributed-data-analysis-with-r-strangeloop-2011