Sunday, April 21, 2013

R Hadoop example on EC2

 Note: you can also use the following site (https://github.com/hipic/r-hadoop) as an alternative source for the code and configuration files.

1. Download Jeffrey’s “R Hadoop Whirr” tutorial using git:

$ git clone git://github.com/jeffreybreen/tutorial-201209-TDWI-big-data.git



2. Copy Jeffrey’s Hadoop property file into Whirr’s recipes directory:
$ cp ~/tutorial-201209-TDWI-big-data/config/whirr-ec2/hadoop-ec2.properties ./recipes/


3. Launch a Cloudera Hadoop (CDH) cluster using Whirr with Jeffrey’s hadoop-ec2.properties:
$ bin/whirr launch-cluster --config recipes/hadoop-ec2.properties  --private-key-file ~/.ssh/id_rsa_whirr
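
For orientation, a Whirr recipe like hadoop-ec2.properties typically carries settings along the following lines. The instance counts, hardware type, and key paths below are illustrative placeholders, not Jeffrey’s exact values:

whirr.cluster-name=hadoop
whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker,2 hadoop-datanode+hadoop-tasktracker
whirr.provider=aws-ec2
whirr.identity=${env:AWS_ACCESS_KEY_ID}
whirr.credential=${env:AWS_SECRET_ACCESS_KEY}
whirr.hardware-id=m1.large
whirr.private-key-file=${sys:user.home}/.ssh/id_rsa_whirr
whirr.public-key-file=${whirr.private-key-file}.pub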


4. Open Jeffrey’s script file ‘install-r-CDH4.sh’ and make the following replacements:
#git clone https://github.com/RevolutionAnalytics/RHadoop.git
# Replaced by Jongwook Woo
git clone git://github.com/RevolutionAnalytics/RHadoop.git
...
# add 'plyr'
install.packages( c('Rcpp','RJSONIO', 'digest', 'functional', 'stringr', 'plyr'), repos="http://cran.revolutionanalytics.com", INSTALL_opts=c('--byte-compile') )

5. Run Jeffrey’s script file using Whirr to install R and the required packages on all EC2 nodes:
jongwook@ubuntu:~/tutorial-201209-TDWI-big-data/config$ ~/apache/whirr-0.8.1/bin/whirr run-script --script install-r-CDH4.sh --config ~/apache/whirr-0.8.1/recipes/hadoop-ec2.properties  --private-key-file ~/.ssh/id_rsa_whirr


If successful, you will see output like the following, and rmr2 will be installed at /usr/lib64/R/library/:
** building package indices
** testing if installed package can be loaded


* DONE (rmr2)
Making packages.html  ... done



6. On the EC2 master node, test R with Hadoop:
$ mkdir rtest
$ cd rtest/
$ git clone git://github.com/jeffreybreen/hadoop-R.git


or
$ git init
Initialized empty Git repository in /home/users/jongwook/rtest/.git/
$ git pull git://github.com/jeffreybreen/hadoop-R.git



$ git clone git://github.com/ekohlwey/rhadoop-examples.git


7. Download the airline on-time data, keeping only the first 1,000 lines of each file:
~/rtest$ curl http://stat-computing.org/dataexpo/2009/2008.csv.bz2 | bzcat | head -1000 > air-2008.csv


~/rtest$ curl http://stat-computing.org/dataexpo/2009/2004.csv.bz2 | bzcat | head -1000 > 2004-1000.csv
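
To sanity-check a download before loading it into HDFS, you can peek at the first rows in R; the dataset’s header line survives the head -1000 cut, so read.csv picks up the column names:

flights <- read.csv("air-2008.csv", nrows = 5)
flights[, c("Year", "Month", "UniqueCarrier", "DepDelay")]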


8. Make HDFS directories and upload the data:
~/rtest$ hadoop fs -mkdir airline
~/rtest$ hadoop fs -ls /user
Found 2 items
drwxr-xr-x   - hdfs     supergroup          0 2013-04-20 20:32 /user/hive
drwxr-xr-x   - jongwook supergroup          0 2013-04-20 21:30 /user/jongwook
~/rtest$ hadoop fs -ls /user/jongwook
Found 1 items
drwxr-xr-x   - jongwook supergroup          0 2013-04-20 21:30 /user/jongwook/airline
~/rtest$ hadoop fs -mkdir airline/data
~/rtest$ hadoop fs -ls airline
Found 1 items
drwxr-xr-x   - jongwook supergroup          0 2013-04-20 21:30 airline/data
~/rtest$ hadoop fs -mkdir airline/out
~/rtest$ hadoop fs -put  air-2008.csv airline/data/
~/rtest$ hadoop fs -mkdir airline/data04
~/rtest$ hadoop fs -put  2004-1000.csv airline/data04/


9. Run Jeffrey’s R code using the Hadoop rmr2 package:
$ export LD_LIBRARY_PATH=/usr/lib/hadoop-0.20-mapreduce/lib/native/Linux-amd64-64/
~/rtest$ cd hadoop-R/airline/src/deptdelay_by_month/R/rmr
$ cp deptdelay-rmr.R dd.R
$ vi dd.R
Replace the following:


#library(rmr)
library(rmr2)


...
#textinputformat = csvtextinputformat,
           input.format = "csv", # csvtextinputformat,
...
#from.dfs(deptdelay("/data/airline/1987.csv", "/dept-delay-month"))
from.dfs(deptdelay("airline/data04/2004-1000.csv", "dept-delay-month-orig"))
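
Putting the pieces together, the edited dd.R amounts to something like the sketch below. Treat it as an illustration of the rmr2 mapreduce() API rather than Jeffrey’s verbatim code; in the airline data, Year, Month, UniqueCarrier, and DepDelay are columns 1, 2, 9, and 16:

#!/usr/bin/env Rscript
library(rmr2)

deptdelay <- function(input, output) {
  mapreduce(
    input = input,
    output = output,
    input.format = "csv",   # comma-separated airline records
    map = function(k, v) {
      # v is a data frame of CSV columns; header rows parse to NA and are dropped
      delay <- suppressWarnings(as.numeric(as.character(v[[16]])))
      keep <- !is.na(delay)
      # key on year|month|carrier; value is the departure delay in minutes
      keyval(paste(v[[1]][keep], v[[2]][keep], v[[9]][keep], sep = "|"),
             delay[keep])
    },
    reduce = function(key, delays) {
      # average departure delay per (year, month, carrier)
      keyval(key, mean(delays))
    })
}

from.dfs(deptdelay("airline/data04/2004-1000.csv", "dept-delay-month-orig"))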


10. Run:
$ ./dd.R
$ hadoop fs -cat dept-delay-month-orig/part-00001 | tail
SEQ/org.apache.hadoop.typedbytes.TypedBytesWritable/org.apache.hadoop.typedbytes.TypedBytesWritable�'W5�l
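
The gibberish after SEQ/... is expected: rmr2 stores its results as binary TypedBytes sequence files, so hadoop fs -cat just dumps raw bytes. It is easier to read the results back from R, for example:

library(rmr2)
out <- from.dfs("dept-delay-month-orig")
# keys()/values() unpack the key-value pairs returned by from.dfs()
head(data.frame(key = keys(out), avg.delay = values(out)))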


== The following seems to work with rmr 1.3.1 but not with rmr2; you may still try it ==
9. Run Jeffrey’s R code using Hadoop streaming
~/rtest$ cd hadoop-R
~/rtest/hadoop-R$ cd airline/src/deptdelay_by_month/R/streaming/
~/rtest/hadoop-R/airline/src/deptdelay_by_month/R/streaming$ /usr/lib/hadoop-0.20-mapreduce/bin/hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.2.0.jar -input airline/data -output airline/out/dept-delay-month -mapper map.R -reducer reduce.R -file map.R -file reduce.R


$ export HADOOP_HOME=/usr/lib/hadoop-0.20-mapreduce/
~/rtest$ $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.2.0.jar -input airline/data04 -output airline/out/dept-delay-month -mapper map.R -reducer reduce.R -file map.R -file reduce.R
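
For orientation, the streaming version is plain R: the mapper reads flight records from stdin and writes tab-separated key/value lines, and Hadoop sorts by key before the reducer aggregates. A minimal mapper in that style might look like this (an illustrative sketch, not Jeffrey’s exact map.R):

#!/usr/bin/env Rscript
# Emit "year|month|carrier <TAB> DepDelay" for every flight record
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1)) > 0) {
  f <- unlist(strsplit(line, ","))
  # skip the header row and flights with no recorded departure delay
  if (f[1] != "Year" && f[16] != "NA") {
    cat(paste(f[1], f[2], f[9], sep = "|"), "\t", f[16], "\n", sep = "")
  }
}
close(con)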


10. When the streaming job completes successfully, its results are written to airline/out/dept-delay-month.


11. Read the data:
$ hadoop fs -cat airline/out/dept-delay-month/part-00000



