1. Download Jeffrey’s “R Hadoop Whirr” tutorial using git:
$ git clone git://github.com/jeffreybreen/tutorial-201209-TDWI-big-data.git
2. Copy Jeffrey’s Hadoop property file into Whirr’s recipes directory:
$ cp ~/tutorial-201209-TDWI-big-data/config/whirr-ec2/hadoop-ec2.properties ./recipes/
3. Launch a Cloudera Hadoop (CDH) cluster using Whirr with Jeffrey’s hadoop-ec2.properties:
$ bin/whirr launch-cluster --config recipes/hadoop-ec2.properties --private-key-file ~/.ssh/id_rsa_whirr
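Whirr needs the passphrase-less keypair named by --private-key-file. If ~/.ssh/id_rsa_whirr does not exist yet, it can be generated as in this sketch (written to the current directory here; move the pair to ~/.ssh/ for real use):

```shell
# Generate a passphrase-less RSA keypair for Whirr (-N '' = empty passphrase).
# The filename matches the --private-key-file flag used above.
ssh-keygen -q -t rsa -N '' -f id_rsa_whirr
ls id_rsa_whirr id_rsa_whirr.pub
```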
4. Open Jeffrey’s script file ‘install-r-CDH4.sh’ and replace the following lines:
#git clone https://github.com/RevolutionAnalytics/RHadoop.git
# Replaced by Jongwook Woo
git clone git://github.com/RevolutionAnalytics/RHadoop.git
...
# add 'plyr'
install.packages( c('Rcpp','RJSONIO', 'digest', 'functional','stringr', 'plyr'), repos="http://cran.revolutionanalytics.com", INSTALL_opts=c('--byte-compile') )
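The same two edits can also be applied with sed instead of a hand edit. A minimal sketch, demonstrated on a small stand-in file (the patterns assume the original lines appear verbatim in install-r-CDH4.sh):

```shell
# demo.sh stands in for install-r-CDH4.sh, holding the two original lines.
cat > demo.sh <<'EOF'
git clone https://github.com/RevolutionAnalytics/RHadoop.git
install.packages( c('Rcpp','RJSONIO', 'digest', 'functional','stringr'), repos="http://cran.revolutionanalytics.com", INSTALL_opts=c('--byte-compile') )
EOF
# Edit 1: switch the clone URL from https:// to git://.
# Edit 2: append 'plyr' to the package list.
sed -i \
  -e 's|clone https://|clone git://|' \
  -e "s|'stringr')|'stringr', 'plyr')|" \
  demo.sh
cat demo.sh
```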
5. Run Jeffrey’s script with Whirr to install R and the related packages on all EC2 nodes:
jongwook@ubuntu:~/tutorial-201209-TDWI-big-data/config$ ~/apache/whirr-0.8.1/bin/whirr run-script --script install-r-CDH4.sh --config ~/apache/whirr-0.8.1/recipes/hadoop-ec2.properties --private-key-file ~/.ssh/id_rsa_whirr
If successful, you will see output like the following, and rmr2 will be installed at /usr/lib64/R/library/:
** building package indices
** testing if installed package can be loaded
* DONE (rmr2)
Making packages.html ... done
6. On the EC2 master node, test R with Hadoop:
$ mkdir rtest
$ cd rtest/
$ git clone git://github.com/jeffreybreen/hadoop-R.git
or
$ git init
Initialized empty Git repository in /home/users/jongwook/rtest/.git/
$ git pull git://github.com/jeffreybreen/hadoop-R.git
Another set of examples is available at https://github.com/ekohlwey/rhadoop-examples:
$ git clone git://github.com/ekohlwey/rhadoop-examples.git
7. Download the airline on-time data, keeping only the first 1,000 rows of each file (the header row is included in the count):
~/rtest$ curl http://stat-computing.org/dataexpo/2009/2008.csv.bz2 | bzcat | head -1000 > air-2008.csv
~/rtest$ curl http://stat-computing.org/dataexpo/2009/2004.csv.bz2 | bzcat | head -1000 > 2004-1000.csv
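The pipeline above decompresses on the fly, so the full file never has to be stored uncompressed. The same shape, sketched locally on a few made-up CSV rows (head -2 here versus head -1000 above):

```shell
# Compress a few CSV rows, then decompress on the fly and keep only the
# first 2 lines (header + 1 data row).
printf 'Year,Month,DayofMonth\n2008,1,3\n2008,1,4\n2008,1,5\n' | bzip2 > sample.csv.bz2
bzcat sample.csv.bz2 | head -2 > sample-head.csv
cat sample-head.csv
```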
8. Make the HDFS directories and upload the data:
~/rtest$ hadoop fs -mkdir airline
~/rtest$ hadoop fs -ls /user
Found 2 items
drwxr-xr-x - hdfs supergroup 0 2013-04-20 20:32 /user/hive
drwxr-xr-x - jongwook supergroup 0 2013-04-20 21:30 /user/jongwook
~/rtest$ hadoop fs -ls /user/jongwook
Found 1 items
drwxr-xr-x - jongwook supergroup 0 2013-04-20 21:30 /user/jongwook/airline
~/rtest$ hadoop fs -mkdir airline/data
~/rtest$ hadoop fs -ls airline
Found 1 items
drwxr-xr-x - jongwook supergroup 0 2013-04-20 21:30 airline/data
~/rtest$ hadoop fs -mkdir airline/out
~/rtest$ hadoop fs -put air-2008.csv airline/data/
~/rtest$ hadoop fs -mkdir airline/data04
~/rtest$ hadoop fs -put 2004-1000.csv airline/data04/
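The directory layout built above, mimicked on the local filesystem so the shape is easy to see; on the cluster these are ‘hadoop fs -mkdir’ and ‘hadoop fs -put’ against HDFS instead, and the .csv files here are empty stand-ins:

```shell
# Local sketch of the HDFS layout: one input dir per dataset, plus an
# output dir for job results.
mkdir -p airline/data airline/data04 airline/out
touch air-2008.csv 2004-1000.csv
cp air-2008.csv airline/data/
cp 2004-1000.csv airline/data04/
find airline | sort
```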
9. Run Jeffrey’s R code using the rmr2 package:
$ export LD_LIBRARY_PATH=/usr/lib/hadoop-0.20-mapreduce/lib/native/Linux-amd64-64/
~/rtest$ cd hadoop-R/airline/src/deptdelay_by_month/R/rmr
$ cp deptdelay-rmr.R dd.R
$ vi dd.R
Replace the following:
#library(rmr)
library(rmr2)
...
#textinputformat = csvtextinputformat,
input.format = "csv", # csvtextinputformat,
...
#from.dfs(deptdelay("/data/airline/1987.csv", "/dept-delay-month"))
from.dfs(deptdelay("airline/data04/2004-1000.csv", "dept-delay-month-orig"))
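What the job computes, sketched with awk on a tiny stand-in file: the mean departure delay grouped by year and month. Column positions follow the dataexpo schema (1 = Year, 2 = Month, 16 = DepDelay); 'NA' delays are skipped:

```shell
# mini.csv: three made-up rows in the airline dataexpo column layout.
cat > mini.csv <<'EOF'
Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay
2008,1,3,4,2003,1955,2211,2225,WN,335,N712SW,128,150,116,-14,8
2008,1,3,4,754,735,1002,1000,WN,3231,N772SW,128,145,113,2,12
2008,2,3,4,628,620,804,750,WN,448,N428WN,96,90,76,14,NA
EOF
# Skip the header (NR > 1) and NA delays; average DepDelay per Year,Month.
awk -F, 'NR > 1 && $16 != "NA" { sum[$1","$2] += $16; n[$1","$2]++ }
         END { for (k in sum) printf "%s,%.1f\n", k, sum[k]/n[k] }' mini.csv
# prints: 2008,1,10.0
```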
10. Run the script:
$ ./dd.R
$ hadoop fs -cat dept-delay-month-orig/part-00001 | tail
SEQ/org.apache.hadoop.typedbytes.TypedBytesWritable/org.apache.hadoop.typedbytes.TypedBytesWritable�'W5�l
Note that rmr2 writes its output as a binary Hadoop sequence file (TypedBytes), which is why ‘hadoop fs -cat’ shows raw bytes; to get the values back in R, read the result with from.dfs() instead.
=== The following works with rmr 1.3.1 but does not seem to work with rmr2; still, you may try it ===
9. Run Jeffrey’s R code using Hadoop Streaming:
~/rtest$ cd hadoop-R
~/rtest/hadoop-R$ cd airline/src/deptdelay_by_month/R/streaming/
~/rtest/hadoop-R/airline/src/deptdelay_by_month/R/streaming$ /usr/lib/hadoop-0.20-mapreduce/bin/hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.2.0.jar -input airline/data -output airline/out/dept-delay-month -mapper map.R -reducer reduce.R -file map.R -file reduce.R
$ export HADOOP_HOME=/usr/lib/hadoop-0.20-mapreduce/
~/rtest$ $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.2.0.jar -input airline/data04 -output airline/out/dept-delay-month -mapper map.R -reducer reduce.R -file map.R -file reduce.R
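Streaming jobs can be smoke-tested without the cluster: pipe the input through the mapper, sort it (standing in for the shuffle), then pipe it through the reducer. A sketch with tiny awk scripts in place of map.R/reduce.R, using the same tab-separated key/value contract (the input rows and field layout here are made up for illustration):

```shell
# Stand-in input rows: year,month,delay
printf '2008,1,8\n2008,1,12\n2008,2,5\n' > delays.csv
# mapper: emit "year-month<TAB>delay"; sort plays the role of the shuffle;
# reducer: mean delay per key, relying on sorted input to detect key changes.
awk -F, '{ print $1"-"$2"\t"$3 }' delays.csv | sort |
  awk -F'\t' '{ if ($1 != k) { if (NR > 1) print k"\t"s/n; k = $1; s = 0; n = 0 }
                s += $2; n++ }
              END { print k"\t"s/n }'
```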
10. When the job completes, the result is written to airline/out/dept-delay-month.
11. Read the output:
$ hadoop fs -cat airline/out/dept-delay-month/part-00000
Reference
4. http://www.slideshare.net/jseidman/distributed-data-analysis-with-r-strangeloop-2011