1. Download Jeffrey’s “R Hadoop Whirr” tutorial using git:
$ git clone git://github.com/jeffreybreen/tutorial-201209-TDWI-big-data.git
2. Copy Jeffrey’s Hadoop properties file into Whirr’s recipes directory:
$ cp ~/tutorial-201209-TDWI-big-data/config/whirr-ec2/hadoop-ec2.properties ./recipes/
3. Launch a Cloudera Hadoop (CDH) cluster on EC2 using Whirr with Jeffrey’s hadoop-ec2.properties:
$ bin/whirr launch-cluster --config recipes/hadoop-ec2.properties --private-key-file ~/.ssh/id_rsa_whirr
4. Open Jeffrey’s script file ‘install-r-CDH4.sh’ and make the following replacements:
#git clone https://github.com/RevolutionAnalytics/RHadoop.git
# Replaced by Jongwook Woo
git clone git://github.com/RevolutionAnalytics/RHadoop.git
...
# add 'plyr'
install.packages( c('Rcpp','RJSONIO', 'digest', 'functional','stringr', 'plyr'), repos="http://cran.revolutionanalytics.com", INSTALL_opts=c('--byte-compile') )
5. Run Jeffrey’s script file using Whirr to install R and the RHadoop packages on all EC2 nodes:
jongwook@ubuntu:~/tutorial-201209-TDWI-big-data/config$ ~/apache/whirr-0.8.1/bin/whirr run-script --script install-r-CDH4.sh --config ~/apache/whirr-0.8.1/recipes/hadoop-ec2.properties --private-key-file ~/.ssh/id_rsa_whirr
If successful, you will see output like the following, and rmr2 will be installed in /usr/lib64/R/library/:
** building package indices
** testing if installed package can be loaded
* DONE (rmr2)
Making packages.html ... done
6. On the EC2 master node, test R with Hadoop:
$ mkdir rtest
$ cd rtest/
$ git clone git://github.com/jeffreybreen/hadoop-R.git
or, alternatively, initialize an empty repository and pull:
$ git init
Initialized empty Git repository in /home/users/jongwook/rtest/.git/
$ git pull git://github.com/jeffreybreen/hadoop-R.git
Another example is available at https://github.com/ekohlwey/rhadoop-examples:
$ git clone git://github.com/ekohlwey/rhadoop-examples.git
7. Download the airline on-time schedule data, keeping only the first 1,000 rows of each file for a quick test:
~/rtest$ curl http://stat-computing.org/dataexpo/2009/2008.csv.bz2 | bzcat | head -1000 > air-2008.csv
~/rtest$ curl http://stat-computing.org/dataexpo/2009/2004.csv.bz2 | bzcat | head -1000 > 2004-1000.csv
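Since only the first 1,000 rows of each file are kept, a quick local sanity check (no Hadoop required; the file names below are made up for illustration) confirms how the head -1000 truncation behaves:

```shell
# Emulate the curl ... | bzcat | head -1000 pipeline above with local data:
# generate a throwaway 5,000-row CSV, then keep only the first 1,000 rows.
seq 1 5000 | awk '{print $1",DUMMY,2008"}' > sample.csv
head -1000 sample.csv > sample-1000.csv
wc -l < sample-1000.csv   # should report 1000
```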
8. Make HDFS directories:
~/rtest$ hadoop fs -mkdir airline
~/rtest$ hadoop fs -ls /user
Found 2 items
drwxr-xr-x - hdfs supergroup 0 2013-04-20 20:32 /user/hive
drwxr-xr-x - jongwook supergroup 0 2013-04-20 21:30 /user/jongwook
~/rtest$ hadoop fs -ls /user/jongwook
Found 1 items
drwxr-xr-x - jongwook supergroup 0 2013-04-20 21:30 /user/jongwook/airline
~/rtest$ hadoop fs -mkdir airline/data
~/rtest$ hadoop fs -ls airline
Found 1 items
drwxr-xr-x - jongwook supergroup 0 2013-04-20 21:30 airline/data
~/rtest$ hadoop fs -mkdir airline/out
~/rtest$ hadoop fs -put air-2008.csv airline/data/
~/rtest$ hadoop fs -mkdir airline/data04
~/rtest$ hadoop fs -put 2004-1000.csv airline/data04/
9. Run Jeffrey’s R code using the Hadoop rmr2 package
$ export LD_LIBRARY_PATH=/usr/lib/hadoop-0.20-mapreduce/lib/native/Linux-amd64-64/
~/rtest$ cd hadoop-R/airline/src/deptdelay_by_month/R/rmr
$ cp deptdelay-rmr.R dd.R
$ vi dd.R
Replace the following:
#library(rmr)
library(rmr2)
...
#textinputformat = csvtextinputformat,
input.format = "csv", # csvtextinputformat,
...
#from.dfs(deptdelay("/data/airline/1987.csv", "/dept-delay-month"))
from.dfs(deptdelay("airline/data04/2004-1000.csv", "dept-delay-month-orig"))
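Alternatively, the three edits above can be scripted with sed rather than made by hand in vi. This is a sketch that assumes deptdelay-rmr.R still contains Jeffrey’s original lines verbatim (GNU sed’s -i flag edits the file in place):

```shell
cp deptdelay-rmr.R dd.R
# load the rmr2 package instead of the old rmr
sed -i 's/^library(rmr)$/library(rmr2)/' dd.R
# rmr2 replaced the textinputformat argument with input.format
sed -i 's/textinputformat = csvtextinputformat,/input.format = "csv", # csvtextinputformat,/' dd.R
# point the job at the 1,000-row test file uploaded in step 8
sed -i 's|"/data/airline/1987.csv", "/dept-delay-month"|"airline/data04/2004-1000.csv", "dept-delay-month-orig"|' dd.R
```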
10. Run:
$ ./dd.R
$ hadoop fs -cat dept-delay-month-orig/part-00001 | tail
SEQ/org.apache.hadoop.typedbytes.TypedBytesWritable/org.apache.hadoop.typedbytes.TypedBytesWritable...
Note that the output is a binary Hadoop sequence file, so hadoop fs -cat prints mostly unreadable bytes; the human-readable results are the data returned by the from.dfs() call inside dd.R.
== The following worked with rmr 1.3.1 but does not seem to work with rmr2; you may still try it ==
9. Run Jeffrey’s R code using Hadoop streaming
~/rtest$ cd hadoop-R
~/rtest/hadoop-R$ cd airline/src/deptdelay_by_month/R/streaming/
~/rtest/hadoop-R/airline/src/deptdelay_by_month/R/streaming$ /usr/lib/hadoop-0.20-mapreduce/bin/hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.2.0.jar -input airline/data -output airline/out/dept-delay-month -mapper map.R -reducer reduce.R -file map.R -file reduce.R
$ export HADOOP_HOME=/usr/lib/hadoop-0.20-mapreduce/
If the output directory already exists from the previous run, remove it first with ‘hadoop fs -rm -r airline/out/dept-delay-month’, since Hadoop refuses to overwrite an existing output directory. Then:
~/rtest$ $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.2.0.jar -input airline/data04 -output airline/out/dept-delay-month -mapper map.R -reducer reduce.R -file map.R -file reduce.R
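Because Hadoop streaming simply pipes records through the mapper and reducer via stdin/stdout, the same dataflow can be smoke-tested locally without a cluster. The awk mapper/reducer below are toy stand-ins for illustration, not Jeffrey’s map.R/reduce.R:

```shell
# Hadoop streaming is conceptually: cat input | mapper | sort | reducer.
# Toy mapper: emit "key<TAB>1" per input line; toy reducer: sum counts per key.
printf 'a\nb\na\na\n' \
  | awk '{print $1"\t1"}' \
  | sort \
  | awk -F'\t' '{sum[$1]+=$2} END {for (k in sum) print k"\t"sum[k]}' \
  | sort
# prints:
# a	3
# b	1
```

The real job replaces the toy scripts with -mapper map.R -reducer reduce.R and reads its input from HDFS instead of stdin.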
10. If the job completes successfully, Hadoop prints the job progress and counters to the console.
11. Read the output data:
$ hadoop fs -cat airline/out/dept-delay-month/part-00000
Reference
4. http://www.slideshare.net/jseidman/distributed-data-analysis-with-r-strangeloop-2011