Thursday, June 04, 2015

How to set up Spark on EC2

  1. Set up AWS keys 
Follow how to set up Amazon AWS at [5]


export AWS_ACCESS_KEY_ID=xxxxxxxxx
export AWS_SECRET_ACCESS_KEY=xxxxxxxxxxxxxx

In case you need to download Spark [2] (see the example commands after this list):
The latest release of Spark is 1.3.1, released on April 17, 2015.
  1. Choose a Spark release: 1.3.1 (Apr 17 2015)
  2. Choose a package type (this post uses the pre-built package for Hadoop 2.6)
  3. Choose a download type: select Apache Mirror or Direct Download
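
If you do not already have a Spark package locally, one way to fetch the Hadoop 2.6 build used in this post is from the Apache archive (the exact download URL below is an assumption; any mirror listed on the download page works):

wget http://archive.apache.org/dist/spark/spark-1.3.1/spark-1.3.1-bin-hadoop2.6.tgz
tar xzf spark-1.3.1-bin-hadoop2.6.tgz
cd spark-1.3.1-bin-hadoop2.6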

  2. Generate a key pair at AWS as shown in [1]

    1. Download it to your local machine, ideally into the Spark ec2 directory (for ampcamp, any directory may work);
      1. otherwise, an SSH error will show up,
      2. for example, under the following directory: osboxes@osboxes:~/spark-1.3.1-bin-hadoop2.6/ec2
    2. Restrict the key file's permissions, as shown in the example after this list: chmod 400 [key.pem] worked for ampcamp3 but not for EC2 here;
      1. in that case, use chmod 600 [key.pem] instead
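
For example (a minimal sketch; the ~/.ssh/ampcamp3.pem path is only an assumption, use wherever you saved the key):

chmod 600 ~/.ssh/ampcamp3.pem
ls -l ~/.ssh/ampcamp3.pem     # should show -rw------- (owner read/write only)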

  3. Run the instances

osboxes@osboxes:~/spark-1.3.1-bin-hadoop2.6/ec2$ ./spark-ec2 --key-pair=ampcamp3 --identity-file=~/.ssh/ampcamp3.pem --region=us-east-1 --zone=us-east-1a --copy launch my-spark-cluster

Note: use -s (or --slaves) to set the number of slaves
osboxes@osboxes:~/proj/ampcamp3/training-scripts$ ./spark-ec2 -i ampcamp3.pem -k ampcamp3 -z us-east-1b -s 3 --copy launch amplab-training

4. Log in to master node

You need to go to AWS to find the master instance. Select the instance and choose "Connect", which shows the shell command to connect to it remotely.

Also, http://master_node:8080 should show the Spark master web UI.
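
For example (a sketch; the master's public DNS below is a placeholder to replace with the address shown in the AWS console, and spark-ec2 clusters are logged into as root):

ssh -i ~/.ssh/ampcamp3.pem root@<master-public-dns>

Alternatively, the spark-ec2 script's login action opens a shell on the master for you:

./spark-ec2 -k ampcamp3 -i ~/.ssh/ampcamp3.pem --region=us-east-1 login my-spark-cluster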


5. Run HDFS on EC2 so that HDFS can hold the data file that Spark needs to read
    [root@ip-10-232-51-182]$ cd /root/ephemeral-hdfs/bin
    [root@ip-10-232-51-182 bin]$ ./start-dfs.sh
    [root@ip-10-232-51-182 bin]$ ./hadoop fs -ls /
    [root@ip-10-232-51-182 bin]$ ./hadoop fs -put samplefile /user/myname

    Note: the master node's security group needs to allow inbound TCP on port 7077
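
    If that port is closed, you can open it from the AWS console (EC2 > Security Groups > Inbound rules) or with the AWS CLI. A sketch, assuming spark-ec2's default group naming for the cluster launched above (my-spark-cluster-master); 0.0.0.0/0 opens the port to everyone, so restrict the CIDR to your own IP in practice:

    aws ec2 authorize-security-group-ingress --group-name my-spark-cluster-master --protocol tcp --port 7077 --cidr 0.0.0.0/0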

    6. Run any example on the master node [4, 7]

    6.a wordcount for samplefile

    cd ~/spark-1.3.1-bin-hadoop2.6

    osboxes@osboxes:~/spark-1.3.1-bin-hadoop2.6/$ ./bin/pyspark
    >>> text_file = sc.textFile("/user/myname/samplefile")
    >>> counts = text_file.flatMap(lambda line: line.split(" ")) \
    ...                   .map(lambda word: (word, 1)) \
    ...                   .reduceByKey(lambda a, b: a + b)
    >>> counts.saveAsTextFile("hdfs://...")
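
    To check the output afterwards (a sketch; /user/myname/counts is only an example output path, substitute whatever you passed to saveAsTextFile), from the master node:

    $ /root/ephemeral-hdfs/bin/hadoop fs -ls /user/myname/counts
    $ /root/ephemeral-hdfs/bin/hadoop fs -cat /user/myname/counts/part-00000 | head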

    6.b Spark Example
    cd ~/spark-1.3.1-bin-hadoop2.6
    osboxes@osboxes:~/spark-1.3.1-bin-hadoop2.6/$ ./bin/spark-submit \
     --class org.apache.spark.examples.SparkPi \
     --master spark://54.205.231.93:7077 \
     --executor-memory 20G \
     --total-executor-cores 100 \
     ~/spark-1.3.1-bin-hadoop2.6/lib/spark-examples-1.3.1-hadoop2.6.0.jar   1000
    Or, to run the same example locally with 8 cores:
    ./bin/spark-submit \
     --class org.apache.spark.examples.SparkPi \
     --master local[8] \
     --executor-memory 20G \
     --total-executor-cores 100 \
     ~/spark/lib/spark-examples-1.3.1-hadoop1.0.4.jar 1000




    Reference
    1. https://spark.apache.org/docs/latest/submitting-application
    2. http://dal-cloudcomputing.blogspot.com/2013/04/create-aws-account-and-access-keys.html
    3. https://spark.apache.org/examples.html

    Tuesday, July 09, 2013

    Solr on CDH4 using Cloudera Manager

    Install Cloudera Manager on an AWS instance

    You can launch an AWS instance with CentOS 6.3 and ssh to it. Then, download the Cloudera Manager 4.5 installer and execute it on the remote instance:
    $ wget http://archive.cloudera.com/cm4/installer/latest/cloudera-manager-installer.bin
    $ chmod +x cloudera-manager-installer.bin
    $ sudo ./cloudera-manager-installer.bin
    

    (1) Follow the command-line installation, accepting the licenses.

    (2) Open Cloudera Manager's Web UI to launch the services; it may be shown as a local URL, but you can use the instance's public (global) URL instead.

    (3) All Services > Add Cluster > Continue on Cloudera Manager Express Wizard > CentOS 6.3; m1.xlarge; ...

    Note: >= m1.large recommended

    Stop unnecessary services

    Stop HBase, Hive, Oozie, Sqoop

    All services are installed under '/usr/lib/'

    For example, /usr/lib/solr, /usr/lib/zookeeper, ...

    Optional: Activate Solr service

    After launching Cloudera Manager and its instances as shown in whirr_cm [1], go to the 'Hosts' > 'Parcels' tab of the Cloudera Manager Web UI. There, you can download the latest available CDH, Solr, and Impala parcels.
    Download "SOLR 0.9.1-1.cdh4.3.0.p0.275" > Distribute > Activate > Restart the current Cluster

    Optional: download CDH 4.3.0-1

    Download "CDH 4.3.0-1.cdh4.3.0.p0.22" > Distribute > Activate > Restart the current Cluster
    Note: Restarting the cluster will take several minutes

    Add Solr service

    Actions (of Cluster 1 - CDH4) > Add a Service > Solr > Zookeeper as a dependency

    Open the Hue Web UI

    The default login/password is admin. You will see Solr; select it.

    Update Solr conf at a zookeeper node

    You can find the Solr configuration file at '/etc/default/solr' and update it as follows:
    sudo vi /etc/default/solr 
    
    Note: it may not recognize 'localhost', so use '127.0.0.1' instead
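
    For example, the ZooKeeper ensemble line typically looks like the following (a sketch assuming a single ZooKeeper node; replace the host with your ZooKeeper node's address):

    SOLR_ZK_ENSEMBLE=127.0.0.1:2181/solr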

    Create the /solr directory in HDFS:

    $ sudo -u hdfs hadoop fs -mkdir /solr
    $ sudo -u hdfs hadoop fs -chown solr /solr
    

    Create a collection

    Change to the root account and add Solr to ZooKeeper with solrctl init. From now on, the shell commands are run as the root user.
    $ sudo su
    $ solrctl init
    
    or
    $ solrctl init --force
    
    Then restart the Solr service from Cloudera Manager's Web UI.
    Run the following commands on a ZooKeeper node to create a collection:
    $ solrctl instancedir --generate $HOME/solr_configs
    $ solrctl instancedir --create collection $HOME/solr_configs
    $ solrctl collection --create collection -s 1
    
    While running 'solrctl collection ...', you may go to /var/log/solr and check whether Solr is running without errors:
    $ tail -f solr-cmf-solr1-SOLR_SERVER-ip-10-138-xx-xx.ec2.internal.log.out 
    
    Upload example data to Solr
    $ cd /usr/share/doc/solr-doc-4.3.0+52/example/exampledocs/
    $ java -Durl=http://127.0.0.1:8983/solr/collection/update -jar post.jar *.xml
    SimplePostTool version 1.5
    Posting files to base url http://127.0.0.1:8983/solr/collection/update using content-type application/xml..
    POSTing file gb18030-example.xml
    POSTing file hd.xml
    POSTing file ipod_other.xml
    ...
    POSTing file utf8-example.xml
    POSTing file vidcard.xml
    14 files indexed.
    COMMITting Solr index changes to http://127.0.0.1:8983/solr/collection/update..
    Time spent: 0:00:00.818
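
    Before moving to Hue, you can also verify the indexed documents directly against Solr's HTTP API (a sketch; the host and the collection name follow the commands above):

    $ curl 'http://127.0.0.1:8983/solr/collection/select?q=photo&wt=json&indent=true'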
    

    Query using Hue Web UI

    Open the Hue Web UI from Cloudera Manager's Hue service and select the Solr tab.
    1. Make sure to import collections (importing the core may not be needed).
    2. Select the "Search page" link at the top right of the Solr web UI page.
    3. By default, the page shows 1 - 15 of 32 results.
    4. Type 'photo' into the search box and it will show 1 - 2 of 2 results.

    Customize the view of Solr Web UI

    Select 'Customize this collection', which will present a visual editor for the view.
    Note: you can see the same content from https://github.com/hipic/cdh-solr

    References

    [1]. http://github.com/hipic/whirr_cm

    [2]. http://blog.cloudera.com/blog/2013/03/how-to-create-a-cdh-cluster-on-amazon-ec2-via-cloudera-manager/

    [3]. Managing Clusters with Cloudera Manager, http://www.cloudera.com/content/cloudera-content/cloudera-docs/CM4Ent/latest/PDF/Managing-Clusters-with-Cloudera-Manager.pdf

    [4]. Cloudera Search Installation Guide, http://www.cloudera.com/content/cloudera-content/cloudera-docs/Search/latest/Cloudera-Search-Installation-Guide

    [5]. Cloudera Search User Guide, http://www.cloudera.com/content/cloudera-content/cloudera-docs/Search/latest/Cloudera-Search-User-Guide

    [6]. Cloudera Search Installation Guide, http://www.cloudera.com/content/cloudera-content/cloudera-docs/Search/latest/PDF/Cloudera-Search-Installation-Guide.pdf

    Monday, July 01, 2013

    Hadoop points at file:/// not hdfs:///

    When I installed CDH 4.5 using Whirr 0.8.2, Hadoop pointed at file:/// instead of hdfs:///. Therefore, ssh to the client node (the target of the last ssh command Whirr prints). Then update Hadoop's core-site.xml, located at /etc/hadoop/conf, by adding its namenode:


    <property>
      <name>fs.defaultFS</name>
      <value>hdfs://hadoop-namenode:8020</value>
    </property>

    For example, if a name node's local IP address is 10.80.221.129:

    <property>
      <name>fs.defaultFS</name>
      <value>hdfs://10.80.221.129:8020</value>
    </property>

    Now you can see the HDFS directories and files. If you ssh to other nodes, you have to change other nodes’ core-site.xml too.
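
    After the change, a quick check from the same node should list HDFS directories rather than the local filesystem:

    $ hadoop fs -ls /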

    Reference
    2. http://stackoverflow.com/questions/16008486/after-installing-hadoop-via-cloudera-manager-4-5-hdfs-only-points-to-the-local-f

    Sunday, April 21, 2013

    R Hadoop example on EC2

     Note: you could use the following repository (https://github.com/hipic/r-hadoop) as an alternative source for the code and configuration files.

    1. download Jeffrey’s “R Hadoop Whirr”  tutorial using git

    $ git clone git://github.com/jeffreybreen/tutorial-201209-TDWI-big-data.git



    2. copy Jeffrey’s hadoop property files
    $ cp ~/tutorial-201209-TDWI-big-data/config/whirr-ec2/hadoop-ec2.properties ./recipes/


    3. run Cloudera Hadoop (CDH) using Whirr with Jeffrey’s hadoop-ec2.properties
    $ bin/whirr launch-cluster --config recipes/hadoop-ec2.properties  --private-key-file ~/.ssh/id_rsa_whirr


    4. open Jeffrey's script file 'install-r-CDH4.sh' and replace the following:
    #git clone https://github.com/RevolutionAnalytics/RHadoop.git
    # Replaced by Jongwook Woo
    git clone git://github.com/RevolutionAnalytics/RHadoop.git
    ...
    # add 'plyr'
    install.packages( c('Rcpp','RJSONIO', 'digest', 'functional','stringr', 'plyr'), repos="http://cran.revolutionanalytics.com", INSTALL_opts=c('--byte-compile') )

    5. run Jeffrey's script file using Whirr to install the R-related packages on all EC2 nodes:
    jongwook@ubuntu:~/tutorial-201209-TDWI-big-data/config$ ~/apache/whirr-0.8.1/bin/whirr run-script --script install-r-CDH4.sh --config ~/apache/whirr-0.8.1/recipes/hadoop-ec2.properties  --private-key-file ~/.ssh/id_rsa_whirr


    If successful, you will see the following, and rmr2 will be installed at /usr/lib64/R/library/:
    ** building package indices
    ** testing if installed package can be loaded


    * DONE (rmr2)
    Making packages.html  ... done



    6. At EC2 main node, test R with Hadoop:
    $ mkdir rtest
    $ cd rtest/
    $ git clone git://github.com/jeffreybreen/hadoop-R.git


    or
    $ git init
    Initialized empty Git repository in /home/users/jongwook/rtest/.git/
    $ git pull git://github.com/jeffreybreen/hadoop-R.git



    $ git clone git://github.com/ekohlwey/rhadoop-examples.git


    7. Download Airline schedule data:
    ~/rtest$ curl http://stat-computing.org/dataexpo/2009/2008.csv.bz2 | bzcat | head -1000 > air-2008.csv


    ~/rtest$ curl http://stat-computing.org/dataexpo/2009/2004.csv.bz2 | bzcat | head -1000 > 2004-1000.csv


    8. Make HDFS directories:
    ~/rtest$ hadoop fs -mkdir airline
    ~/rtest$ hadoop fs -ls /user
    Found 2 items
    drwxr-xr-x   - hdfs     supergroup          0 2013-04-20 20:32 /user/hive
    drwxr-xr-x   - jongwook supergroup          0 2013-04-20 21:30 /user/jongwook
    ~/rtest$ hadoop fs -ls /user/jongwook
    Found 1 items
    drwxr-xr-x   - jongwook supergroup          0 2013-04-20 21:30 /user/jongwook/airline
    ~/rtest$ hadoop fs -mkdir airline/data
    ~/rtest$ hadoop fs -ls airline
    Found 1 items
    drwxr-xr-x   - jongwook supergroup          0 2013-04-20 21:30 airline/data
    ~/rtest$ hadoop fs -mkdir airline/out
    ~/rtest$ hadoop fs -put  air-2008.csv airline/data/
    ~/rtest$ hadoop fs -mkdir airline/data04
    ~/rtest$ hadoop fs -put  2004-1000.csv airline/data04/


    9. Run Jeffrey’s R code using Hadoop rmr
    $ export LD_LIBRARY_PATH=/usr/lib/hadoop-0.20-mapreduce/lib/native/Linux-amd64-64/
    ~/rtest$ cd hadoop-R/airline/src/deptdelay_by_month/R/rmr
    $ cp deptdelay-rmr.R dd.R
    $ vi dd.R
    Replace the following:


    #library(rmr)
    library(rmr2)


    ...
    #textinputformat = csvtextinputformat,
               input.format = "csv", # csvtextinputformat,
    ...
    #from.dfs(deptdelay("/data/airline/1987.csv", "/dept-delay-month"))
    from.dfs(deptdelay("airline/data04/2004-1000.csv", "dept-delay-month-orig"))


    10. Run:
    $ ./dd.R
    $ hadoop fs -cat dept-delay-month-orig/part-00001 | tail
    SEQ/org.apache.hadoop.typedbytes.TypedBytesWritable/org.apache.hadoop.typedbytes.TypedBytesWritable�'W5�l

    Note: rmr2 writes its output as a binary SequenceFile of typed bytes, so 'hadoop fs -cat' prints mostly unreadable bytes; read the result back in R with from.dfs() instead.

    === The following seems to work with rmr 1.3.1 but not with rmr2; still, you may try it ===
    9. Run Jeffrey’s R code using Hadoop streaming
    ~/rtest$ cd hadoop-R
    ~/rtest/hadoop-R$ cd airline/src/deptdelay_by_month/R/streaming/
    ~/rtest/hadoop-R/airline/src/deptdelay_by_month/R/streaming$ /usr/lib/hadoop-0.20-mapreduce/bin/hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.2.0.jar -input airline/data -output airline/out/dept-delay-month -mapper map.R -reducer reduce.R -file map.R -file reduce.R


    $ export HADOOP_HOME=/usr/lib/hadoop-0.20-mapreduce/
    ~/rtest$ $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.2.0.jar -input airline/data04 -output airline/out/dept-delay-month -mapper map.R -reducer reduce.R -file map.R -file reduce.R


    10. You will see the following result:
    ?


    11. read the data
    $ hadoop fs -cat airline/out/dept-delay-month/part-00000



    Reference
    4. http://www.slideshare.net/jseidman/distributed-data-analysis-with-r-strangeloop-2011
