Thursday, June 04, 2015

How to set up Spark on EC2

  1. Set up AWS keys 
Follow how to set up Amazon AWS at [5]


export AWS_ACCESS_KEY_ID=xxxxxxxxx
export AWS_SECRET_ACCESS_KEY=xxxxxxxxxxxxxx

In case you need to download Spark [2] (see the example commands after this list):
The latest release of Spark is 1.3.1, released on April 17, 2015.
  1. Choose a Spark release: 1.3.1 (Apr 17 2015)
  2. Choose a package type (this post uses the pre-built package for Hadoop 2.6)
  3. Choose a download type: select Apache Mirror or Direct Download
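
If you do not already have a Spark package locally, one way to fetch the Hadoop 2.6 build used in this post is from the Apache archive (the exact download URL below is an assumption; any mirror listed on the download page works):

wget http://archive.apache.org/dist/spark/spark-1.3.1/spark-1.3.1-bin-hadoop2.6.tgz
tar xzf spark-1.3.1-bin-hadoop2.6.tgz
cd spark-1.3.1-bin-hadoop2.6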

  2. Generate a key pair at AWS as shown in [1]

    1. Download it to your local machine, ideally into the Spark ec2 directory (for ampcamp, any directory may work);
      1. otherwise, an SSH error will show up,
      2. for example, under the following directory: osboxes@osboxes:~/spark-1.3.1-bin-hadoop2.6/ec2
    2. Restrict the key file's permissions, as shown in the example after this list: chmod 400 [key.pem] worked for ampcamp3 but not for EC2 here;
      1. in that case, use chmod 600 [key.pem] instead
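
For example (a minimal sketch; the ~/.ssh/ampcamp3.pem path is only an assumption, use wherever you saved the key):

chmod 600 ~/.ssh/ampcamp3.pem
ls -l ~/.ssh/ampcamp3.pem     # should show -rw------- (owner read/write only)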

  3. Run the instances

osboxes@osboxes:~/spark-1.3.1-bin-hadoop2.6/ec2$ ./spark-ec2 --key-pair=ampcamp3 --identity-file=~/.ssh/ampcamp3.pem --region=us-east-1 --zone=us-east-1a --copy launch my-spark-cluster

Note: use -s (or --slaves) to set the number of slaves
osboxes@osboxes:~/proj/ampcamp3/training-scripts$ ./spark-ec2 -i ampcamp3.pem -k ampcamp3 -z us-east-1b -s 3 --copy launch amplab-training

4. Log in to master node

You need to go to AWS to find the master instance. Select the instance and choose "Connect", which shows the shell command to connect to it remotely.

Also, http://master_node:8080 should show the Spark master web UI.
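
For example (a sketch; the master's public DNS below is a placeholder to replace with the address shown in the AWS console, and spark-ec2 clusters are logged into as root):

ssh -i ~/.ssh/ampcamp3.pem root@<master-public-dns>

Alternatively, the spark-ec2 script's login action opens a shell on the master for you:

./spark-ec2 -k ampcamp3 -i ~/.ssh/ampcamp3.pem --region=us-east-1 login my-spark-cluster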


5. Run HDFS on EC2 so that HDFS can hold the data file that Spark needs to read
    [root@ip-10-232-51-182]$ cd /root/ephemeral-hdfs/bin
    [root@ip-10-232-51-182 bin]$ ./start-dfs.sh
    [root@ip-10-232-51-182 bin]$ ./hadoop fs -ls /
    [root@ip-10-232-51-182 bin]$ ./hadoop fs -put samplefile /user/myname

    Note: the master node's security group needs to allow inbound TCP on port 7077
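
    If that port is closed, you can open it from the AWS console (EC2 > Security Groups > Inbound rules) or with the AWS CLI. A sketch, assuming spark-ec2's default group naming for the cluster launched above (my-spark-cluster-master); 0.0.0.0/0 opens the port to everyone, so restrict the CIDR to your own IP in practice:

    aws ec2 authorize-security-group-ingress --group-name my-spark-cluster-master --protocol tcp --port 7077 --cidr 0.0.0.0/0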

    6. Run any example on the master node [4, 7]

    6.a wordcount for samplefile

    cd ~/spark-1.3.1-bin-hadoop2.6

    osboxes@osboxes:~/spark-1.3.1-bin-hadoop2.6/$ ./bin/pyspark
    >>> text_file = sc.textFile("/user/myname/samplefile")
    >>> counts = text_file.flatMap(lambda line: line.split(" ")) \
    ...                   .map(lambda word: (word, 1)) \
    ...                   .reduceByKey(lambda a, b: a + b)
    >>> counts.saveAsTextFile("hdfs://...")
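
    To check the output afterwards (a sketch; /user/myname/counts is only an example output path, substitute whatever you passed to saveAsTextFile), from the master node:

    $ /root/ephemeral-hdfs/bin/hadoop fs -ls /user/myname/counts
    $ /root/ephemeral-hdfs/bin/hadoop fs -cat /user/myname/counts/part-00000 | head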

    6.b Spark Example
    cd ~/spark-1.3.1-bin-hadoop2.6
    osboxes@osboxes:~/spark-1.3.1-bin-hadoop2.6/$ ./bin/spark-submit \
     --class org.apache.spark.examples.SparkPi \
     --master spark://54.205.231.93:7077 \
     --executor-memory 20G \
     --total-executor-cores 100 \
     ~/spark-1.3.1-bin-hadoop2.6/lib/spark-examples-1.3.1-hadoop2.6.0.jar   1000
    Or, to run the same example locally with 8 cores:
    ./bin/spark-submit \
     --class org.apache.spark.examples.SparkPi \
     --master local[8] \
     --executor-memory 20G \
     --total-executor-cores 100 \
     ~/spark/lib/spark-examples-1.3.1-hadoop1.0.4.jar 1000




    Reference
    1. https://spark.apache.org/docs/latest/submitting-application
    2. http://dal-cloudcomputing.blogspot.com/2013/04/create-aws-account-and-access-keys.html
    3. https://spark.apache.org/examples.html

    Tuesday, July 09, 2013

    Solr on CDH4 using Cloudera Manager

    Install Cloudera Manager on an AWS instance

    You can launch an AWS instance with CentOS 6.3 and ssh to it. Then, download the Cloudera Manager 4.5 installer and execute it on the remote instance:
    $ wget http://archive.cloudera.com/cm4/installer/latest/cloudera-manager-installer.bin
    $ chmod +x cloudera-manager-installer.bin
    $ sudo ./cloudera-manager-installer.bin
    

    (1) Follow the command-line installation, accepting the licenses.

    (2) Open Cloudera Manager's Web UI to launch the services; it may be shown as a local URL, but you can use the instance's public (global) URL instead.

    (3) All Services > Add Cluster > Continue on Cloudera Manager Express Wizard > CentOS 6.3; m1.xlarge; ...

    Note: >= m1.large recommended

    Stop unnecessary services

    Stop HBase, Hive, Oozie, Sqoop

    All services are installed under '/usr/lib/'

    For example, /usr/lib/solr, /usr/lib/zookeeper, ...

    Optional: Activate Solr service

    After launching Cloudera Manager and its instances as shown in whirr_cm [1], go to the 'Hosts' > 'Parcels' tab of the Cloudera Manager Web UI. There, you can download the latest available CDH, Solr, and Impala parcels.
    Download "SOLR 0.9.1-1.cdh4.3.0.p0.275" > Distribute > Activate > Restart the current Cluster

    Optional: download CDH 4.3.0-1

    Download "CDH 4.3.0-1.cdh4.3.0.p0.22" > Distribute > Activate > Restart the current Cluster
    Note: Restarting the cluster will take several minutes

    Add Solr service

    Actions (of Cluster 1 - CDH4) > Add a Service > Solr > Zookeeper as a dependency

    Open the Hue Web UI

    The default login/password is admin. You will see Solr; select it.

    Update Solr conf at a zookeeper node

    You can find the Solr configuration file at '/etc/default/solr' and update it as follows:
    sudo vi /etc/default/solr 
    
    Note: it may not recognize 'localhost', so use '127.0.0.1' instead
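
    For example, the ZooKeeper ensemble line typically looks like the following (a sketch assuming a single ZooKeeper node; replace the host with your ZooKeeper node's address):

    SOLR_ZK_ENSEMBLE=127.0.0.1:2181/solr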

    Create the /solr directory in HDFS:

    $ sudo -u hdfs hadoop fs -mkdir /solr
    $ sudo -u hdfs hadoop fs -chown solr /solr
    

    Create a collection

    Change to the root account and add Solr to ZooKeeper with solrctl init. From now on, the shell commands are run as the root user.
    $ sudo su
    $ solrctl init
    
    or
    $ solrctl init --force
    
    Then restart the Solr service from Cloudera Manager's Web UI.
    Run the following commands on a ZooKeeper node to create a collection:
    $ solrctl instancedir --generate $HOME/solr_configs
    $ solrctl instancedir --create collection $HOME/solr_configs
    $ solrctl collection --create collection -s 1
    
    While running 'solrctl collection ...', you may go to /var/log/solr and check whether Solr is running without errors:
    $ tail -f solr-cmf-solr1-SOLR_SERVER-ip-10-138-xx-xx.ec2.internal.log.out 
    
    Upload example data to Solr
    $ cd /usr/share/doc/solr-doc-4.3.0+52/example/exampledocs/
    $ java -Durl=http://127.0.0.1:8983/solr/collection/update -jar post.jar *.xml
    SimplePostTool version 1.5
    Posting files to base url http://127.0.0.1:8983/solr/collection/update using content-type application/xml..
    POSTing file gb18030-example.xml
    POSTing file hd.xml
    POSTing file ipod_other.xml
    ...
    POSTing file utf8-example.xml
    POSTing file vidcard.xml
    14 files indexed.
    COMMITting Solr index changes to http://127.0.0.1:8983/solr/collection/update..
    Time spent: 0:00:00.818
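
    Before moving to Hue, you can also verify the indexed documents directly against Solr's HTTP API (a sketch; the host and the collection name follow the commands above):

    $ curl 'http://127.0.0.1:8983/solr/collection/select?q=photo&wt=json&indent=true'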
    

    Query using Hue Web UI

    Open the Hue Web UI from Cloudera Manager's Hue service and select the Solr tab.
    1. Make sure to import collections (importing the core may not be needed).
    2. Select the "Search page" link at the top right of the Solr web UI page.
    3. By default, the page shows 1 - 15 of 32 results.
    4. Type 'photo' into the search box and it will show 1 - 2 of 2 results.

    Customize the view of Solr Web UI

    Select 'Customize this collection', which will present a visual editor for the view.
    Note: you can see the same content from https://github.com/hipic/cdh-solr

    References

    [1]. http://github.com/hipic/whirr_cm

    [2]. http://blog.cloudera.com/blog/2013/03/how-to-create-a-cdh-cluster-on-amazon-ec2-via-cloudera-manager/

    [3]. Managing Clusters with Cloudera Manager, http://www.cloudera.com/content/cloudera-content/cloudera-docs/CM4Ent/latest/PDF/Managing-Clusters-with-Cloudera-Manager.pdf

    [4]. Cloudera Search Installation Guide, http://www.cloudera.com/content/cloudera-content/cloudera-docs/Search/latest/Cloudera-Search-Installation-Guide

    [5]. Cloudera Search User Guide, http://www.cloudera.com/content/cloudera-content/cloudera-docs/Search/latest/Cloudera-Search-User-Guide

    [6]. Cloudera Search Installation Guide, http://www.cloudera.com/content/cloudera-content/cloudera-docs/Search/latest/PDF/Cloudera-Search-Installation-Guide.pdf

    Monday, July 01, 2013

    Hadoop points at file:/// not hdfs:///

    When I installed CDH 4.5 using Whirr 0.8.2, Hadoop pointed at file:/// instead of hdfs:///. Therefore, ssh to the client node (the target of the last ssh command Whirr prints). Then update Hadoop's core-site.xml, located at /etc/hadoop/conf, by adding its namenode:


    <property>
      <name>fs.defaultFS</name>
      <value>hdfs://hadoop-namenode:8020</value>
    </property>

    For example, if a name node's local IP address is 10.80.221.129:

    <property>
      <name>fs.defaultFS</name>
      <value>hdfs://10.80.221.129:8020</value>
    </property>

    Now you can see the HDFS directories and files. If you ssh to other nodes, you have to change other nodes’ core-site.xml too.
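
    After the change, a quick check from the same node should list HDFS directories rather than the local filesystem:

    $ hadoop fs -ls /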

    Reference
    2. http://stackoverflow.com/questions/16008486/after-installing-hadoop-via-cloudera-manager-4-5-hdfs-only-points-to-the-local-f

    Sunday, April 21, 2013

    R Hadoop example on EC2

     Note: you could use the following repository (https://github.com/hipic/r-hadoop) as an alternative source for the code and configuration files.

    1. download Jeffrey’s “R Hadoop Whirr”  tutorial using git

    $ git clone git://github.com/jeffreybreen/tutorial-201209-TDWI-big-data.git



    2. copy Jeffrey’s hadoop property files
    $ cp ~/tutorial-201209-TDWI-big-data/config/whirr-ec2/hadoop-ec2.properties ./recipes/


    3. run Cloudera Hadoop (CDH) using Whirr with Jeffrey’s hadoop-ec2.properties
    $ bin/whirr launch-cluster --config recipes/hadoop-ec2.properties  --private-key-file ~/.ssh/id_rsa_whirr


    4. open Jeffrey's script file 'install-r-CDH4.sh' and replace the following:
    #git clone https://github.com/RevolutionAnalytics/RHadoop.git
    # Replaced by Jongwook Woo
    git clone git://github.com/RevolutionAnalytics/RHadoop.git
    ...
    # add 'plyr'
    install.packages( c('Rcpp','RJSONIO', 'digest', 'functional','stringr', 'plyr'), repos="http://cran.revolutionanalytics.com", INSTALL_opts=c('--byte-compile') )

    5. run Jeffrey's script file using Whirr to install the R-related packages on all EC2 nodes:
    jongwook@ubuntu:~/tutorial-201209-TDWI-big-data/config$ ~/apache/whirr-0.8.1/bin/whirr run-script --script install-r-CDH4.sh --config ~/apache/whirr-0.8.1/recipes/hadoop-ec2.properties  --private-key-file ~/.ssh/id_rsa_whirr


    If successful, you will see the following, and rmr2 will be installed at /usr/lib64/R/library/:
    ** building package indices
    ** testing if installed package can be loaded


    * DONE (rmr2)
    Making packages.html  ... done



    6. At EC2 main node, test R with Hadoop:
    $ mkdir rtest
    $ cd rtest/
    $ git clone git://github.com/jeffreybreen/hadoop-R.git


    or
    $ git init
    Initialized empty Git repository in /home/users/jongwook/rtest/.git/
    $ git pull git://github.com/jeffreybreen/hadoop-R.git



    $ git clone git://github.com/ekohlwey/rhadoop-examples.git


    7. Download Airline schedule data:
    ~/rtest$ curl http://stat-computing.org/dataexpo/2009/2008.csv.bz2 | bzcat | head -1000 > air-2008.csv


    ~/rtest$ curl http://stat-computing.org/dataexpo/2009/2004.csv.bz2 | bzcat | head -1000 > 2004-1000.csv


    8. Make HDFS directories:
    ~/rtest$ hadoop fs -mkdir airline
    ~/rtest$ hadoop fs -ls /user
    Found 2 items
    drwxr-xr-x   - hdfs     supergroup          0 2013-04-20 20:32 /user/hive
    drwxr-xr-x   - jongwook supergroup          0 2013-04-20 21:30 /user/jongwook
    ~/rtest$ hadoop fs -ls /user/jongwook
    Found 1 items
    drwxr-xr-x   - jongwook supergroup          0 2013-04-20 21:30 /user/jongwook/airline
    ~/rtest$ hadoop fs -mkdir airline/data
    ~/rtest$ hadoop fs -ls airline
    Found 1 items
    drwxr-xr-x   - jongwook supergroup          0 2013-04-20 21:30 airline/data
    ~/rtest$ hadoop fs -mkdir airline/out
    ~/rtest$ hadoop fs -put  air-2008.csv airline/data/
    ~/rtest$ hadoop fs -mkdir airline/data04
    ~/rtest$ hadoop fs -put  2004-1000.csv airline/data04/


    9. Run Jeffrey’s R code using Hadoop rmr
    $ export LD_LIBRARY_PATH=/usr/lib/hadoop-0.20-mapreduce/lib/native/Linux-amd64-64/
    ~/rtest$ cd hadoop-R/airline/src/deptdelay_by_month/R/rmr
    $ cp deptdelay-rmr.R dd.R
    $ vi dd.R
    Replace the following:


    #library(rmr)
    library(rmr2)


    ...
    #textinputformat = csvtextinputformat,
               input.format = "csv", # csvtextinputformat,
    ...
    #from.dfs(deptdelay("/data/airline/1987.csv", "/dept-delay-month"))
    from.dfs(deptdelay("airline/data04/2004-1000.csv", "dept-delay-month-orig"))


    10. Run:
    $ ./dd.R
    $ hadoop fs -cat dept-delay-month-orig/part-00001 | tail
    SEQ/org.apache.hadoop.typedbytes.TypedBytesWritable/org.apache.hadoop.typedbytes.TypedBytesWritable�'W5�l

    Note: rmr2 writes its output as a binary SequenceFile of typed bytes, so 'hadoop fs -cat' prints mostly unreadable bytes; read the result back in R with from.dfs() instead.

    === The following seems to work with rmr 1.3.1 but not with rmr2; still, you may try it ===
    9. Run Jeffrey’s R code using Hadoop streaming
    ~/rtest$ cd hadoop-R
    ~/rtest/hadoop-R$ cd airline/src/deptdelay_by_month/R/streaming/
    ~/rtest/hadoop-R/airline/src/deptdelay_by_month/R/streaming$ /usr/lib/hadoop-0.20-mapreduce/bin/hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.2.0.jar -input airline/data -output airline/out/dept-delay-month -mapper map.R -reducer reduce.R -file map.R -file reduce.R


    $ export HADOOP_HOME=/usr/lib/hadoop-0.20-mapreduce/
    ~/rtest$ $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.2.0.jar -input airline/data04 -output airline/out/dept-delay-month -mapper map.R -reducer reduce.R -file map.R -file reduce.R


    10. You will see the following result:
    ?


    11. read the data
    $ hadoop fs -cat airline/out/dept-delay-month/part-00000



    Reference
    4. http://www.slideshare.net/jseidman/distributed-data-analysis-with-r-strangeloop-2011
