Thursday, June 04, 2015

How to set up Spark on EC2

  1. Set up AWS keys
Follow the guide on how to set up Amazon AWS at [5], then export the access keys in your shell:


export AWS_ACCESS_KEY_ID=xxxxxxxxx
export AWS_SECRET_ACCESS_KEY=xxxxxxxxxxxxxx
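
These exports only last for the current shell. A minimal sketch (assuming a bash login shell) for making them permanent:

# append the AWS credentials to ~/.bashrc so new shells pick them up automatically
echo 'export AWS_ACCESS_KEY_ID=xxxxxxxxx' >> ~/.bashrc
echo 'export AWS_SECRET_ACCESS_KEY=xxxxxxxxxxxxxx' >> ~/.bashrc
source ~/.bashrc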

In case you still need to download Spark [2]:
The latest release of Spark is Spark 1.3.1, released on April 17, 2015 (release notes) (git tag).
  1. Choose a Spark release: 1.3.1 (Apr 17 2015)
  2. Choose a package type, e.g. Pre-built for Hadoop 2.6 and later
  3. Choose a download type: select Apache Mirror or Direct Download (see the example commands below)
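
For reference, a sketch of fetching and unpacking the pre-built Hadoop 2.6 package from the command line; the archive.apache.org URL is an assumption, any Apache mirror works:

# download the pre-built Spark 1.3.1 / Hadoop 2.6 package and unpack it (mirror URL assumed)
wget https://archive.apache.org/dist/spark/spark-1.3.1/spark-1.3.1-bin-hadoop2.6.tgz
tar xzf spark-1.3.1-bin-hadoop2.6.tgz
cd spark-1.3.1-bin-hadoop2.6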

  2. Generate a key pair at AWS as shown in [1]

    1. Download it to the local machine, into the ec2 directory of the Spark distribution (for the ampcamp training scripts any directory may work); otherwise an SSH error will show up.
       For example, under the following directory: osboxes@osboxes:~/spark-1.3.1-bin-hadoop2.6/ec2
    2. chmod 400 [key.pem] was used for ampcamp3 but did not work for EC2 here; use chmod 600 [key.pem] instead (see the sketch after this list).
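
    A minimal sketch of putting the downloaded key in place, assuming it was saved as ampcamp3.pem in ~/Downloads:

    # move the key somewhere predictable and restrict its permissions so SSH accepts it
    mv ~/Downloads/ampcamp3.pem ~/.ssh/ampcamp3.pem
    chmod 600 ~/.ssh/ampcamp3.pem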

  3. Run instances

osboxes@osboxes:~/spark-1.3.1-bin-hadoop2.6/ec2$ ./spark-ec2 --key-pair=ampcamp3 --identity-file=~/.ssh/ampcamp3.pem --region=us-east-1 --zone=us-east-1a --copy launch my-spark-cluster

Note: -s (or --slaves) sets the number of slaves, as in the ampcamp training-scripts example:
osboxes@osboxes:~/proj/ampcamp3/training-scripts$ ./spark-ec2 -i ampcamp3.pem -k ampcamp3 -z us-east-1b -s 3 --copy launch amplab-training
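
The same spark-ec2 script also manages the cluster lifecycle. A sketch of the other common actions, reusing the flags from the launch command above:

# log in to, stop, restart, or tear down the cluster created above
./spark-ec2 --key-pair=ampcamp3 --identity-file=~/.ssh/ampcamp3.pem --region=us-east-1 login my-spark-cluster
./spark-ec2 --region=us-east-1 stop my-spark-cluster
./spark-ec2 --key-pair=ampcamp3 --identity-file=~/.ssh/ampcamp3.pem --region=us-east-1 start my-spark-cluster
./spark-ec2 --region=us-east-1 destroy my-spark-cluster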

4. Log in to the master node

You need to go to the AWS console to find the master instance. Select the instance and choose "Connect", which shows the shell command to connect to it remotely.
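
Alternatively, SSH in directly with the key pair. A sketch assuming the master's public DNS name is copied from the EC2 console (the spark-ec2 AMI logs in as root):

# connect to the master node; replace <master_public_dns> with the value from the console
ssh -i ~/.ssh/ampcamp3.pem root@<master_public_dns>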

Also, http://master_node:8080 should show the Spark standalone master web UI.


5. Run HDFS on EC2 so that HDFS holds the data file that Spark needs to read
    [root@ip-10-232-51-182 ~]$ cd /root/ephemeral-hdfs/bin
    [root@ip-10-232-51-182 bin]$ ./start-dfs.sh
    [root@ip-10-232-51-182 bin]$ ./hadoop fs -ls /
    [root@ip-10-232-51-182 bin]$ ./hadoop fs -put samplefile /user/myname

    Note: the security group for the master node needs to be open for TCP port 7077
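
    To confirm the file actually landed in HDFS before running the job, it can be listed and sampled from the same bin directory (same /user/myname path as above):

    # verify the upload to the ephemeral HDFS
    ./hadoop fs -ls /user/myname
    ./hadoop fs -cat /user/myname/samplefile | head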

    6. Run any example at the master node [4, 7]

    6.a wordcount for samplefile

    cd ~/spark-1.3.1-bin-hadoop2.6

    (The interactive shell is started with ./bin/pyspark, which provides the SparkContext as sc; spark-submit is for packaged, non-interactive applications.)

    osboxes@osboxes:~/spark-1.3.1-bin-hadoop2.6/$ ./bin/pyspark
    >>> text_file = sc.textFile("/user/myname/samplefile")
    >>> counts = text_file.flatMap(lambda line: line.split(" ")) \
    ...              .map(lambda word: (word, 1)) \
    ...              .reduceByKey(lambda a, b: a + b)
    >>> counts.saveAsTextFile("hdfs://...")
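
    Once saveAsTextFile finishes, the output directory holds part-NNNNN files that can be inspected from the ephemeral-hdfs bin directory. The path below is only a hypothetical stand-in for whatever was passed to saveAsTextFile:

    # list and peek at the word-count output (output path is hypothetical)
    ./hadoop fs -ls /user/myname/wordcount-output
    ./hadoop fs -cat /user/myname/wordcount-output/part-00000 | head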

    6.b Spark Example
    cd ~/spark-1.3.1-bin-hadoop2.6
    osboxes@osboxes:~/spark-1.3.1-bin-hadoop2.6/$ ./bin/spark-submit \
     --class org.apache.spark.examples.SparkPi \
     --master spark://54.205.231.93:7077 \
     --executor-memory 20G \
     --total-executor-cores 100 \
     ~/spark-1.3.1-bin-hadoop2.6/lib/spark-examples-1.3.1-hadoop2.6.0.jar   1000
    Or run the same example locally on 8 cores (the executor memory/cores flags from the cluster run are dropped since they do not apply in local mode):
    ./bin/spark-submit \
     --class org.apache.spark.examples.SparkPi \
     --master local[8] \
     ~/spark/lib/spark-examples-1.3.1-hadoop1.0.4.jar 1000
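
    SparkPi prints its estimate to the driver's stdout. A quick way to pick it out of the log noise, using the same local run with a smaller argument for a faster check:

    # the example prints a line like "Pi is roughly 3.14..."; logging goes to stderr by default
    ./bin/spark-submit \
     --class org.apache.spark.examples.SparkPi \
     --master local[8] \
     ~/spark/lib/spark-examples-1.3.1-hadoop1.0.4.jar 100 2>/dev/null | grep "Pi is roughly"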




    References
    1. https://spark.apache.org/docs/latest/submitting-applications.html
    2. http://dal-cloudcomputing.blogspot.com/2013/04/create-aws-account-and-access-keys.html
    3. https://spark.apache.org/examples.html
