Thursday, June 04, 2015

How to set up Spark on EC2

  1. Set up AWS keys 
Follow the Amazon AWS account setup guide at [5], then export the access keys in your shell:


export AWS_ACCESS_KEY_ID=xxxxxxxxx
export AWS_SECRET_ACCESS_KEY=xxxxxxxxxxxxxx
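
spark-ec2 picks these two variables up from the environment when it talks to the EC2 API, so they must be set in the same shell that launches the cluster. A minimal way to make them persistent, assuming a bash login shell, is to append them to ~/.bashrc (the key values are placeholders):

# persist the keys for new shells; replace the placeholders with your real keys
echo 'export AWS_ACCESS_KEY_ID=xxxxxxxxx' >> ~/.bashrc
echo 'export AWS_SECRET_ACCESS_KEY=xxxxxxxxxxxxxx' >> ~/.bashrc
source ~/.bashrc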

In case you need to download Spark [2]:
The latest release of Spark is 1.3.1, released on April 17, 2015.
  1. Choose a Spark release: 1.3.1 (Apr 17 2015)
  2. Choose a package type (this post uses the pre-built package for Hadoop 2.6 and later)
  3. Choose a download type: select Apache Mirror or Direct Download
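
If you prefer the command line, the same package can be pulled from the Apache release archive; this is a sketch assuming 1.3.1 is still kept under the usual archive path:

wget https://archive.apache.org/dist/spark/spark-1.3.1/spark-1.3.1-bin-hadoop2.6.tgz
tar xzf spark-1.3.1-bin-hadoop2.6.tgz
cd spark-1.3.1-bin-hadoop2.6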

  2. Generate a key pair at AWS as shown in [1] (a CLI alternative is sketched after this list)

    1. Download the .pem file to the local machine, e.g. into the ec2 directory of the Spark distribution (for the ampcamp scripts, any directory may work)
      1. Otherwise an SSH error will show up when launching the cluster,
      2. for example, place it under the following directory: osboxes@osboxes:~/spark-1.3.1-bin-hadoop2.6/ec2
    2. chmod 400 [key.pem] worked for ampcamp3 but not for EC2
      1. in that case, use chmod 600 [key.pem] 
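
If the AWS CLI is installed and configured, the key pair can also be created from the shell. A sketch (the key name ampcamp3 and the path ~/.ssh/ampcamp3.pem are simply chosen to match the launch command below):

# create the key pair and save the private key locally
aws ec2 create-key-pair --key-name ampcamp3 --query 'KeyMaterial' --output text > ~/.ssh/ampcamp3.pem
chmod 600 ~/.ssh/ampcamp3.pem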

  3. Run instances

osboxes@osboxes:~/spark-1.3.1-bin-hadoop2.6/ec2$ ./spark-ec2 --key-pair=ampcamp3 --identity-file=~/.ssh/ampcamp3.pem --region=us-east-1 --zone=us-east-1a --copy launch my-spark-cluster

Note: use -s (or --slaves) to set the number of slaves, for example in the ampcamp training scripts:
osboxes@osboxes:~/proj/ampcamp3/training-scripts$ ./spark-ec2 -i ampcamp3.pem -k ampcamp3 -z us-east-1b -s 3 --copy launch amplab-training
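
The same spark-ec2 script also manages the cluster after launch. The actions below should be available in the 1.3.1 version of the script (a sketch; check ./spark-ec2 --help for the exact options):

./spark-ec2 -k ampcamp3 -i ~/.ssh/ampcamp3.pem --region=us-east-1 login my-spark-cluster
./spark-ec2 --region=us-east-1 stop my-spark-cluster
./spark-ec2 -k ampcamp3 -i ~/.ssh/ampcamp3.pem --region=us-east-1 start my-spark-cluster
./spark-ec2 --region=us-east-1 destroy my-spark-cluster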

4. Log in to the master node

You need to go to the AWS console to find the master instance. Select the instance and choose "Connect", which shows the shell command to connect to it remotely.

Then http://master_node:8080 should show the Spark master web UI.
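
The "Connect" dialog boils down to a plain SSH login; clusters launched by spark-ec2 use the root account, so something like the following should work (the hostname is a placeholder for your master's public DNS name, and get-master is run from the same ec2 directory used to launch):

# print the master hostname as recorded by spark-ec2
./spark-ec2 --region=us-east-1 get-master my-spark-cluster
# log in as root using the key pair from the launch step
ssh -i ~/.ssh/ampcamp3.pem root@ec2-xx-xx-xx-xx.compute-1.amazonaws.com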


5. Run HDFS on EC2 when HDFS holds the data file that Spark needs to read
    [root@ip-10-232-51-182 ~]$ cd /root/ephemeral-hdfs/bin
    [root@ip-10-232-51-182 bin]$ ./start-dfs.sh
    [root@ip-10-232-51-182 bin]$ ./hadoop fs -ls /
    [root@ip-10-232-51-182 bin]$ ./hadoop fs -put samplefile /user/myname

    Note: the security group for the master node needs to be open for TCP port 7077, so that jobs can be submitted to the Spark master from outside the cluster.
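
    If the -put command complains that the target directory does not exist, create it first and then verify the upload (myname is just the placeholder used above):

    [root@ip-10-232-51-182 bin]$ ./hadoop fs -mkdir /user/myname
    [root@ip-10-232-51-182 bin]$ ./hadoop fs -put samplefile /user/myname
    [root@ip-10-232-51-182 bin]$ ./hadoop fs -ls /user/myname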

6. Run an example against the cluster [4, 7]

    6.a Word count for samplefile

    cd ~/spark-1.3.1-bin-hadoop2.6

    osboxes@osboxes:~/spark-1.3.1-bin-hadoop2.6$ ./bin/pyspark --master spark://54.205.231.93:7077
    >>> text_file = sc.textFile("/user/myname/samplefile")
    >>> counts = text_file.flatMap(lambda line: line.split(" ")) \
    ...                   .map(lambda word: (word, 1)) \
    ...                   .reduceByKey(lambda a, b: a + b)
    >>> counts.saveAsTextFile("hdfs://...")     
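
    The same word count can also be packaged as a standalone script and run with spark-submit instead of the interactive shell. This is a minimal sketch; the file name wordcount.py and the output path are assumptions to adapt to your cluster:

    # wordcount.py -- minimal PySpark word count (Spark 1.3-era RDD API)
    from pyspark import SparkContext

    sc = SparkContext(appName="WordCount")
    text_file = sc.textFile("/user/myname/samplefile")        # input file loaded into HDFS in step 5
    counts = (text_file.flatMap(lambda line: line.split(" "))
                       .map(lambda word: (word, 1))
                       .reduceByKey(lambda a, b: a + b))
    counts.saveAsTextFile("/user/myname/samplefile_counts")   # assumed output directory
    sc.stop()

    It would be submitted the same way as the SparkPi example below, e.g. ./bin/spark-submit --master spark://54.205.231.93:7077 wordcount.py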

    6.b SparkPi example
    cd ~/spark-1.3.1-bin-hadoop2.6
    osboxes@osboxes:~/spark-1.3.1-bin-hadoop2.6$ ./bin/spark-submit \
     --class org.apache.spark.examples.SparkPi \
     --master spark://54.205.231.93:7077 \
     --executor-memory 20G \
     --total-executor-cores 100 \
     ~/spark-1.3.1-bin-hadoop2.6/lib/spark-examples-1.3.1-hadoop2.6.0.jar 1000

    Or run it locally with 8 threads (the executor options are not needed in local mode):
    ./bin/spark-submit \
     --class org.apache.spark.examples.SparkPi \
     --master local[8] \
     ~/spark/lib/spark-examples-1.3.1-hadoop1.0.4.jar 1000
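
    If the examples sources are included in your download, the same Pi computation is also available as a Python script, which avoids the versioned jar path (a sketch assuming the standard 1.3.1 layout):

    ./bin/spark-submit \
     --master spark://54.205.231.93:7077 \
     examples/src/main/python/pi.py 1000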




    References
    1. https://spark.apache.org/docs/latest/submitting-application
    2. http://dal-cloudcomputing.blogspot.com/2013/04/create-aws-account-and-access-keys.html
    3. https://spark.apache.org/examples.html
