4-Node Hadoop + Spark Environment Setup (Hadoop 2.7.3 + Spark 2.1)

    1. Introduction

    Vagrant project that creates a cluster of four 64-bit CentOS 7 Linux virtual machines with Hadoop v2.7.3 and Spark v2.1.
    Minimum RAM Required: 4GB
    head : HDFS NameNode + Spark Master
    body : YARN ResourceManager + JobHistoryServer + ProxyServer
    slave1 : HDFS DataNode + YARN NodeManager + Spark Slave
    slave2 : HDFS DataNode + YARN NodeManager + Spark Slave

    2. Minimum System Requirements

    At least 1GB of memory for each VM node. The default script creates 4 nodes, so you need 4GB for the nodes in addition to the memory used by your host machine.
    Vagrant 1.9.2 and VirtualBox 5.1.14 (use the exact versions specified to avoid compatibility issues).
    Preserve the Unix/OSX end-of-line (EOL) characters when cloning this project; the scripts will fail with Windows EOL characters (a git setting that avoids this is sketched after this section). The project is tested on a CentOS 7.2 host OS; it has not been tested with the VMware provider for Vagrant.
    The Vagrant box is downloaded to the ~/.vagrant.d/boxes directory. On Windows, this is C:/Users/{your-username}/.vagrant.d/boxes.
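
    If you clone on Windows, one way to preserve the Unix EOL characters mentioned above is to turn off git's CRLF conversion before cloning; a minimal sketch (the repository URL is a placeholder):
    # Check files out with LF endings so the provisioning scripts run unmodified
    git config --global core.autocrlf input
    git clone <repository-url>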

    3. Installation Steps

    Download and install VirtualBox.
    Download and install Vagrant.
    Git clone this project, and change directory (cd) into the cluster directory.
    Download Hadoop 2.7.3 into the /resources directory.
    Download Spark 2.1 into the /resources directory (example download commands are sketched after this list).
    Run vagrant up to create the VMs.
    Run vagrant ssh head to log into the head VM.
    Run vagrant destroy when you want to destroy and remove the VMs.
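
    As referenced in the download steps above, a sketch of fetching the two archives from the Apache archive into the resources directory (the URLs are the standard archive locations for these versions; the exact file names expected by the provisioning scripts may differ, so check the project's scripts):
    cd resources
    # Hadoop 2.7.3 binary distribution
    wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz
    # Spark 2.1.0 pre-built for Hadoop 2.7
    wget https://archive.apache.org/dist/spark/spark-2.1.0/spark-2.1.0-bin-hadoop2.7.tgz
    cd ..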

    4. Post Provisioning

    After you have provisioned the cluster, you need to run some commands to initialize your Hadoop cluster.
    SSH into head using the command vagrant ssh head.
    The commands below require root permissions. Switch to root using sudo su, or create a new user and grant it the necessary permissions if you want non-root access. In that case, you'll need to do this on all the VMs.

    Format the HDFS NameNode by issuing the following command.
    $HADOOP_PREFIX/bin/hdfs namenode -format myhadoop

    Start Hadoop Daemons (HDFS + YARN)
    SSH into head and issue the following commands to start HDFS.
    vagrant ssh head
    To start the NameNode:
    $HADOOP_PREFIX/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR --script hdfs start namenode
    To start the DataNodes:
    $HADOOP_PREFIX/sbin/hadoop-daemons.sh --config $HADOOP_CONF_DIR --script hdfs start datanode
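
    To confirm that HDFS is up, a quick check (jps and dfsadmin are standard JDK/Hadoop tools, assumed to be on the PATH as the variables above suggest):
    # List running Java daemons; NameNode should appear on head, DataNode on the slaves
    jps
    # Report the DataNodes that have registered with the NameNode
    $HADOOP_PREFIX/bin/hdfs dfsadmin -report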

    SSH into body and issue the following commands to start YARN.
    vagrant ssh body
    To start the ResourceManager:
    $HADOOP_YARN_HOME/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR start resourcemanager
    To start the NodeManagers:
    $HADOOP_YARN_HOME/sbin/yarn-daemons.sh --config $HADOOP_CONF_DIR start nodemanager
    To start the proxy server:
    $HADOOP_YARN_HOME/sbin/yarn-daemon.sh start proxyserver --config $HADOOP_CONF_DIR
    To start the JobHistory server:
    $HADOOP_PREFIX/sbin/mr-jobhistory-daemon.sh start historyserver --config $HADOOP_CONF_DIR
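
    Once the YARN daemons are running, a quick sanity check from body (yarn node -list is a standard YARN CLI command):
    # Should show one NodeManager per slave in RUNNING state
    yarn node -list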

    Test YARN
    Run the following command to make sure you can run a MapReduce job.
    vagrant ssh body

    yarn jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar pi 2 100
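
    Besides the pi estimator, the wordcount example from the same jar is another common smoke test; the HDFS paths below are arbitrary choices for illustration:
    # Put a small input file into HDFS, run wordcount, and print the result
    hdfs dfs -mkdir -p /tmp/wc-in
    hdfs dfs -put /etc/hosts /tmp/wc-in/
    yarn jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount /tmp/wc-in /tmp/wc-out
    hdfs dfs -cat /tmp/wc-out/part-r-00000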

    Start Spark in Standalone Mode
    SSH into head and issue the following command.
    vagrant ssh head

    $SPARK_HOME/sbin/start-all.sh
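
    To verify that the standalone master and workers started, check the Java processes on each node (jps ships with the JDK); the daemon logs are written to $SPARK_HOME/logs by default:
    # On head: expect a Master process; on slave1/slave2: expect a Worker process
    jps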

    Test Spark on YARN
    You can test whether Spark can run on YARN by issuing the following command. Do not run this command on the slave nodes.
    vagrant ssh head

    $SPARK_HOME/bin/spark-submit --class org.apache.spark.examples.SparkPi \
        --master yarn-cluster \
        --num-executors 10 \
        --executor-cores 2 \
        $SPARK_HOME/examples/jars/spark-examples*.jar \
        100
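
    Note that in Spark 2.x the yarn-cluster master string is deprecated (though still accepted); the equivalent recommended form is:
    $SPARK_HOME/bin/spark-submit --class org.apache.spark.examples.SparkPi \
        --master yarn \
        --deploy-mode cluster \
        --num-executors 10 \
        --executor-cores 2 \
        $SPARK_HOME/examples/jars/spark-examples*.jar \
        100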

    Test Spark using Shell
    Start the Spark shell using the following command. Do not run this command on the slave nodes.
    vagrant ssh head

    $SPARK_HOME/bin/spark-shell --master spark://head:7077
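
    As a quick non-interactive check, you can also pipe a trivial job into the shell (the parallelized range count here is an arbitrary example):
    # Runs a small job against the standalone master, prints the count, and exits
    echo 'sc.parallelize(1 to 1000).count()' | $SPARK_HOME/bin/spark-shell --master spark://head:7077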

    5. Web UI

    You can check the following URLs to monitor the Hadoop and Spark daemons.
    [NameNode](http://localhost:50070/dfshealth.html)

    [ResourceManager](http://localhost:18088/cluster)

    [JobHistory](http://localhost:19888/jobhistory)

    [Spark](http://localhost:8080)
