4 Node Hadoop Spark Environment Setup (Hadoop 2.7.3 + Spark 2.1)
1. Introduction
Vagrant project to create a cluster of four 64-bit CentOS 7 Linux virtual machines with Hadoop v2.7.3 and Spark v2.1.
Minimum RAM Required: 4GB
head : HDFS NameNode + Spark Master
body : YARN ResourceManager + JobHistoryServer + ProxyServer
slave1 : HDFS DataNode + YARN NodeManager + Spark Slave
slave2 : HDFS DataNode + YARN NodeManager + Spark Slave
2. Minimum System Requirements
At least 1GB of memory for each VM node. The default script creates 4 nodes, so you need 4GB for the nodes in addition to the memory for your host machine.
Vagrant 1.9.2 and VirtualBox 5.1.14 (use the exact versions specified to avoid compatibility issues).
Preserve the Unix/OSX end-of-line (EOL) characters while cloning this project; scripts will fail with Windows EOL characters.
The project is tested on a CentOS 7.2 host OS; it has not been tested with the VMware provider for Vagrant.
The Vagrant box is downloaded to the ~/.vagrant.d/boxes directory. On Windows, this is C:/Users/{your-username}/.vagrant.d/boxes.
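The EOL requirement above can be checked and fixed with a short shell snippet. This is just a sketch (it operates on a demo file it creates itself), not part of the project's scripts; run the same `tr` conversion on any cloned script that ends up with Windows line endings:

```shell
# Simulate a script cloned with Windows (CRLF) line endings
printf 'echo hello\r\n' > demo.sh
# Strip the carriage returns (a portable equivalent of dos2unix)
tr -d '\r' < demo.sh > demo.fixed && mv demo.fixed demo.sh
# Verify no CR bytes remain
od -c demo.sh | grep -q '\\r' && echo "still CRLF" || echo "clean LF"
```

Git users can instead set `core.autocrlf` to `input` (or `false`) before cloning so the files never acquire CRLF endings in the first place.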
3. Installation Steps
Download and install VirtualBox.
Download and install Vagrant.
Git clone this project and change directory (cd) into the cluster directory.
Download Hadoop 2.7.3 into the resources directory of the cloned project.
Download Spark 2.1 into the resources directory of the cloned project.
Run vagrant up to create the four VMs.
Run vagrant ssh head to get into your VM.
Run vagrant destroy when you want to tear down and delete the VMs.
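Steps 4 and 5 can be scripted. The URLs below are the standard Apache archive locations for these releases; the exact tarball names the provisioning scripts expect are an assumption here, so verify them against the scripts before downloading (the snippet only prints the URLs unless you uncomment the wget line):

```shell
# Sketch: fetch the two tarballs into resources/ (run from the project root).
# The artifact names assumed here may differ from what the provisioning
# scripts expect -- check first.
HADOOP_VER=2.7.3
SPARK_VER=2.1.0
HADOOP_URL="https://archive.apache.org/dist/hadoop/common/hadoop-${HADOOP_VER}/hadoop-${HADOOP_VER}.tar.gz"
SPARK_URL="https://archive.apache.org/dist/spark/spark-${SPARK_VER}/spark-${SPARK_VER}-bin-hadoop2.7.tgz"
mkdir -p resources
echo "would fetch: $HADOOP_URL"
echo "would fetch: $SPARK_URL"
# Uncomment to actually download:
# wget -P resources "$HADOOP_URL" "$SPARK_URL"
```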
4. Post Provisioning
After you have provisioned the cluster, you need to run some commands to initialize your Hadoop cluster.
SSH into head using the command vagrant ssh head.
Commands below require root permissions.
Change to root using sudo su, or create a new user and grant it permissions if you want non-root access. In that case, you'll need to do this on each VM.
Issue the following command to format the NameNode (one-time initialization).
$HADOOP_PREFIX/bin/hdfs namenode -format myhadoop
Start Hadoop Daemons (HDFS + YARN)
SSH into head and issue the following commands to start HDFS.
vagrant ssh head
To start namenode
$HADOOP_PREFIX/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR --script hdfs start namenode
To start datanode
$HADOOP_PREFIX/sbin/hadoop-daemons.sh --config $HADOOP_CONF_DIR --script hdfs start datanode
SSH into body and issue the following commands to start YARN.
vagrant ssh body
To start resource manager
$HADOOP_YARN_HOME/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR start resourcemanager
To start node manager
$HADOOP_YARN_HOME/sbin/yarn-daemons.sh --config $HADOOP_CONF_DIR start nodemanager
To start proxy server
$HADOOP_YARN_HOME/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR start proxyserver
To start job history server
$HADOOP_PREFIX/sbin/mr-jobhistory-daemon.sh --config $HADOOP_CONF_DIR start historyserver
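The whole start-up sequence above can be collected into one sketch. The DRY_RUN guard (an addition, not part of the project) just prints each command, since $HADOOP_PREFIX and friends only exist inside the VMs; unset it on head/body to execute for real:

```shell
#!/bin/sh
# Sketch of the full daemon start-up order. With DRY_RUN=1 (the default here)
# the commands are only printed; inside the VMs, set DRY_RUN=0 to run them.
DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" = 1 ]; then echo "+ $*"; else "$@"; fi; }

# On head: HDFS daemons
run "$HADOOP_PREFIX/sbin/hadoop-daemon.sh"  --config "$HADOOP_CONF_DIR" --script hdfs start namenode
run "$HADOOP_PREFIX/sbin/hadoop-daemons.sh" --config "$HADOOP_CONF_DIR" --script hdfs start datanode

# On body: YARN and history daemons
run "$HADOOP_YARN_HOME/sbin/yarn-daemon.sh"  --config "$HADOOP_CONF_DIR" start resourcemanager
run "$HADOOP_YARN_HOME/sbin/yarn-daemons.sh" --config "$HADOOP_CONF_DIR" start nodemanager
run "$HADOOP_YARN_HOME/sbin/yarn-daemon.sh"  --config "$HADOOP_CONF_DIR" start proxyserver
run "$HADOOP_PREFIX/sbin/mr-jobhistory-daemon.sh" --config "$HADOOP_CONF_DIR" start historyserver
```

Note the HDFS commands belong on head and the YARN/history commands on body, matching the roles listed in the introduction.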
Test YARN
Run the following commands to make sure you can run a MapReduce job; the pi example launches 2 map tasks with 100 samples each.
vagrant ssh body
yarn jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar pi 2 100
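Under the hood, the pi example estimates π by sampling points in the unit square and counting how many land inside the quarter circle. A local, plain Monte Carlo sketch of the same idea in awk (the Hadoop example actually uses a quasi-random Halton sequence rather than rand(), but the principle is identical):

```shell
# Local Monte Carlo estimate of pi, mirroring what the Hadoop example
# distributes across map tasks: sample n points in the unit square and
# count those inside the quarter circle; the fraction times 4 approaches pi.
PI_EST=$(awk 'BEGIN {
  srand(1); n = 100000; inside = 0
  for (i = 0; i < n; i++) { x = rand(); y = rand(); if (x*x + y*y <= 1.0) inside++ }
  printf "%.4f", 4 * inside / n
}')
echo "pi ~= $PI_EST"
```

More samples (the second argument to the yarn job) tighten the estimate, which is why the distributed version is a natural MapReduce demo.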
Start Spark in Standalone Mode
SSH into head and issue the following command.
vagrant ssh head
$SPARK_HOME/sbin/start-all.sh
Test Spark on YARN
You can test whether Spark runs on YARN by issuing the following command. Try NOT to run this command on the slave nodes. (Spark 2.x deprecates the yarn-cluster master string; --master yarn --deploy-mode cluster is the equivalent modern form, though yarn-cluster still works in 2.1.)
vagrant ssh head
$SPARK_HOME/bin/spark-submit --class org.apache.spark.examples.SparkPi \
--master yarn-cluster \
--num-executors 10 \
--executor-cores 2 \
$SPARK_HOME/examples/jars/spark-examples*.jar \
100
Test Spark using Shell
Start the Spark shell using the following command. Try NOT to run this command on the slave nodes.
vagrant ssh head
$SPARK_HOME/bin/spark-shell --master spark://head:7077
5. Web UI
You can check the following URLs to monitor the Hadoop daemons.
[NameNode](http://localhost:50070/dfshealth.html)
[ResourceManager](http://localhost:18088/cluster)
[JobHistory](http://localhost:19888/jobhistory)
[Spark](http://localhost:8080)
View Source Code (GitHub)