GetStarted_yarn

Running CaffeOnSpark on Hadoop YARN Cluster

  1. Clone CaffeOnSpark code.
git clone https://github.com/yahoo/CaffeOnSpark.git --recursive
export CAFFE_ON_SPARK=$(pwd)/CaffeOnSpark
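
Since the clone uses --recursive, the caffe-public submodule should already be populated. A quick sanity check (caffe-public should contain the upstream Caffe tree):

ls ${CAFFE_ON_SPARK}/caffe-public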
  2. Install the Caffe prerequisites per http://caffe.berkeleyvision.org/installation.html

  3. Create CaffeOnSpark/caffe-public/Makefile.config

pushd ${CAFFE_ON_SPARK}/caffe-public/
cp Makefile.config.example Makefile.config
echo "INCLUDE_DIRS += ${JAVA_HOME}/include" >> Makefile.config
popd

Uncomment settings as needed:

CPU_ONLY := 1  # uncomment if you have no GPU (CPU-only build)
USE_CUDNN := 1 # uncomment if you want to use cuDNN
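
If you prefer a non-interactive edit, a sed one-liner can uncomment the CPU-only flag. This is just a convenience sketch; it assumes GNU sed and the stock Makefile.config.example layout:

pushd ${CAFFE_ON_SPARK}/caffe-public/
sed -i 's/^# CPU_ONLY := 1/CPU_ONLY := 1/' Makefile.config
popd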
  4. Build CaffeOnSpark
pushd ${CAFFE_ON_SPARK}
make build
popd
export LD_LIBRARY_PATH=${CAFFE_ON_SPARK}/caffe-public/distribute/lib:${CAFFE_ON_SPARK}/caffe-distri/distribute/lib
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/local/cuda-7.0/lib64:/usr/local/mkl/lib/intel64/
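
Before moving on, it is worth confirming that the native libraries referenced by LD_LIBRARY_PATH were actually produced (the exact .so file names may vary by platform):

ls ${CAFFE_ON_SPARK}/caffe-public/distribute/lib
ls ${CAFFE_ON_SPARK}/caffe-distri/distribute/lib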
  5. Install Apache Hadoop 2.6 per http://hadoop.apache.org/releases.html and Apache Spark 1.6.0 per http://spark.apache.org/downloads.html.
${CAFFE_ON_SPARK}/scripts/local-setup-hadoop.sh
export HADOOP_HOME=$(pwd)/hadoop-2.6.4
export YARN_CONF_DIR=${HADOOP_HOME}/etc/hadoop
cp ${CAFFE_ON_SPARK}/scripts/*.xml ${HADOOP_HOME}/etc/hadoop

${CAFFE_ON_SPARK}/scripts/local-setup-spark.sh
export SPARK_HOME=$(pwd)/spark-1.6.0-bin-hadoop2.6

export PATH=${HADOOP_HOME}/bin:${SPARK_HOME}/bin:${PATH}
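
A quick check that both stacks now resolve from the updated PATH:

hadoop version
spark-submit --version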

If you cannot ssh to localhost without a passphrase, execute the following commands:

ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
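
You can confirm passphrase-less login works before starting any daemons (the StrictHostKeyChecking option just avoids the first-connection prompt):

ssh -o StrictHostKeyChecking=no localhost 'echo ssh OK'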
  6. Start the YARN cluster
${HADOOP_HOME}/bin/hdfs namenode -format
${HADOOP_HOME}/sbin/start-dfs.sh
${HADOOP_HOME}/sbin/start-yarn.sh
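
Once the daemons are up, jps (shipped with the JDK) should list the HDFS and YARN processes, and a trivial HDFS command should succeed:

jps
hadoop fs -ls /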
  7. Install the mnist and cifar10 datasets into HDFS
hadoop fs -mkdir -p /projects/machine_learning/image_dataset

${CAFFE_ON_SPARK}/scripts/setup-mnist.sh
hadoop fs -put -f ${CAFFE_ON_SPARK}/data/mnist_*_lmdb hdfs:///projects/machine_learning/image_dataset/

${CAFFE_ON_SPARK}/scripts/setup-cifar10.sh
hadoop fs -put -f ${CAFFE_ON_SPARK}/data/cifar10_*_lmdb hdfs:///projects/machine_learning/image_dataset/
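
Verify that both datasets landed in HDFS:

hadoop fs -ls /projects/machine_learning/image_dataset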

Adjust data/lenet_memory_solver.prototxt and data/cifar10_quick_solver.prototxt to the appropriate mode:

solver_mode: CPU # set to GPU if you run on GPU nodes
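
If all your nodes are CPU-only, a sed edit can flip both solver files at once (GNU sed assumed; BSD sed needs -i ''):

sed -i 's/^solver_mode: .*/solver_mode: CPU/' \
    ${CAFFE_ON_SPARK}/data/lenet_memory_solver.prototxt \
    ${CAFFE_ON_SPARK}/data/cifar10_quick_solver.prototxt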
  8. Train a DNN network using CaffeOnSpark with 2 Spark executors over an Ethernet connection. If you have an InfiniBand interface, use "-connection infiniband" instead.
export SPARK_WORKER_INSTANCES=2 
export DEVICES=1
hadoop fs -rm -f hdfs:///mnist.model
hadoop fs -rm -r -f hdfs:///mnist_features_result
spark-submit --master yarn --deploy-mode cluster \
    --num-executors ${SPARK_WORKER_INSTANCES} \
    --files ${CAFFE_ON_SPARK}/data/lenet_memory_solver.prototxt,${CAFFE_ON_SPARK}/data/lenet_memory_train_test.prototxt \
    --conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}" \
    --conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}" \
    --class com.yahoo.ml.caffe.CaffeOnSpark  \
    ${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
        -train \
        -features accuracy,loss -label label \
        -conf lenet_memory_solver.prototxt \
        -devices ${DEVICES} \
        -connection ethernet \
        -model hdfs:///mnist.model \
        -output hdfs:///mnist_features_result
hadoop fs -ls hdfs:///mnist.model
hadoop fs -cat hdfs:///mnist_features_result/*

The training produces a model file plus periodic snapshots (named after the snapshot_prefix in the solver prototxt):

-rw-r--r--   3 root supergroup    1725052 2016-02-20 00:57 /mnist_lenet.model
-rw-r--r--   3 root supergroup    1725052 2016-02-20 00:57 /mnist_lenet_iter_10000.caffemodel
-rw-r--r--   3 root supergroup    1724462 2016-02-20 00:57 /mnist_lenet_iter_10000.solverstate
-rw-r--r--   3 root supergroup    1725052 2016-02-20 00:56 /mnist_lenet_iter_5000.caffemodel
-rw-r--r--   3 root supergroup    1724461 2016-02-20 00:56 /mnist_lenet_iter_5000.solverstate

The feature result file should look like:

{"SampleID":"00009597","accuracy":[1.0],"loss":[0.028171852],"label":[2.0]}
{"SampleID":"00009598","accuracy":[1.0],"loss":[0.028171852],"label":[6.0]}
{"SampleID":"00009599","accuracy":[1.0],"loss":[0.028171852],"label":[1.0]}
{"SampleID":"00009600","accuracy":[0.97],"loss":[0.0677709],"label":[5.0]}
{"SampleID":"00009601","accuracy":[0.97],"loss":[0.0677709],"label":[0.0]}
{"SampleID":"00009602","accuracy":[0.97],"loss":[0.0677709],"label":[1.0]}
{"SampleID":"00009603","accuracy":[0.97],"loss":[0.0677709],"label":[2.0]}
{"SampleID":"00009604","accuracy":[0.97],"loss":[0.0677709],"label":[3.0]}
{"SampleID":"00009605","accuracy":[0.97],"loss":[0.0677709],"label":[4.0]}

You can run the same steps for the cifar10 dataset:

export SPARK_WORKER_INSTANCES=2 
export DEVICES=1
hadoop fs -rm -f hdfs:///cifar10.model.h5
hadoop fs -rm -r -f hdfs:///cifar10_features_result
spark-submit --master yarn --deploy-mode cluster \
    --num-executors ${SPARK_WORKER_INSTANCES} \
    --files ${CAFFE_ON_SPARK}/data/cifar10_quick_solver.prototxt,${CAFFE_ON_SPARK}/data/cifar10_quick_train_test.prototxt,${CAFFE_ON_SPARK}/data/mean.binaryproto \
    --conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}" \
    --conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}" \
    --class com.yahoo.ml.caffe.CaffeOnSpark  \
    ${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
        -train \
        -features accuracy,loss -label label \
        -conf cifar10_quick_solver.prototxt \
        -devices ${DEVICES} \
        -connection ethernet \
        -model hdfs:///cifar10.model.h5 \
        -output hdfs:///cifar10_features_result
hadoop fs -ls hdfs:///cifar10.model.h5
hadoop fs -cat hdfs:///cifar10_features_result/*
  9. Access CaffeOnSpark from Python

See the "Get started with python on CaffeOnSpark" wiki page.

  10. Shut down the YARN cluster
${HADOOP_HOME}/sbin/stop-yarn.sh
${HADOOP_HOME}/sbin/stop-dfs.sh
rm -rf /tmp/hadoop-${USER}
rm -rf ${HADOOP_HOME}/logs