This course consists of four lab assignments that cover different topics of the course:
- Lab 1: In the first lab assignment you will practice the basics of data-intensive programming by setting up HDFS and Hadoop MapReduce and implementing a simple application on top of them.
- Lab 2: The second lab assignment covers Spark and Spark SQL, and you will learn how to use them to process and analyze data.
- Lab 3: This lab assignment focuses on processing streaming data. You will work with Spark Streaming to process data online as it flows.
- Lab 4: In the last assignment you will work with GraphX to process graph-based data.
We use IPython/Jupyter notebooks for the lab assignments. Notebooks are documents that contain both programming code (e.g., Python or Scala) and human-readable text elements (e.g., paragraphs, figures, and links). Since we use the Scala programming language and the Spark platform in our assignments, in addition to IPython/Jupyter we also need to install ISpark. Below, we first explain how to install IPython/Jupyter, and then present the ISpark installation steps. Note that these steps are given for a Linux operating system, so if you do not have Linux, you need to install it either on your machine or in a VirtualBox virtual machine. You can download VirtualBox from its page. You can also find various ready-to-use Linux distribution images for VirtualBox here.
The steps to install IPython/Jupyter:
- Install pip, the package management system used to install and manage software packages written in Python.
sudo apt-get install python-dev libncurses-dev python-pip
- Use pip to install IPython. The following command downloads and installs IPython and its main optional dependencies for the notebook, qtconsole, tests, and other functionality.
sudo pip install ipython[all]==3.2.1
- Download and install Java and Spark on your machine. The easiest way to install Spark is to download a "pre-built for CDH 4" package from here; a hedged example follows.
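For example, the installation could look like the following. The Java package name, Spark version, and mirror URL here are assumptions, not the course's official choices; use the release linked from the course page, and make sure its bundled py4j version matches the path used in the next step.
sudo apt-get install openjdk-7-jdk
wget https://archive.apache.org/dist/spark/spark-1.2.0/spark-1.2.0-bin-cdh4.tgz
tar -xzf spark-1.2.0-bin-cdh4.tgz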
- Set the environment variables.
export JAVA_HOME=<JAVA_PATH>
export SPARK_HOME=<SPARK_PATH>
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$SPARK_HOME/python
export PYSPARK_SUBMIT_ARGS="--master local[2] pyspark-shell"
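To make these variables persist across sessions, you can append the export lines above to your ~/.bashrc (assuming you use bash) and reload it; a quick sanity check:
source ~/.bashrc
echo $JAVA_HOME $SPARK_HOME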
- Create an IPython profile to support Python programming language.
ipython profile create pyspark
- Edit the created profile by copying the following lines into 00-pyspark-setup.py. This file is at ~/.ipython/profile_pyspark/startup/00-pyspark-setup.py.
import os
import sys

# Locate the Spark installation via the SPARK_HOME variable set earlier
spark_home = os.environ.get('SPARK_HOME', None)

# Put PySpark and its bundled Py4J library on the Python path
sys.path.insert(0, os.path.join(spark_home, 'python'))
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.8.2.1-src.zip'))

# Start the PySpark shell, which creates the SparkContext 'sc' in the notebook
execfile(os.path.join(spark_home, 'python/pyspark/shell.py'))
- Run and test the IPython notebook with the following command.
ipython notebook --profile=pyspark --debug
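Once the notebook server starts, open a new notebook and check that the Spark context is available; a minimal test cell (assuming the profile loaded correctly, shell.py has already created sc):
sc.parallelize(range(100)).map(lambda x: x * x).sum()  # expected result: 328350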
ISpark is an Apache Spark-shell backend for IPython. The steps to install ISpark:
- Install Maven first, if you do not have it.
sudo apt-get install maven
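You can verify that Maven is on your PATH with:
mvn -version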
- Download ISpark from here.
- ISpark needs to be compiled and packaged into a jar by Maven before it can be submitted and deployed. Extract the ISpark file you downloaded and run the following commands.
cd ISpark-master
./mvn-install.sh
- Create an ISpark profile.
ipython profile create spark
- Edit the profile by copying the following lines into ipython_config.py, which is at ~/.ipython/profile_spark/ipython_config.py. Below we refer to ispark-core-assembly-0.2.0-SNAPSHOT.jar, which was built by mvn-install.sh above and is located at ISpark-master/core/target/scala-2.10.
import os

# Get the IPython configuration object for this profile
c = get_config()

spark_home = os.environ['SPARK_HOME']
master = 'local[2]'  # run Spark locally with two worker threads

# Launch the ISpark kernel through spark-submit instead of the default Python kernel
c.KernelManager.kernel_cmd = [spark_home+"/bin/spark-submit",
    "--master", master,
    "--class", "org.tribbloid.ispark.Main",
    "--executor-memory", "2G",
    "<PATH ON YOUR MACHINE>/ispark-core-assembly-0.2.0-SNAPSHOT.jar",
    "--profile", "{connection_file}",
    "--parent"]
- Run and test the notebook.
ipython notebook --profile=spark --debug
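When the notebook opens, you can check that the Scala kernel and the Spark context work; a minimal test cell (assuming ISpark exposes the SparkContext as sc, as the Spark shell does):
sc.parallelize(1 to 100).map(x => x * x).sum()  // expected result: 338350.0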
The assignments can be done individually or in groups of two. For labs 1 and 3, you should hand in the implemented code in a zip file, and for the other labs, you just need to deliver the completed IPython notebook files.