Spark Kmeans for Spam

Implementation of Kmeans algorithm on Spark system for clustering spam emails

Getting Started #

Requirements

Spark 0.7.0
Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write.

sbt 0.11.3
A build tool for Scala and Java projects. It requires Java 1.6 or later.

Scala 2.9.2
Scala is a general purpose programming language designed to express common programming patterns in a concise, elegant, and type-safe way.

Usage

Run runKmeans.sh with these required parameters:

path to SPARK root
0 to only run or 1 to package in jar and run (required to deploy on cluster nodes)
spark configuration file (see below for format)
parameter file (see below for format)
outLabel optional label to add to output folder

E.g.

./runKmeans.sh /user/spark/spark-0.7.0 0 spark.conf parameter.conf kmeansOn1000spams

Configuration files

spark.conf

Parameter	Meaning
host	master URL passed to Spark
appName	name of job
inputfile	a path to input json dataset, either a local one on the machine or a hdfs://
evalOutput	1 to evaluate result, 0 otherwise

This is an example:

{
  "host": "spark://10.10.14.3:7077",
	"appName": "SparkKmeans",
	"inputFile": "hdfs://10.10.14.3:8020/user/spark/spam/spam10_6.json",
	"evalOutput": 1
}

Note: Master url must be follow these rules

Master URL	Meaning
local	Run Spark locally with one worker thread (i.e. no parallelism at all)
local[K]	Run Spark locally with K worker threads (ideally, set this to the number of cores on your machine)
spark://HOST:PORT	Connect to the given Spark standalone cluster master

parameter.conf

Parameter	Meaning
mode	method for kmeans, choose "PartialSums" or "Standard" (see Overview on algorithm )
initialCentroids	number of initial Centroids, a first approximation can be sqrt(inputSize/2)
convergeDist	Stopping condition: old and new centroids between iterations don't change more than this value.
maxIter	Stopping condition: iterations of the algorithm are limited to this value
numSamplesForMedoid	num of Elements in a cluster that will be used to update centroid for the categorical part in the Standard Kmeans (see Overview on algorithm )
weights	weights for features, values between 0 and 1. Sum of weights must be 1.

This is an example:

{
	"mode" : "PartialSums",
	"initialCentroids" : 100,
	"convergeDist" : 0.001,
	"maxIter" : 2,
	"numSamplesForMedoid" : 3,
	"weights" : {
		"space" : 0.20, 
		"time" : 0.20,
		"IP" : 0.20, 
		"uri" : 0.20, 
		"botname" : 0.20 
	}
}

Output File

The output files are in folder ./out
Each execution has its output folder ./out[outLabel_]dDATE with these following files:

File	Content
out	result Centroids
DATE.log	spark log file of executed job
parameter.conf	file with algorithm parameters used to run the program

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Spark Kmeans for Spam

Getting Started #

Requirements

Usage

Configuration files

spark.conf

parameter.conf

Output File

FilesExpand file tree

README.md

Latest commit

History

README.md

File metadata and controls

Spark Kmeans for Spam

Getting Started #

Requirements

Usage

Configuration files

spark.conf

parameter.conf

Output File