EMBER 2024 M3BA Multi-Output

A Multi-Label Malware Behavior Analyser

This project applies machine learning techniques to the multi-output, multi-label classification task presented by the EMBER 2024 dataset (Behavior subset). The goal is to identify and predict multiple behavioral traits of malware samples simultaneously using classical ML algorithms adapted for multi-output learning. It was developed by Alessio Murgioni and Riccardo Deidda as the final project for the Machine Learning course.

Overview

The EMBER 2024 dataset, originally created for malicious/benign classification, also provides behavioral features extracted from malware. This project focuses exclusively on the behavioral subset, which encodes malware categories such as ransomware and backdoors.

Each sample may exhibit multiple behaviors at once — making this a multi-output, multi-label classification problem rather than a standard single-label one.
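
To make the multi-label framing concrete, the sketch below encodes per-sample behavior tags as a binary indicator matrix, which is the target format scikit-learn expects for multi-label problems. The tag lists are hypothetical and not taken from the dataset; how this project actually loads and encodes its labels is not shown here.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical behavior tags for three samples (illustrative only).
samples = [
    ["ransomware"],              # one behavior
    ["backdoor", "ransomware"],  # two behaviors at once
    [],                          # no tagged behavior
]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(samples)  # one row per sample, one column per tag

print(list(mlb.classes_))  # column order: tags sorted alphabetically
print(Y)                   # 1 where the sample exhibits the behavior
```

Each column of Y is one binary target, so a sample with several behaviors simply has several ones in its row.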

Implemented Models

The following machine learning models are implemented and compared:

K-Nearest Neighbors (KNN)
Random Forest (RF)
Gradient Boosting (GB)

Random Forest and Gradient Boosting are wrapped with scikit-learn meta-estimators to handle multiple labels (scikit-learn's GradientBoostingClassifier does not accept multi-label targets on its own):

OneVsRestClassifier: trains a separate binary classifier for each label.
MultiOutputClassifier: fits one independent model per output column; correlations between targets are not modeled.

This design allows direct comparison between base learners and multi-output strategies.
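
A minimal sketch of the wrapping strategy described above, assuming synthetic data from make_multilabel_classification as a stand-in for the behavior features and labels. The hyperparameters and the dictionary keys are illustrative choices, not the project's actual configuration.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.multioutput import MultiOutputClassifier

# Synthetic stand-in for the behavior subset (assumption, not the real data).
X, Y = make_multilabel_classification(
    n_samples=200, n_features=20, n_classes=4, random_state=0
)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.25, random_state=0
)

# Each base learner is paired with each multi-label strategy.
models = {
    "RF + OneVsRest": OneVsRestClassifier(
        RandomForestClassifier(n_estimators=20, random_state=0)),
    "RF + MultiOutput": MultiOutputClassifier(
        RandomForestClassifier(n_estimators=20, random_state=0)),
    "GB + OneVsRest": OneVsRestClassifier(
        GradientBoostingClassifier(n_estimators=20, random_state=0)),
    "GB + MultiOutput": MultiOutputClassifier(
        GradientBoostingClassifier(n_estimators=20, random_state=0)),
}

predictions = {}
for name, model in models.items():
    model.fit(X_train, Y_train)              # one binary model per label column
    predictions[name] = model.predict(X_test)
    print(name, predictions[name].shape)     # (n_test_samples, n_labels)
```

Both wrappers train one binary classifier per label, so on an indicator matrix they are expected to behave similarly; the comparison mainly isolates the effect of the base learner.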

Each model is evaluated using the following performance metrics:

Metric	Description
precision	Proportion of predicted positive labels that are correct
recall	Proportion of true positive labels that are retrieved
f1_macro	F1-score computed per label, then averaged with equal weight for each label
f1_micro	F1-score computed from true positives, false positives, and false negatives pooled across all labels
hamm_loss	Fraction of incorrectly predicted labels (Hamming loss)
class_report	Detailed classification report per label (precision, recall, F1)
conf_matrix	Confusion matrix representing true vs. predicted labels

Together, these metrics are used to evaluate the performance of every implemented model.
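
As an illustration of how these metrics could be computed with scikit-learn, the sketch below uses a small made-up set of labels and predictions. The dictionary keys mirror the metric names in the table; macro averaging for precision and recall, and the use of multilabel_confusion_matrix for conf_matrix, are assumptions, since the source does not state them.

```python
import numpy as np
from sklearn.metrics import (
    classification_report,
    f1_score,
    hamming_loss,
    multilabel_confusion_matrix,
    precision_score,
    recall_score,
)

# Toy ground truth and predictions: 4 samples, 3 labels (made-up values).
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 1]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [1, 0, 0],
                   [0, 1, 1]])

results = {
    # Averaging mode for precision/recall is an assumption (macro).
    "precision": precision_score(y_true, y_pred, average="macro", zero_division=0),
    "recall": recall_score(y_true, y_pred, average="macro", zero_division=0),
    "f1_macro": f1_score(y_true, y_pred, average="macro", zero_division=0),
    "f1_micro": f1_score(y_true, y_pred, average="micro", zero_division=0),
    "hamm_loss": hamming_loss(y_true, y_pred),
    "class_report": classification_report(y_true, y_pred, zero_division=0),
    # One 2x2 matrix per label, laid out as [[TN, FP], [FN, TP]].
    "conf_matrix": multilabel_confusion_matrix(y_true, y_pred),
}

for key in ("precision", "recall", "f1_macro", "f1_micro", "hamm_loss"):
    print(f"{key}: {results[key]:.4f}")
print(results["class_report"])
```

In this toy case the per-label F1 scores are 1, 1/2, and 2/3, so the macro F1 is 13/18, while the micro F1 pools 4 true positives, 1 false positive, and 2 false negatives to give 8/11. The two averages diverge more strongly when label frequencies are imbalanced.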

