A high-performance, GPU-accelerated MD5 hash cracker built using CUDA.
This project was developed as our final course project for ECE 759: High-Performance Computing (Spring 2025) at UW–Madison.
The goal of this project is to design and implement a parallelized MD5 hash attack using CUDA to efficiently brute-force 7-character alphanumeric passwords.
Modern GPUs can drastically accelerate hash computation through data parallelism, and this project demonstrates that speedup against a CPU baseline.
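The core data-parallel decomposition: every candidate password corresponds to a linear index into the 62ⁿ search space, so each CUDA thread can derive and hash its own candidate independently. A minimal sketch of that index-to-candidate mapping is shown below (hypothetical; the charset string and the names `d_charset` / `index_to_candidate` are illustrative, not taken from the repository).

```cuda
// Hypothetical sketch: map a linear index to a base-62 candidate password.
// Names (d_charset, index_to_candidate) are illustrative, not from the repo.
__constant__ char d_charset[63] =
    "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";

__device__ void index_to_candidate(unsigned long long idx, int len, char* out)
{
    // Treat idx as a base-62 number; each digit selects one charset character.
    for (int i = 0; i < len; ++i) {
        out[i] = d_charset[idx % 62ULL];
        idx /= 62ULL;
    }
    out[len] = '\0';
}
// Each thread hashes the candidate at (global thread id + batch offset),
// so a grid of threads sweeps a contiguous slice of the 62^n space.
```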
Project Proposal: https://docs.google.com/document/d/11ZABF44t7TLryVB_koVmSWl0wW3Ki8J3/edit?usp=sharing&ouid=103892982613643687776&rtpof=true&sd=true
Project Presentation: https://docs.google.com/presentation/d/1lvsB7_WiYRj-wiXRcinhw2UV49OnOkf9Ii4djbQtuHM/edit?usp=sharing
Project Report: https://docs.google.com/document/d/1knIPjmDEbHlpaQAnQpM1v_nIzg3noSsbppSNvpfePjo/edit?usp=sharing
- Elice Priyadarshini ([email protected])
- Michael Pan ([email protected])
- Saketh Katta ([email protected])
- Ankit Mohapatra ([email protected])
- Fahad Touseef ([email protected])
- K M Jamiul Haque ([email protected])
- CUDA Toolkit (v12.x)
- C++ / CUDA C
- Thrust & CUB (for parallel primitives)
- OpenMP (CPU baseline)
- Python (matplotlib & pandas for plotting/benchmarking)
- NVIDIA Nsight Compute (profiling & roofline analysis)
### GPU executable
```bash
cd src
nvcc -O3 -arch=sm_86 main.cu -o ../bin/gpu_crack.exe
# (Optional) Enforce ≤32 registers/thread:
nvcc -O3 -arch=sm_86 -maxrregcount=32 main.cu -o ../bin/gpu_crack.exe
```
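As a per-kernel alternative to the file-wide `-maxrregcount=32` flag, the occupancy target can also be expressed in the source with `__launch_bounds__`. A minimal sketch follows; the kernel name and the block size of 512 are assumptions (matching the reproduce step later in this README), and this hints a register budget rather than enforcing a hard cap.

```cuda
// Hypothetical sketch: __launch_bounds__ lets nvcc derive a register budget
// from the intended block size, instead of the file-wide -maxrregcount cap.
__global__ void __launch_bounds__(512)   // launched with <=512 threads per block
brute_force_kernel(/* arguments elided */)
{
    // kernel body elided
}
```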
### CPU executable
```bash
cd src
cl /O2 /openmp md5_cpu.cpp /Fe:../bin/cpu_crack.exe
```

Note: Ensure OpenMP and the standard C++ libraries are installed.
```bash
bin\gpu_crack.exe <32-char MD5 hex>
bin\cpu_crack.exe <32-char MD5 hex> -t 8
bin\gpu_crack.exe 5d793fc5b00a2348c3fb9ab59e5ca98a
```

- Collect timings

  ```bash
  cd benchmark
  run.bat 5d793fc5b00a2348c3fb9ab59e5ca98a 8
  ```

  Outputs results.csv.

- Plot results

  ```bash
  python plot_results.py
  ```

- GPU profiling (Nsight Compute)

  ```bash
  cd src
  ncu -o brute7_report --set full --target-processes all ../bin/gpu_crack.exe 5d793fc5b00a2348c3fb9ab59e5ca98a
  ncu-ui brute7_report.ncu-rep   # Open GUI for roofline & timeline
  ```
| Item | Value |
|---|---|
| CPU | Ryzen 7 5800H (8 C / 16 T) |
| GPU | RTX 3060 Laptop (3840 CUDA cores, 6 GB, sm_86) |
| CUDA | 12.x (driver & toolkit) |
| Build flags | -O3 -std=c++17 -Xcompiler /openmp; -maxrregcount=32 for the GPU executable |
All benchmarks were run on AC power with the dGPU enabled and no other heavy workloads running.
| Phase | Kernel change | Rationale | Δ (Ghash/s, length 5) | Key observations |
|---|---|---|---|---|
| Baseline | naïve 64-round for loop | starting point | 6.22 | GPU ≈ 40 Ghash/s at length 6; CPU wins only for very short searches |
| Phase 1 – Unroll | manual 4 × 16 round unroll | removes loop overhead, exposes ILP | +1.7 % → 6.33 | big jump at length 4 (49 → 82 Ghash/s); negligible cost |
| Phase 2 – Shared M | copy the 16 message-words to shared memory once per thread | avoid 64 constant-mem reads | +1.7 % → 6.44 | slight loss at length 4 (copy cost > benefit) but +1–2 % for length ≥ 5 |
Take-away: after both phases the GPU sustains ~6.4 Ghash/s on the 916 M-candidate length-5 space—35× faster than the 16-thread CPU baseline.
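For reference, the two phases map to code roughly as follows. This is a minimal, hypothetical sketch rather than the project's actual kernel: it uses the standard RFC 1321 constants, stages the message into a per-thread array (rather than shared memory) for brevity, and shows only the first 4 of the 64 steps.

```cuda
// Hypothetical sketch of Phase 1 + Phase 2 (not the project's kernel).
#include <cstdint>

__device__ __forceinline__ uint32_t rotl(uint32_t x, int n) {
    return (x << n) | (x >> (32 - n));
}

__global__ void md5_phase_sketch(const uint32_t* __restrict__ msg, uint32_t* out)
{
    // Phase 2 idea: stage the 16 message words on-chip once, instead of
    // re-reading constant memory inside every one of the 64 steps.
    uint32_t M[16];
    #pragma unroll
    for (int i = 0; i < 16; ++i) M[i] = msg[i];

    uint32_t a = 0x67452301u, b = 0xefcdab89u,
             c = 0x98badcfeu, d = 0x10325476u;

    // Phase 1 idea: hand-unrolled steps (no round loop, no index arithmetic).
    // Step formula and constants follow RFC 1321; the remaining 60 steps are elided.
    a = b + rotl(a + ((b & c) | (~b & d)) + M[0] + 0xd76aa478u,  7);
    d = a + rotl(d + ((a & b) | (~a & c)) + M[1] + 0xe8c7b756u, 12);
    c = d + rotl(c + ((d & a) | (~d & b)) + M[2] + 0x242070dbu, 17);
    b = c + rotl(b + ((c & d) | (~c & a)) + M[3] + 0xc1bdceeeu, 22);

    out[0] = a; out[1] = b; out[2] = c; out[3] = d;
}
```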
| Password length | Search-space (62^n) | CPU time (s) | GPU time (s) | Speed-up × | GPU hash-rate (Ghash/s) |
|---|---|---|---|---|---|
| 1 | 62 | 0.00117 | 0.00017 | 6.8× | — |
| 2 | 3 844 | 0.00138 | 0.00018 | 7.6× | 0.02 |
| 3 | 238 328 | 0.00120 | 0.00017 | 6.9× | 1.37 |
| 4 | 14.8 M | 0.00619 | 0.00021 | 29.4× | 70.2 |
| 5 | 916 M | 0.342 | 0.142 | 2.4× | 6.44 |
| 6 | 56.8 B | 21.49 | 1.40 | 15.4× | 40.6 |
| 7 | 3.52 T | 1 321 * | 148.6 * | 8.9× | 23.9 |
* Full length-7 run included to show worst-case; all shorter lengths exit when the password is found.
(PNG files live in results/; timestamps match the CSVs.)
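The hash-rate column is simply search-space size divided by wall-clock time; a quick host-side check against the length-6 row:

```cuda
// Host-side sanity check of the hash-rate column: rate = candidates / seconds.
#include <cstdio>
#include <cmath>

int main() {
    const double candidates  = std::pow(62.0, 6);  // ~56.8 B length-6 candidates
    const double gpu_seconds = 1.40;               // GPU time from the table
    std::printf("length-6 GPU rate: %.1f Ghash/s\n", candidates / gpu_seconds / 1e9);
    return 0;                                      // prints ~40.6, matching the table
}
```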
- Loop unrolling delivered the biggest gain for cache-friendly lengths (≤ 62⁴) with zero downsides.
- Shared-memory staging traded a one-off 16×32 B copy for 64 constant-mem reads: break-even at length 4, small wins for ≥ 5.
- Remaining performance is bandwidth-bound: Nsight’s roofline shows 78 % of peak L2 BW.
Further gains would need a better memory layout or warp-level charset shuffles (__shfl_sync).
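A minimal (hypothetical) sketch of that warp-level shuffle idea: lane 0 reads one charset character and `__shfl_sync` broadcasts it to the other 31 lanes, so each warp issues a single read instead of 32. The kernel and symbol names are illustrative, not from the repository.

```cuda
// Hypothetical sketch of a warp-level charset broadcast via __shfl_sync.
__constant__ char d_charset[62];

__global__ void charset_shuffle_demo(int char_index, char* out)
{
    const unsigned full_mask = 0xffffffffu;
    int c = 0;
    if ((threadIdx.x & 31) == 0)          // lane 0 performs the only load
        c = d_charset[char_index];
    c = __shfl_sync(full_mask, c, 0);     // broadcast lane 0's value to the warp
    out[blockIdx.x * blockDim.x + threadIdx.x] = static_cast<char>(c);
}
```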
```bash
# 1) build best kernel
cd src
nvcc -O3 -std=c++17 -maxrregcount=32 -DBLOCK_SIZE=512 -o ../bin/md5cracker.exe main.cu

# 2) run automated benchmark suite
cd ../benchmark
python benchmark.py --max-len 6
python throughput.py
```
Contributions, issues, and feature requests are welcome!
Please fork the repository and submit a pull request.
This project is licensed under the MIT License.
See the LICENSE file for details.



