A high-performance, GPU-accelerated MD5 hash cracker built using CUDA.
This project was developed as our final course project for ECE 759: High-Performance Computing (Spring 2025) at UW–Madison.
The goal of this project is to design and implement a parallelized MD5 hash attack using CUDA to efficiently brute-force 7-character alphanumeric passwords.
Modern GPUs can drastically accelerate hash computation through data parallelism, and this project demonstrates that speedup against a CPU baseline.
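The core data-parallel decomposition: every candidate password corresponds to a linear index into the 62ⁿ search space, so each CUDA thread can derive and hash its own candidate independently. A minimal sketch of that index-to-candidate mapping is shown below (hypothetical; the charset string and the names `d_charset` / `index_to_candidate` are illustrative, not taken from the repository).

```cuda
// Hypothetical sketch: map a linear index to a base-62 candidate password.
// Names (d_charset, index_to_candidate) are illustrative, not from the repo.
__constant__ char d_charset[63] =
    "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";

__device__ void index_to_candidate(unsigned long long idx, int len, char* out)
{
    // Treat idx as a base-62 number; each digit selects one charset character.
    for (int i = 0; i < len; ++i) {
        out[i] = d_charset[idx % 62ULL];
        idx /= 62ULL;
    }
    out[len] = '\0';
}
// Each thread hashes the candidate at (global thread id + batch offset),
// so a grid of threads sweeps a contiguous slice of the 62^n space.
```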
Project Proposal: https://docs.google.com/document/d/11ZABF44t7TLryVB_koVmSWl0wW3Ki8J3/edit?usp=sharing&ouid=103892982613643687776&rtpof=true&sd=true
Project Presentation: https://docs.google.com/presentation/d/1lvsB7_WiYRj-wiXRcinhw2UV49OnOkf9Ii4djbQtuHM/edit?usp=sharing
Project Report: https://docs.google.com/document/d/1knIPjmDEbHlpaQAnQpM1v_nIzg3noSsbppSNvpfePjo/edit?usp=sharing
- Elice Priyadarshini ([email protected])
- Michael Pan ([email protected])
- Saketh Katta ([email protected])
- Ankit Mohapatra ([email protected])
- Fahad Touseef ([email protected])
- K M Jamiul Haque ([email protected])
- CUDA Toolkit (v12.x)
- C++ / CUDA C
- Thrust & CUB (for parallel primitives)
- OpenMP (CPU baseline)
- Python (matplotlib & pandas for plotting/benchmarking)
- NVIDIA Nsight Compute (profiling & roofline analysis)
### GPU executable
```bash
cd src
nvcc -O3 -arch=sm_86 main.cu -o ../bin/gpu_crack.exe
# (Optional) Enforce ≤32 registers/thread:
nvcc -O3 -arch=sm_86 -maxrregcount=32 main.cu -o ../bin/gpu_crack.exe
```
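As a per-kernel alternative to the file-wide `-maxrregcount=32` flag, the occupancy target can also be expressed in the source with `__launch_bounds__`. A minimal sketch follows; the kernel name and the block size of 512 are assumptions (matching the reproduce step later in this README), and this hints a register budget rather than enforcing a hard cap.

```cuda
// Hypothetical sketch: __launch_bounds__ lets nvcc derive a register budget
// from the intended block size, instead of the file-wide -maxrregcount cap.
__global__ void __launch_bounds__(512)   // launched with <=512 threads per block
brute_force_kernel(/* arguments elided */)
{
    // kernel body elided
}
```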
### CPU executable
```bash
cd src
cl /O2 /openmp md5_cpu.cpp /Fe:../bin/cpu_crack.exe
```

Note: Ensure OpenMP and the standard C++ libraries are installed.
```bash
bin\gpu_crack.exe <32-char MD5 hex>
bin\cpu_crack.exe <32-char MD5 hex> -t 8
bin\gpu_crack.exe 5d793fc5b00a2348c3fb9ab59e5ca98a
```

- Collect timings

  ```bash
  cd benchmark
  run.bat 5d793fc5b00a2348c3fb9ab59e5ca98a 8
  ```

  Outputs results.csv.

- Plot results

  ```bash
  python plot_results.py
  ```

- GPU profiling (Nsight Compute)

  ```bash
  cd src
  ncu -o brute7_report --set full --target-processes all ../bin/gpu_crack.exe 5d793fc5b00a2348c3fb9ab59e5ca98a
  ncu-ui brute7_report.ncu-rep   # Open GUI for roofline & timeline
  ```
| Item | Value |
|---|---|
| CPU | Ryzen 7 5800H (8 C / 16 T) |
| GPU | RTX 3060 Laptop (3840 CUDA cores, 6 GB, sm_86) |
| CUDA | 12.x (driver & toolkit) |
| Build flags | -O3 -std=c++17 -Xcompiler /openmp; -maxrregcount=32 for the GPU executable |
All benchmarks were run on AC power with the dGPU enabled and no other heavy workloads running.
| Phase | Kernel change | Rationale | Δ (Ghash/s, length 5) | Key observations |
|---|---|---|---|---|
| Baseline | naïve 64-round for loop | starting point | 6.22 | GPU ≈ 40 Ghash/s at length 6; CPU wins only for very short searches |
| Phase 1 – Unroll | manual 4 × 16 round unroll | removes loop overhead, exposes ILP | +1.7 % → 6.33 | big jump at length 4 (49 → 82 Ghash/s); negligible cost |
| Phase 2 – Shared M | copy the 16 message-words to shared memory once per thread | avoid 64 constant-mem reads | +1.7 % → 6.44 | slight loss at length 4 (copy cost > benefit) but +1–2 % for length ≥ 5 |
Take-away: after both phases the GPU sustains ~6.4 Ghash/s on the 916 M-candidate length-5 space—35× faster than the 16-thread CPU baseline.
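For reference, the two phases map to code roughly as follows. This is a minimal, hypothetical sketch rather than the project's actual kernel: it uses the standard RFC 1321 constants, stages the message into a per-thread array (rather than shared memory) for brevity, and shows only the first 4 of the 64 steps.

```cuda
// Hypothetical sketch of Phase 1 + Phase 2 (not the project's kernel).
#include <cstdint>

__device__ __forceinline__ uint32_t rotl(uint32_t x, int n) {
    return (x << n) | (x >> (32 - n));
}

__global__ void md5_phase_sketch(const uint32_t* __restrict__ msg, uint32_t* out)
{
    // Phase 2 idea: stage the 16 message words on-chip once, instead of
    // re-reading constant memory inside every one of the 64 steps.
    uint32_t M[16];
    #pragma unroll
    for (int i = 0; i < 16; ++i) M[i] = msg[i];

    uint32_t a = 0x67452301u, b = 0xefcdab89u,
             c = 0x98badcfeu, d = 0x10325476u;

    // Phase 1 idea: hand-unrolled steps (no round loop, no index arithmetic).
    // Step formula and constants follow RFC 1321; the remaining 60 steps are elided.
    a = b + rotl(a + ((b & c) | (~b & d)) + M[0] + 0xd76aa478u,  7);
    d = a + rotl(d + ((a & b) | (~a & c)) + M[1] + 0xe8c7b756u, 12);
    c = d + rotl(c + ((d & a) | (~d & b)) + M[2] + 0x242070dbu, 17);
    b = c + rotl(b + ((c & d) | (~c & a)) + M[3] + 0xc1bdceeeu, 22);

    out[0] = a; out[1] = b; out[2] = c; out[3] = d;
}
```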
| Password length | Search-space (62^n) | CPU time (s) | GPU time (s) | Speed-up × | GPU hash-rate (Ghash/s) |
|---|---|---|---|---|---|
| 1 | 62 | 0.00117 | 0.00017 | 6.8× | — |
| 2 | 3 844 | 0.00138 | 0.00018 | 7.6× | 0.02 |
| 3 | 238 328 | 0.00120 | 0.00017 | 6.9× | 1.37 |
| 4 | 14.8 M | 0.00619 | 0.00021 | 29.4× | 70.2 |
| 5 | 916 M | 0.342 | 0.142 | 2.4× | 6.44 |
| 6 | 56.8 B | 21.49 | 1.40 | 15.4× | 40.6 |
| 7 | 3.52 T | 1 321 * | 148.6 * | 8.9× | 23.9 |
* Full length-7 run included to show worst-case; all shorter lengths exit when the password is found.
(PNG files live in results/; timestamps match the CSVs.)
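The hash-rate column is simply search-space size divided by wall-clock time; a quick host-side check against the length-6 row:

```cuda
// Host-side sanity check of the hash-rate column: rate = candidates / seconds.
#include <cstdio>
#include <cmath>

int main() {
    const double candidates  = std::pow(62.0, 6);  // ~56.8 B length-6 candidates
    const double gpu_seconds = 1.40;               // GPU time from the table
    std::printf("length-6 GPU rate: %.1f Ghash/s\n", candidates / gpu_seconds / 1e9);
    return 0;                                      // prints ~40.6, matching the table
}
```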
- Loop unrolling delivered the biggest gain for cache-friendly lengths (≤ 62⁴) with zero downsides.
- Shared-memory staging traded a one-off 16×32 B copy for 64 constant-mem reads: break-even at length 4, small wins for ≥ 5.
- Remaining performance is bandwidth-bound: Nsight’s roofline shows 78 % of peak L2 BW.
Further gains would need a better memory layout or warp-level charset shuffles (__shfl_sync).
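A minimal (hypothetical) sketch of that warp-level shuffle idea: lane 0 reads one charset character and `__shfl_sync` broadcasts it to the other 31 lanes, so each warp issues a single read instead of 32. The kernel and symbol names are illustrative, not from the repository.

```cuda
// Hypothetical sketch of a warp-level charset broadcast via __shfl_sync.
__constant__ char d_charset[62];

__global__ void charset_shuffle_demo(int char_index, char* out)
{
    const unsigned full_mask = 0xffffffffu;
    int c = 0;
    if ((threadIdx.x & 31) == 0)          // lane 0 performs the only load
        c = d_charset[char_index];
    c = __shfl_sync(full_mask, c, 0);     // broadcast lane 0's value to the warp
    out[blockIdx.x * blockDim.x + threadIdx.x] = static_cast<char>(c);
}
```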
```bash
# 1) build best kernel
cd src
nvcc -O3 -std=c++17 -maxrregcount=32 -DBLOCK_SIZE=512 -o ../bin/md5cracker.exe main.cu

# 2) run automated benchmark suite
cd ../benchmark
python benchmark.py --max-len 6
python throughput.py
```
Contributions, issues, and feature requests are welcome!
Please fork the repository and submit a pull request.
This project is licensed under the MIT License.
See the LICENSE file for details.



