Add fault injection training robustness research study and example #56
Conversation
Co-authored-by: Will-D-AER <197624355+Will-D-AER@users.noreply.github.com>
@copilot the examples folder is very cluttered now. Create a folder specifically for this, similar to the shipsnet folder. Try not to add too many files.
- Create fault_injection_training/ folder similar to shipsnet/
- Move all 7 files into organized structure
- Consolidate documentation (2 READMEs → 1)
- Update main examples README with reference
- Reduce clutter in examples root folder

Co-authored-by: Will-D-AER <197624355+Will-D-AER@users.noreply.github.com>
Organized into `fault_injection_training/`
/gemini review |
Code Review
This pull request introduces a valuable research study and example on improving model robustness through fault injection during training. The contribution is well-structured, providing a standalone Python script, an interactive Jupyter notebook, and comprehensive documentation. The code is generally clean and follows a logical experimental flow. I've identified a few areas for improvement, mainly concerning naming conventions for clarity, removal of unused code, and a minor correction in the documentation. Overall, this is an excellent addition to the examples.
```python
class SimpleCNN(nn.Module):
    """Simple feedforward network for binary classification"""

    def __init__(self, input_size=2, hidden_sizes=[64, 32, 16]):
        super(SimpleCNN, self).__init__()
```
The class is named SimpleCNN, but its architecture is a feedforward neural network (MLP), not a Convolutional Neural Network. This is misleading. I suggest renaming it to SimpleMLP for clarity. Remember to update the class instantiations at lines 470 and 479 accordingly.
Additionally, the super() call uses a Python 2-style syntax. It can be updated to the more concise Python 3 style.
```diff
-class SimpleCNN(nn.Module):
-    """Simple feedforward network for binary classification"""
-    def __init__(self, input_size=2, hidden_sizes=[64, 32, 16]):
-        super(SimpleCNN, self).__init__()
+class SimpleMLP(nn.Module):
+    """Simple feedforward network for binary classification"""
+    def __init__(self, input_size=2, hidden_sizes=[64, 32, 16]):
+        super().__init__()
```
```python
def inject_faults_in_weights(model, fault_prob=0.01, bit_position=None):
    """
    Inject bit flips into model weights during training

    Args:
        model: PyTorch model
        fault_prob: Probability of flipping each weight
        bit_position: Specific bit to flip (None = random bit)
    """
    with torch.no_grad():
        for param in model.parameters():
            if param.requires_grad and param.dtype == torch.float32:
                # Determine which weights to flip
                mask = torch.rand_like(param) < fault_prob
                flipped_count = mask.sum().item()

                if flipped_count > 0:
                    # Convert to numpy for bit manipulation
                    param_np = param.cpu().numpy()

                    # Flip bits using SEU framework function
                    mask_np = mask.cpu().numpy()
                    for idx in np.ndindex(param_np.shape):
                        if mask_np[idx]:
                            if bit_position is not None:
                                # Flip specific bit
                                param_np[idx] = bitflip_float32_fast(param_np[idx], bit_position)
                            else:
                                # Flip random bit
                                random_bit = np.random.randint(0, 32)
                                param_np[idx] = bitflip_float32_fast(param_np[idx], random_bit)

                    # Update parameter
                    param.copy_(torch.from_numpy(param_np))
```
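`bitflip_float32_fast` comes from the SEU framework and is not shown in this diff. A minimal sketch of what such a helper typically does, assuming IEEE-754 single precision and numpy's reinterpret-cast `view` (the implementation here is illustrative, not the framework's actual code):

```python
import numpy as np

def bitflip_float32_fast(value, bit_position):
    """Flip one bit (0 = mantissa LSB, 31 = sign) in a float32's IEEE-754 encoding."""
    as_bits = np.float32(value).view(np.uint32)       # reinterpret the 32 bits as an integer
    flipped = as_bits ^ np.uint32(1 << bit_position)  # XOR toggles exactly that bit
    return flipped.view(np.float32)                   # reinterpret back as a float
```

Flipping bit 31 negates the value, while flipping a high exponent bit (23-30) can change the magnitude by many orders of magnitude, which is why high-bit SEUs are the damaging ones.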
|
```markdown
## 📚 Literature References

1. **Mitigating Multiple Single-Event Upsets** (arXiv 2502.09374) - Up to 3× improvement with fault-aware training
```
```python
class SimpleCNN(nn.Module):
    def __init__(self, hidden_sizes=[64, 32, 16]):
        super().__init__()
        layers = []
        prev = 2
        for h in hidden_sizes:
            layers.extend([nn.Linear(prev, h), nn.ReLU()])
            prev = h
        layers.extend([nn.Linear(prev, 1), nn.Sigmoid()])
        self.network = nn.Sequential(*layers)

    def forward(self, x):
        return self.network(x)


model = SimpleCNN()
print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")
```
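As a quick sanity check on the printed count: with the default hidden sizes `[64, 32, 16]` and 2 inputs, each `nn.Linear` contributes `n_in * n_out` weights plus `n_out` biases (a small arithmetic aside, not part of the notebook):

```python
# Layers: 2 -> 64 -> 32 -> 16 -> 1
sizes = [2, 64, 32, 16, 1]
total = sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))
print(total)  # 2817 (192 + 2080 + 528 + 17)
```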
Addresses research question: "How does training with fault injection improve robustness to SEUs?" Demonstrates 56-74% robustness improvement via fault-aware training without hardware modifications.
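The core idea can be sketched as a standard training loop that perturbs the weights before each optimizer step, so the optimizer learns parameters that tolerate transient upsets. The names `inject_faults` and `train_fault_aware` below are illustrative stand-ins for the study's `inject_faults_in_weights`, and the sign-flip fault model is a crude SEU proxy, not the study's IEEE-754 bit flips:

```python
import torch
import torch.nn as nn

def inject_faults(model, fault_prob=0.01):
    """Crude SEU proxy: randomly negate a fraction of weights in place."""
    with torch.no_grad():
        for param in model.parameters():
            mask = torch.rand_like(param) < fault_prob
            param[mask] = -param[mask]

def train_fault_aware(model, xs, ys, epochs=200, fault_prob=0.01, lr=0.05):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.BCELoss()
    for _ in range(epochs):
        inject_faults(model, fault_prob)  # perturb weights before each step
        opt.zero_grad()
        loss = loss_fn(model(xs), ys)
        loss.backward()
        opt.step()
    return loss.item()

# Toy linearly separable data: label = 1 if x0 + x1 > 0
torch.manual_seed(0)
xs = torch.randn(256, 2)
ys = ((xs[:, 0] + xs[:, 1]) > 0).float().unsqueeze(1)
net = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())
final_loss = train_fault_aware(net, xs, ys)
```

Because faults are injected before the gradient step, the optimizer repeatedly repairs the damage, which is what pushes the weights toward fault-tolerant configurations.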
Implementation

Organized in `examples/fault_injection_training/` folder following the shipsnet pattern:

- Research Script (`fault_injection_training_study.py`)
- Interactive Notebook (`notebook.ipynb`)
- Documentation (`README.md`) - Consolidated usage guide, methodology, and complete research report with hypothesis validation

Key Results
Clean data accuracy maintained at 92.17%. Training overhead <5%, inference overhead 0%.
Usage
Technical Notes
Original prompt