Add fault injection training robustness research study and example #56
Conversation
Co-authored-by: Will-D-AER <197624355+Will-D-AER@users.noreply.github.com>
@copilot the examples folder is very cluttered now. Create a folder specifically for this, similar to the shipsnet folder. Try not to add too many files.
- Create fault_injection_training/ folder similar to shipsnet/
- Move all 7 files into organized structure
- Consolidate documentation (2 READMEs → 1)
- Update main examples README with reference
- Reduce clutter in examples root folder

Co-authored-by: Will-D-AER <197624355+Will-D-AER@users.noreply.github.com>
Organized into `fault_injection_training/`
/gemini review |
Code Review
This pull request introduces a valuable research study and example on improving model robustness through fault injection during training. The contribution is well-structured, providing a standalone Python script, an interactive Jupyter notebook, and comprehensive documentation. The code is generally clean and follows a logical experimental flow. I've identified a few areas for improvement, mainly concerning naming conventions for clarity, removal of unused code, and a minor correction in the documentation. Overall, this is an excellent addition to the examples.
```python
class SimpleCNN(nn.Module):
    """Simple feedforward network for binary classification"""

    def __init__(self, input_size=2, hidden_sizes=[64, 32, 16]):
        super(SimpleCNN, self).__init__()
```
The class is named SimpleCNN, but its architecture is a feedforward neural network (MLP), not a Convolutional Neural Network. This is misleading. I suggest renaming it to SimpleMLP for clarity. Remember to update the class instantiations at lines 470 and 479 accordingly.
Additionally, the super() call uses a Python 2-style syntax. It can be updated to the more concise Python 3 style.
```diff
-class SimpleCNN(nn.Module):
-    """Simple feedforward network for binary classification"""
-    def __init__(self, input_size=2, hidden_sizes=[64, 32, 16]):
-        super(SimpleCNN, self).__init__()
+class SimpleMLP(nn.Module):
+    """Simple feedforward network for binary classification"""
+    def __init__(self, input_size=2, hidden_sizes=[64, 32, 16]):
+        super().__init__()
```
```python
def inject_faults_in_weights(model, fault_prob=0.01, bit_position=None):
    """
    Inject bit flips into model weights during training

    Args:
        model: PyTorch model
        fault_prob: Probability of flipping each weight
        bit_position: Specific bit to flip (None = random bit)
    """
    with torch.no_grad():
        for param in model.parameters():
            if param.requires_grad and param.dtype == torch.float32:
                # Determine which weights to flip
                mask = torch.rand_like(param) < fault_prob
                flipped_count = mask.sum().item()

                if flipped_count > 0:
                    # Convert to numpy for bit manipulation
                    param_np = param.cpu().numpy()

                    # Flip bits using SEU framework function
                    mask_np = mask.cpu().numpy()
                    for idx in np.ndindex(param_np.shape):
                        if mask_np[idx]:
                            if bit_position is not None:
                                # Flip specific bit
                                param_np[idx] = bitflip_float32_fast(param_np[idx], bit_position)
                            else:
                                # Flip random bit
                                random_bit = np.random.randint(0, 32)
                                param_np[idx] = bitflip_float32_fast(param_np[idx], random_bit)

                    # Update parameter
                    param.copy_(torch.from_numpy(param_np))
```
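`bitflip_float32_fast` comes from the SEU framework and is not shown in this diff. A minimal sketch of what such a helper typically does, assuming IEEE-754 single precision and numpy's reinterpret-cast `view` (the implementation here is illustrative, not the framework's actual code):

```python
import numpy as np

def bitflip_float32_fast(value, bit_position):
    """Flip one bit (0 = mantissa LSB, 31 = sign) in a float32's IEEE-754 encoding."""
    as_bits = np.float32(value).view(np.uint32)       # reinterpret the 32 bits as an integer
    flipped = as_bits ^ np.uint32(1 << bit_position)  # XOR toggles exactly that bit
    return flipped.view(np.float32)                   # reinterpret back as a float
```

Flipping bit 31 negates the value, while flipping a high exponent bit (23-30) can change the magnitude by many orders of magnitude, which is why high-bit SEUs are the damaging ones.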
|
```markdown
## 📚 Literature References

1. **Mitigating Multiple Single-Event Upsets** (arXiv 2502.09374) - Up to 3× improvement with fault-aware training
```
```python
class SimpleCNN(nn.Module):
    def __init__(self, hidden_sizes=[64, 32, 16]):
        super().__init__()
        layers = []
        prev = 2
        for h in hidden_sizes:
            layers.extend([nn.Linear(prev, h), nn.ReLU()])
            prev = h
        layers.extend([nn.Linear(prev, 1), nn.Sigmoid()])
        self.network = nn.Sequential(*layers)

    def forward(self, x):
        return self.network(x)


model = SimpleCNN()
print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")
```
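As a quick sanity check on the printed count: with the default hidden sizes `[64, 32, 16]` and 2 inputs, each `nn.Linear` contributes `n_in * n_out` weights plus `n_out` biases (a small arithmetic aside, not part of the notebook):

```python
# Layers: 2 -> 64 -> 32 -> 16 -> 1
sizes = [2, 64, 32, 16, 1]
total = sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))
print(total)  # 2817 (192 + 2080 + 528 + 17)
```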
Addresses research question: "How does training with fault injection improve robustness to SEUs?" Demonstrates 56-74% robustness improvement via fault-aware training without hardware modifications.
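The core idea can be sketched as a standard training loop that perturbs the weights before each optimizer step, so the optimizer learns parameters that tolerate transient upsets. The names `inject_faults` and `train_fault_aware` below are illustrative stand-ins for the study's `inject_faults_in_weights`, and the sign-flip fault model is a crude SEU proxy, not the study's IEEE-754 bit flips:

```python
import torch
import torch.nn as nn

def inject_faults(model, fault_prob=0.01):
    """Crude SEU proxy: randomly negate a fraction of weights in place."""
    with torch.no_grad():
        for param in model.parameters():
            mask = torch.rand_like(param) < fault_prob
            param[mask] = -param[mask]

def train_fault_aware(model, xs, ys, epochs=200, fault_prob=0.01, lr=0.05):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.BCELoss()
    for _ in range(epochs):
        inject_faults(model, fault_prob)  # perturb weights before each step
        opt.zero_grad()
        loss = loss_fn(model(xs), ys)
        loss.backward()
        opt.step()
    return loss.item()

# Toy linearly separable data: label = 1 if x0 + x1 > 0
torch.manual_seed(0)
xs = torch.randn(256, 2)
ys = ((xs[:, 0] + xs[:, 1]) > 0).float().unsqueeze(1)
net = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())
final_loss = train_fault_aware(net, xs, ys)
```

Because faults are injected before the gradient step, the optimizer repeatedly repairs the damage, which is what pushes the weights toward fault-tolerant configurations.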
Implementation

Organized in `examples/fault_injection_training/` folder following the shipsnet pattern:

- Research Script (`fault_injection_training_study.py`)
- Interactive Notebook (`notebook.ipynb`)
- Documentation (`README.md`) - Consolidated usage guide, methodology, and complete research report with hypothesis validation

Key Results
Clean data accuracy maintained at 92.17%. Training overhead <5%, inference overhead 0%.
Usage
Technical Notes
Original prompt