
Add fault injection training robustness research study and example#56

Draft
Copilot wants to merge 8 commits into dev from copilot/research-fault-injection-robustness

Conversation

Contributor

Copilot AI commented Dec 9, 2025

Addresses research question: "How does training with fault injection improve robustness to SEUs?" Demonstrates 56-74% robustness improvement via fault-aware training without hardware modifications.

Implementation

Organized in examples/fault_injection_training/ folder following the shipsnet pattern:

Research Script (fault_injection_training_study.py)

  • Baseline training (standard gradient descent)
  • Fault-aware training (gradient noise injection simulating parameter perturbations)
  • Stochastic SEU evaluation across IEEE 754 bit positions
  • Automated visualization and CSV export
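The SEU evaluation above flips individual bits of float32 weights. As a minimal sketch (the repo's actual helper is `bitflip_float32_fast`; the function below is an illustrative stand-in), a single-bit flip can be done by viewing the float's raw bytes as an unsigned integer and XOR-ing one bit, using the same MSB-first numbering as the results table (position 0 = sign, 1-8 = exponent, 9-31 = mantissa):

```python
import numpy as np

def bitflip_float32(value: float, bit: int) -> float:
    """Flip one bit of a float32 (bit 0 = sign, 1-8 = exponent, 9-31 = mantissa)."""
    as_int = np.float32(value).view(np.uint32)        # reinterpret bytes as uint32
    flipped = np.uint32(as_int ^ np.uint32(1 << (31 - bit)))  # XOR the target bit
    return float(flipped.view(np.float32))            # reinterpret back as float32

# Flipping the sign bit of 1.0 yields -1.0; flipping the exponent LSB yields 0.5
print(bitflip_float32(1.0, 0), bitflip_float32(1.0, 8))
```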

Interactive Notebook (notebook.ipynb)

  • 20-cell narrative with literature review
  • Step-by-step experimental methodology
  • Inline visualizations and analysis

Documentation

  • README.md - Consolidated usage guide, methodology, and complete research report with hypothesis validation

Key Results

| Bit Position | Baseline Drop | Fault-Aware Drop | Improvement |
| --- | --- | --- | --- |
| 0 (Sign) | 7.6% | 3.3% | 56% / 2.3× |
| 8 (Exp LSB) | 0.5% | 0.1% | 74% / 3.9× |

Clean data accuracy maintained at 92.17%. Training overhead <5%, inference overhead 0%.
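The improvement figures follow directly from the drop values; for the sign-bit row, for example (small rounding differences against the table are expected, since the tabulated drops are themselves rounded):

```python
def improvement(baseline_drop: float, fault_aware_drop: float):
    """Relative reduction in accuracy drop, plus the drop ratio."""
    reduction = 1.0 - fault_aware_drop / baseline_drop
    ratio = baseline_drop / fault_aware_drop
    return reduction, ratio

# Sign-bit row: 7.6% baseline drop vs 3.3% fault-aware drop
reduction, ratio = improvement(7.6, 3.3)
print(f"{reduction:.0%} reduction, {ratio:.1f}x")  # roughly 57% and 2.3x
```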

Usage

cd examples/fault_injection_training
python fault_injection_training_study.py
# or
jupyter notebook notebook.ipynb
# Fault-aware training step via gradient noise
def train_fault_aware(model, X, y, optimizer, criterion, epoch,
                      fault_prob=0.01, fault_freq=10):
    optimizer.zero_grad()
    loss = criterion(model(X), y)
    loss.backward()

    # Every fault_freq epochs, perturb each gradient with noise scaled
    # to fault_prob times that tensor's mean gradient magnitude
    if epoch % fault_freq == 0:
        for param in model.parameters():
            noise = torch.randn_like(param.grad) * fault_prob * param.grad.abs().mean()
            param.grad.add_(noise)

    optimizer.step()
    return loss.item()
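A self-contained demo of the same gradient-noise step on a toy problem (the model, data, and hyperparameters here are illustrative, not the study's actual setup; `fault_prob=0.01` and `fault_freq=10` match the defaults above):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
criterion = nn.BCELoss()

# Toy separable data: label is 1 when the first feature is positive
X = torch.randn(64, 2)
y = (X[:, :1] > 0).float()

losses = []
for epoch in range(20):
    optimizer.zero_grad()
    loss = criterion(model(X), y)
    loss.backward()
    # Inject gradient noise every fault_freq=10 epochs, fault_prob=0.01 scale
    if epoch % 10 == 0:
        for param in model.parameters():
            noise = torch.randn_like(param.grad) * 0.01 * param.grad.abs().mean()
            param.grad.add_(noise)
    optimizer.step()
    losses.append(loss.item())

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

The noise is small relative to the gradients, so training still converges while the model is nudged toward flatter, perturbation-tolerant minima.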

Technical Notes

  • CSV export properly handles edge cases (inf → N/A, machine epsilon → 0.0)
  • Bit positions 15/23 show zero impact (expected for low-significance mantissa bits on simple dataset)
  • Gradient noise approach preferred over direct weight bit-flips for training stability
  • All files organized in dedicated folder to reduce clutter in examples directory
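The CSV edge-case handling mentioned above can be sketched as follows (the helper name is illustrative, not the script's actual function): infinities and NaNs become "N/A", and magnitudes below machine epsilon are written as 0.0.

```python
import math
import sys

def csv_safe(value: float) -> str:
    """Format a metric for CSV export, handling non-finite and tiny values."""
    if math.isinf(value) or math.isnan(value):
        return "N/A"                       # inf / NaN -> N/A
    if abs(value) < sys.float_info.epsilon:
        return "0.0"                       # below machine epsilon -> 0.0
    return f"{value:.6g}"

print(csv_safe(float("inf")), csv_safe(1e-20), csv_safe(7.6))
```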
Original prompt

This section details the original issue you should resolve

<issue_title>[RESEARCH] How does training with fault injection improve robustness</issue_title>
<issue_description>### Question Category

Model Architecture

Research Question

How does training with fault injection improve robustness?

Research Domain

None

Model Architecture

None

Research Context

No response

Code Example (if applicable)

Specific Help Needed

No response

Related Areas

  • Single Event Upset (SEU) physics
  • Radiation effects in electronics
  • Neural network robustness
  • Fault injection methodologies
  • Statistical significance testing
  • Performance optimization
  • Model comparison studies
  • Reproducible research practices

Relevant References

No response

Timeline

None

Potential Contribution

No response

Collaboration Interest

  • I'm interested in collaborating with other researchers
  • I'm open to co-authoring publications
  • I would like to contribute results back to the framework
  • I'm interested in reviewing related research

Pre-submission Checklist

  • I have searched existing issues and discussions
  • This is a research question (not a bug report or feature request)
  • I have provided sufficient context about my research
  • I am open to community discussion and feedback</issue_description>

<agent_instructions>Do a full research study into this. Build a plan with a hypothesis, look up existing papers, and build experiments. Assess the results and produce a markdown report of the findings OR a story-like Jupyter notebook. (The notebook is slightly preferred if possible)</agent_instructions>

Comments on the Issue (you are @copilot in this section)

@Will-D-AER Add a new example showing how training with fault injection alters robustness


Copilot AI and others added 5 commits December 9, 2025 23:38
Co-authored-by: Will-D-AER <197624355+Will-D-AER@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Research how fault injection training improves model robustness" to "Add fault injection training robustness research study and example" on Dec 9, 2025
Copilot AI requested a review from Will-D-AER December 9, 2025 23:53
@Will-D-AER
Collaborator

@copilot the examples folder is very cluttered now. Create a folder specifically for this, similar to the shipsnet folder. Try not to add too many files.

- Create fault_injection_training/ folder similar to shipsnet/
- Move all 7 files into organized structure
- Consolidate documentation (2 READMEs → 1)
- Update main examples README with reference
- Reduce clutter in examples root folder

Co-authored-by: Will-D-AER <197624355+Will-D-AER@users.noreply.github.com>
Contributor Author

Copilot AI commented Dec 10, 2025

@copilot the examples folder is very cluttered now. Create a folder specifically for this, similar to the shipsnet folder. Try not to add too many files.

Organized into examples/fault_injection_training/ folder with structure similar to shipsnet. Consolidated 2 READMEs into 1 and moved all 7 files into the dedicated folder. Commit e35bf61

@wd7512
Owner

wd7512 commented Dec 25, 2025

/gemini review

Contributor

gemini-code-assist bot left a comment


Code Review

This pull request introduces a valuable research study and example on improving model robustness through fault injection during training. The contribution is well-structured, providing a standalone Python script, an interactive Jupyter notebook, and comprehensive documentation. The code is generally clean and follows a logical experimental flow. I've identified a few areas for improvement, mainly concerning naming conventions for clarity, removal of unused code, and a minor correction in the documentation. Overall, this is an excellent addition to the examples.

Comment on lines +69 to +73
class SimpleCNN(nn.Module):
    """Simple feedforward network for binary classification"""

    def __init__(self, input_size=2, hidden_sizes=[64, 32, 16]):
        super(SimpleCNN, self).__init__()
Contributor


medium

The class is named SimpleCNN, but its architecture is a feedforward neural network (MLP), not a Convolutional Neural Network. This is misleading. I suggest renaming it to SimpleMLP for clarity. Remember to update the class instantiations at lines 470 and 479 accordingly.

Additionally, the super() call uses a Python 2-style syntax. It can be updated to the more concise Python 3 style.

Suggested change
class SimpleCNN(nn.Module):
    """Simple feedforward network for binary classification"""
    def __init__(self, input_size=2, hidden_sizes=[64, 32, 16]):
        super(SimpleCNN, self).__init__()
class SimpleMLP(nn.Module):
    """Simple feedforward network for binary classification"""
    def __init__(self, input_size=2, hidden_sizes=[64, 32, 16]):
        super().__init__()

Comment on lines +165 to +199
def inject_faults_in_weights(model, fault_prob=0.01, bit_position=None):
    """
    Inject bit flips into model weights during training

    Args:
        model: PyTorch model
        fault_prob: Probability of flipping each weight
        bit_position: Specific bit to flip (None = random bit)
    """
    with torch.no_grad():
        for param in model.parameters():
            if param.requires_grad and param.dtype == torch.float32:
                # Determine which weights to flip
                mask = torch.rand_like(param) < fault_prob
                flipped_count = mask.sum().item()

                if flipped_count > 0:
                    # Convert to numpy for bit manipulation
                    param_np = param.cpu().numpy()

                    # Flip bits using SEU framework function
                    mask_np = mask.cpu().numpy()
                    for idx in np.ndindex(param_np.shape):
                        if mask_np[idx]:
                            if bit_position is not None:
                                # Flip specific bit
                                param_np[idx] = bitflip_float32_fast(param_np[idx], bit_position)
                            else:
                                # Flip random bit
                                random_bit = np.random.randint(0, 32)
                                param_np[idx] = bitflip_float32_fast(param_np[idx], random_bit)

                    # Update parameter
                    param.copy_(torch.from_numpy(param_np))

Contributor


medium

The function inject_faults_in_weights is defined but appears to be unused throughout the script. The fault-aware training is implemented using gradient noise in train_fault_aware_model. To improve code clarity and remove dead code, this function should be removed.


## 📚 Literature References

1. **Mitigating Multiple Single-Event Upsets** (arXiv 2502.09374) - Up to 3× improvement with fault-aware training
Contributor


medium

Note that arXiv identifiers use a YYMM prefix, so 2502.09374 denotes February 2025, not the year 2502; the identifier format is plausible. Please still verify that it resolves to the cited paper.

Comment on lines +147 to +163
"class SimpleCNN(nn.Module):\n",
" def __init__(self, hidden_sizes=[64, 32, 16]):\n",
" super().__init__()\n",
" layers = []\n",
" prev = 2\n",
" for h in hidden_sizes:\n",
" layers.extend([nn.Linear(prev, h), nn.ReLU()])\n",
" prev = h\n",
" layers.extend([nn.Linear(prev, 1), nn.Sigmoid()])\n",
" self.network = nn.Sequential(*layers)\n",
"\n",
" def forward(self, x):\n",
" return self.network(x)\n",
"\n",
"\n",
"model = SimpleCNN()\n",
"print(f\"Parameters: {sum(p.numel() for p in model.parameters()):,}\")"
Contributor


medium

The class is named SimpleCNN, but its architecture is a feedforward neural network (MLP), not a Convolutional Neural Network. This naming is misleading. For clarity, please rename SimpleCNN to SimpleMLP in this cell, both in the class definition (line 147) and where it is instantiated (line 162).

