Goal: Convert raw text prompts into numerical vectors for machine learning models.
Simple but effective features based on text statistics.
- Length: Character count, word count
- Capitalization: Ratio of uppercase letters (shouting/imperatives)
- Punctuation: Count of
!,?,",{,}(often used in code injections) - Special Characters: Count of
@,#,$,% - Keyword Presence: Binary flags for suspicious words:
ignore,override,system,developer,model,priorjailbreak,mode,unfiltered,dan
Detecting the structure of attacks.
- Role Play Indicators: Matches for
You are,Act as,Pretend - Chat Template Injection: Matches for
User:,Assistant:,System: - Code Injection: Presence of code blocks or programming syntax (
def,import,print()
Capturing the meaning of the text.
- Sentence-BERT Embeddings: Use
all-MiniLM-L6-v2(fast & effective)- Output: 384-dimensional vector for each prompt
-
Create
src/feature_extraction.py:- Class
FeatureExtractor - Setup
scikit-learntransformers and pipelines - Setup
sentence-transformersmodel
- Class
-
Process Data:
- Load
data/processed/prompts.csv - Apply extraction in batches (embeddings can be slow)
- Combine all features into a single feature matrix (
$X$ )
- Load
-
Save Features:
- Save
$X$ (features) and$y$ (labels) as.npz(NumPy arrays) or.parquet data/processed/X_features.npzdata/processed/y_labels.npy
- Save
- Lexical/Structural: < 1 minute
- Embeddings: ~15-30 minutes (on CPU) for 90k samples
python src/feature_extraction.py