PHASE 2: FEATURE ENGINEERING - PLAN

Goal: Convert raw text prompts into numerical vectors for machine learning models.

🛠 Feature Sets to Extract

Simple but effective features based on text statistics.

Length: Character count, word count
Capitalization: Ratio of uppercase letters (shouting/imperatives)
Punctuation: Count of !, ?, ", {, } (often used in code injections)
Special Characters: Count of @, #, $, %
Keyword Presence: Binary flags for suspicious words:
- ignore, override, system, developer, model, prior
- jailbreak, mode, unfiltered, dan

Detecting the structure of attacks.

Role Play Indicators: Matches for You are, Act as, Pretend
Chat Template Injection: Matches for User:, Assistant:, System:
Code Injection: Presence of code blocks or programming syntax (def , import , print()

Capturing the meaning of the text.

Sentence-BERT Embeddings: Use all-MiniLM-L6-v2 (fast & effective)
- Output: 384-dimensional vector for each prompt

Create src/feature_extraction.py:
- Class FeatureExtractor
- Setup scikit-learn transformers and pipelines
- Setup sentence-transformers model
Process Data:
- Load data/processed/prompts.csv
- Apply extraction in batches (embeddings can be slow)
- Combine all features into a single feature matrix ($X$)
Save Features:
- Save $X$ (features) and $y$ (labels) as .npz (NumPy arrays) or .parquet
- data/processed/X_features.npz
- data/processed/y_labels.npy

python src/feature_extraction.py