Skip to content

Latest commit

Β 

History

History
95 lines (72 loc) Β· 2.43 KB

File metadata and controls

95 lines (72 loc) Β· 2.43 KB

AI Prompt Security Detection System

A machine learning system to detect Prompt Injection and Jailbreak attacks in Large Language Models (LLMs).

🎯 Project Goal

Build a preamble security classifier that acts as a guardrail before an LLM.

  • Input: User prompt
  • Output: ALLOW (Benign) or BLOCK (Malicious)

Labels:

  • 0 = Benign
  • 1 = Prompt Injection
  • 2 = Jailbreak

πŸš€ How to Run (Fresh Install)

Since dataset files and models are large, they are not stored in the repository. You must regenerate them using the provided scripts.

1. Setup Environment

# Clone the repository
git clone https://github.com/Harshith1004/ai_prompt_security.git
cd ai_prompt_security

# Run setup script (Mac/Linux)
chmod +x setup.sh
./setup.sh
source venv/bin/activate

2. Build Pipeline (Regenerate Data & Models)

Run these commands in order to create the models from scratch:

# 1. Download Datasets (Alpaca, TruthfulQA, Jailbreaks)
python src/download_datasets.py

# 2. Clean and Label Data
python src/data_cleaning.py

# 3. Extract Features (Lexical + S-BERT Embeddings)
# ⚠️ This takes ~2-5 minutes
python src/feature_extraction.py

# 4. Train Models (Logistic Regression, Random Forest, MLP)
python src/train_models.py

3. Run the Guardrail Demo

Once models are trained, you can use the interactive scanner:

python src/guardrail.py

πŸ“‚ Project Structure

ai_prompt_security/
β”œβ”€β”€ data/             # (Generated locally)
β”œβ”€β”€ models/           # (Generated locally)
β”œβ”€β”€ notebooks/        # Experiments
β”œβ”€β”€ results/          # Evaluation metrics & plots
β”œβ”€β”€ src/              # Source code
β”‚   β”œβ”€β”€ download_datasets.py   # Step 1
β”‚   β”œβ”€β”€ data_cleaning.py       # Step 2
β”‚   β”œβ”€β”€ feature_extraction.py  # Step 3
β”‚   β”œβ”€β”€ train_models.py        # Step 4
β”‚   β”œβ”€β”€ evaluate_models.py     # Step 5
β”‚   └── guardrail.py           # Demo App
└── requirements.txt

πŸ“Š Performance

Model Accuracy ROC-AUC
DistilBERT Head (MLP) 97.9% 0.99
Logistic Regression 96.2% 0.98
Random Forest 96.0% 0.98

πŸ“ Source Datasets

Malicious:

  • deepset/prompt-injections
  • Anthropic/hh-rlhf (Red Team)
  • rubend18/ChatGPT-Jailbreak-Prompts

Benign:

  • tatsu-lab/alpaca
  • truthful_qa

πŸ“ License

Research project for educational purposes.