Hybrid Feature-Based Detection of AI-Generated Text

The accompanying paper can be read here: project-report/NLP-AI-Detection-2025-final.pdf

This repository implements a project exploring linguistic and statistical features for distinguishing human- vs. AI-generated text. We evaluate individual features (e.g., sentence complexity, punctuation, lexical diversity, perplexity, burstiness) and hybrid combinations/ensembles using lightweight classifiers like logistic regression and XGBoost. Hybrids often outperform single methods in robustness and generalization, as per recent surveys (Wu et al., 2025; Su & Wu, 2024).

Research Question

Which linguistic or statistical features are most effective in distinguishing human- vs. AI-generated text?

Method

Extract features using tools like language models (perplexity), tokenizers (lexical diversity), and POS taggers (part-of-speech ratios).
Train classifiers on the extracted features; perform ablation studies and explore ensembles (e.g., stacking/voting).
Analyze feature importances.
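
As a rough illustration of the feature-extraction step, here is a minimal standard-library sketch. It is not the code in extract_linguistic_features.py or perplexity_calc.py; all function names are hypothetical. It assumes burstiness is approximated by the coefficient of variation of sentence lengths (definitions vary) and that per-token natural-log probabilities have already been obtained from a language model.

```python
import math
import re
import statistics


def split_sentences(text: str) -> list[str]:
    """Naive splitter on ., !, ? followed by whitespace (a real pipeline would use spaCy)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def type_token_ratio(text: str) -> float:
    """Lexical diversity: unique word types divided by total word tokens."""
    tokens = re.findall(r"[A-Za-z']+", text.lower())
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def punctuation_ratio(text: str) -> float:
    """Share of characters that are common punctuation marks."""
    if not text:
        return 0.0
    return sum(ch in ".,;:!?-()\"'" for ch in text) / len(text)


def sentence_length_burstiness(text: str) -> float:
    """Assumed proxy: population std of sentence lengths (in words) divided by their mean."""
    lengths = [len(s.split()) for s in split_sentences(text)]
    if len(lengths) < 2:
        return 0.0
    return statistics.pstdev(lengths) / statistics.mean(lengths)


def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity = exp(-mean natural-log probability per token)."""
    if not token_logprobs:
        raise ValueError("token_logprobs must not be empty")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def extract_features(text: str, token_logprobs: list[float]) -> dict[str, float]:
    """Bundle the individual features into one row for a classifier."""
    return {
        "type_token_ratio": type_token_ratio(text),
        "punctuation_ratio": punctuation_ratio(text),
        "burstiness": sentence_length_burstiness(text),
        "perplexity": perplexity(token_logprobs),
    }
```

Each returned dictionary corresponds to one feature row; a list of such rows can be turned into a feature matrix for logistic regression or XGBoost.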

Datasets

To generalize across sources, domains, and LLMs and to avoid overfitting, we use the following datasets:

AI Text Detection Pile (Hugging Face): ~1.3M samples, diverse texts (Reddit, Wikipedia, books) (Artem9k, 2023).
AH&AITD (Arslan's Human and AI Text Database): ~12k samples, texts across domains (articles, abstracts, stories, news, reviews) (Akram, 2023).
LLM - Detect AI Generated Text Dataset (Kaggle sunilthite): >28k essays (Thite, 2023).
LLM - Detect AI Generated Text (Kaggle Competition): ~10k essays (Kaggle, 2023).
DAIGT V2 Train Dataset (Kaggle): 48k samples, essays (Deotte, 2023).
Human ChatGPT Comparison Corpus (HC3): ~24k rows, Q&A from domains (finance, medicine, law, open) (Guo et al., 2023).

Evaluation

Classifiers are assessed via accuracy, F1-score, AUROC, and precision at fixed false-positive rates. Feature importance is measured via model coefficients, SHAP values (Lundberg & Lee, 2017), or permutation importance. Hybrids are benchmarked against baselines.
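
To make two of these metrics concrete, the following sketch computes AUROC and the operating point at a false-positive-rate budget in plain Python. It is an illustration, not the repository's evaluation code: the function names are hypothetical, label 1 is assumed to mean AI-generated, and the quadratic-time AUROC is only suitable for small inputs (scikit-learn would be the practical choice at scale).

```python
def auroc(labels: list[int], scores: list[float]) -> float:
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def metrics_at_fpr(labels: list[int], scores: list[float], max_fpr: float):
    """Pick the lowest score threshold whose FPR stays within max_fpr.

    Returns a dict with threshold, fpr, tpr, and precision, or None when
    even the strictest threshold exceeds the budget.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    best = None
    # Lowering the threshold can only add predictions, so FPR never decreases.
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= thr and y == 0)
        fpr = fp / n_neg
        if fpr > max_fpr:
            break
        best = {
            "threshold": thr,
            "fpr": fpr,
            "tpr": tp / n_pos,
            "precision": tp / (tp + fp),
        }
    return best
```

Reporting precision or recall at a low false-positive budget matters for detectors because wrongly flagging human text is usually the costlier error.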

Requirements

Conda (for environment management)
A Hugging Face account (to access the datasets)

conda create -n ai-text-detection python=3.12
conda activate ai-text-detection

conda install numpy pandas scikit-learn torch transformers
pip install -r requirements.txt

# then for CUDA 
pip install torch==2.7.1 torchvision==0.22.1 torchaudio==2.7.1 --index-url https://download.pytorch.org/whl/cu128

# if running on laptop without GPU, use CPU version
pip install torch==2.7.1 torchvision==0.22.1 torchaudio==2.7.1 --index-url https://download.pytorch.org/whl/cpu


# install spaCy models (the "en" shortcut is not supported in spaCy v3; use the full model name)
python -m spacy download en_core_web_sm

Usage

Log into Hugging Face, then download the datasets: python get_datasets.py.
Run feature extraction: python extract_linguistic_features.py --dataset <path> --output <path>. If no path is given, defaults are used.
Train/evaluate: python train.py --model xgboost --features all.
Analyze: python analyze_importance.py.
(Optional) Pre-calculate perplexity values: python preprocess_perplexity_sources.py --all. This requires a CUDA-enabled GPU but speeds up training later. Pre-calculated perplexity values are already included for most datasets (~172k samples).
Configure dataset_configuration.toml for dataset combination (e.g., enable sources, set total samples, portions), then combine datasets: python combine_dataset.py.
Configure configuration.toml for model settings (e.g., num_samples, enabled features, voting, feature params), then train and evaluate using: python main.py.
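
The idea of combining sources by total sample count and per-source portions can be sketched as follows. This is a hypothetical illustration, not the logic of combine_dataset.py: the function name, the source names, and the largest-remainder rounding are all assumptions, and the real schema of dataset_configuration.toml is not shown here.

```python
def allocate_samples(total: int, portions: dict[str, float]) -> dict[str, int]:
    """Split `total` samples across sources in proportion to their weights.

    Uses largest-remainder rounding so the counts always sum to `total`.
    """
    if total < 0:
        raise ValueError("total must be non-negative")
    weight_sum = sum(portions.values())
    if weight_sum <= 0:
        raise ValueError("at least one source needs a positive portion")
    exact = {name: total * w / weight_sum for name, w in portions.items()}
    counts = {name: int(value) for name, value in exact.items()}
    leftover = total - sum(counts.values())
    # Hand the remaining samples to the sources with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda n: exact[n] - counts[n], reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts


# Hypothetical source names and weights, for illustration only.
plan = allocate_samples(10, {"hc3": 5, "daigt_v2": 3, "ai_text_pile": 2})
```

A disabled source can be modeled by leaving it out of `portions` or giving it a weight of zero.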

BERT Model Training and Loading

Training and Saving a Model

To train a BERT model (requires CUDA to train efficiently) and save it automatically, ensure these settings in configuration.toml:

[features.bert_classifier]
use_pretrained = false      # Train from scratch
save_after_training = true  # Save after training
model_path = "models/bert_classifier.pt"

Run training:

python main.py

The trained model will then be saved to models/bert_classifier.pt.

Loading a Pre-trained Model

To load a previously trained model instead of training from scratch:

[features.bert_classifier]
use_pretrained = true       # Load existing model
model_path = "models/bert_classifier.pt"

Run inference:

python main.py

Using Pre-trained Models with Git LFS

Model files are automatically tracked using Git LFS (Large File Storage). To share models:

First-time setup (already done for this repo):
```
git lfs install
```
Migrate existing models to LFS (one-time, if you had models before LFS was set up):
```
git add models/*.pt
git commit -m "Migrate model files to Git LFS"
```
Note: When you add .gitattributes to track .pt files with LFS, Git will convert existing model files from regular Git storage to LFS pointers. This is a one-time migration and will show the files as modified.

Push a trained model:

git add models/bert_classifier.pt
git commit -m "Add trained BERT model"
git push

Pull a model from the repository:
```
git lfs pull
```

Model files (*.pt, *.pth) are automatically managed by Git LFS, so teammates can easily access pre-trained models without re-training.

Contributors

Finn Fonteijn
Linn Gregussen

