This repository contains scripts to process audio datasets, transcribe them using the Whisper model, fine-tune a Gemma-3-4B model for grammar scoring, and evaluate its performance on transcribed text. The project leverages modern machine learning libraries and tools like Unsloth, HuggingFace Transformers, and Whisper for efficient fine-tuning and speech-to-text transcription.
- Python 3.8 or higher
- A compatible GPU (recommended for faster processing; CUDA support required for GPU acceleration)
- Git installed to clone the repository
- Access to audio datasets for training and testing (not included in this repo)
To get started, clone this repository and install the required dependencies:

```bash
git clone https://github.com/Subhanshusethi/GrammarScoringEngine.git
cd GrammarScoringEngine
pip install -r requirements.txt
```

The `requirements.txt` file includes all necessary libraries, such as `torch`, `transformers`, `unsloth`, and others for model fine-tuning, audio transcription, and evaluation.
This repository contains the following key files:
- `requirements.txt` - Lists all Python dependencies required for the project. Install with `pip install -r requirements.txt`.
- `extract_text.py` - Transcribes audio files using OpenAI's Whisper model (`whisper-large-v3-turbo`) and prepares datasets for training and testing. Outputs:
  - Training data in JSON format (ShareGPT-style with system, user, and assistant roles).
  - Test data transcriptions in CSV format.
- `finetune_G_eval.py` - Fine-tunes a Gemma-3-4B model (quantized to 4-bit) on grammar scoring tasks and evaluates its performance. Performs:
  - Model fine-tuning using LoRA (Low-Rank Adaptation) with the Unsloth library.
  - Grammar score prediction on a test set.
  - Evaluation metrics (MAE, RMSE, and accuracy) on a held-out test split.
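For reference, one training record in the ShareGPT-style JSON produced by `extract_text.py` might look like the sketch below. The exact system prompt wording and the top-level key name are assumptions, not copied from the script; only the three-role structure is stated in this README.

```python
import json

# Hypothetical ShareGPT-style training record: a conversation with system,
# user, and assistant roles, where the assistant turn carries the grammar
# score. Prompt text and the "conversations" key name are assumed.
record = {
    "conversations": [
        {"role": "system", "content": "You are a grammar scoring assistant."},
        {"role": "user", "content": "Score the grammar of: 'She go to school yesterday.'"},
        {"role": "assistant", "content": "2.5"},
    ]
}

print(json.dumps(record, indent=2))
```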
The `extract_text.py` script processes audio files to generate transcribed text for training and testing.
```bash
python extract_text.py --train_csv_path <train_csv> --test_csv_path <test_csv> --train_audio_dir <train_dir> --test_audio_dir <test_dir>
```

- `--train_csv_path`: Path to a CSV file with columns `filename` (audio file names) and `label` (grammar scores).
- `--test_csv_path`: Path to a CSV file with a `filename` column (no labels required).
- `--train_audio_dir`: Directory containing training audio files.
- `--test_audio_dir`: Directory containing test audio files.
- `--output_train_json` (optional): Output path for the training JSON (default: `grammar_score_training_data_with_system.json`).
- `--output_test_csv` (optional): Output path for the test CSV (default: `transcribed_test_set.csv`).
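The flags above can be mirrored with a standard `argparse` setup. This is a sketch of what the script's CLI presumably looks like, with the same required flags and defaults listed above; the actual implementation may differ.

```python
import argparse

def build_parser():
    # Mirrors the documented extract_text.py flags; defaults match the README.
    p = argparse.ArgumentParser(description="Transcribe audio and build datasets")
    p.add_argument("--train_csv_path", required=True)
    p.add_argument("--test_csv_path", required=True)
    p.add_argument("--train_audio_dir", required=True)
    p.add_argument("--test_audio_dir", required=True)
    p.add_argument("--output_train_json",
                   default="grammar_score_training_data_with_system.json")
    p.add_argument("--output_test_csv", default="transcribed_test_set.csv")
    return p

# Example invocation matching the usage shown in this README.
args = build_parser().parse_args([
    "--train_csv_path", "data/train.csv",
    "--test_csv_path", "data/test.csv",
    "--train_audio_dir", "audio/train",
    "--test_audio_dir", "audio/test",
])
print(args.output_train_json)
```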
```bash
python extract_text.py --train_csv_path data/train.csv --test_csv_path data/test.csv --train_audio_dir audio/train --test_audio_dir audio/test
```

- Training data saved as a JSON file with system prompts and grammar scores.
- Test data saved as a CSV file with filenames and transcribed text.
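The test CSV can be written with the standard library alone; a minimal sketch is below. The `transcription` column name is an assumption (the README only guarantees a `filename` column), so check the script for the exact header.

```python
import csv
import io

# Hypothetical rows: (audio filename, Whisper transcription).
rows = [
    ("clip_001.wav", "She goes to school every day."),
    ("clip_002.wav", "He don't like apples."),
]

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["filename", "transcription"])  # header names are assumed
writer.writerows(rows)
print(buf.getvalue())
```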
The `finetune_G_eval.py` script fine-tunes a Gemma-3-4B model on the transcribed training data and evaluates its grammar scoring performance.
```bash
python finetune_G_eval.py --training_json_path <train_json> --input_csv_path <test_csv>
```

- `--training_json_path`: Path to the training JSON file generated by `extract_text.py`.
- `--input_csv_path`: Path to the test CSV file with transcribed text (from `extract_text.py`).
- `--output_csv_path` (optional): Output path for the scored test CSV (default: `grammar_scored_test_set.csv`).
- `--eval_model` (optional): Boolean flag to enable or disable evaluation (default: `True`).
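At inference time the model returns free text, so the script presumably extracts a numeric score from each generated reply before writing the scored CSV. The helper below is an illustrative heuristic for that step, not the script's actual parser.

```python
import re
from typing import Optional

def extract_score(generated: str) -> Optional[float]:
    # Pull the first number (integer or decimal) out of the model's reply.
    # Returns None when the reply contains no number at all.
    m = re.search(r"\d+(?:\.\d+)?", generated)
    return float(m.group()) if m else None

print(extract_score("The grammar score is 3.5 out of 5."))  # 3.5
```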
```bash
python finetune_G_eval.py --training_json_path grammar_score_training_data_with_system.json --input_csv_path transcribed_test_set.csv
```

- Fine-tuned model (saved implicitly by the trainer; modify `SFTConfig` in the script to save explicitly if needed).
- Test set with predicted grammar scores saved as a CSV file.
- Evaluation metrics (MAE, RMSE, accuracy) printed for the held-out test split.
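The reported metrics can be computed directly from predicted and true scores. The sketch below uses exact match after rounding as the accuracy criterion, which is one plausible definition; the script may define accuracy differently.

```python
import math

def evaluate(preds, labels):
    # MAE, RMSE, and exact-match accuracy after rounding. The accuracy
    # definition is an assumption about how the script counts matches.
    errors = [p - y for p, y in zip(preds, labels)]
    mae = sum(abs(e) for e in errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    acc = sum(round(p) == round(y) for p, y in zip(preds, labels)) / len(preds)
    return mae, rmse, acc

mae, rmse, acc = evaluate([3.0, 2.5, 4.0], [3.0, 3.0, 5.0])
print(mae, rmse, acc)
```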
Below is a table comparing the performance of different models on the Kaggle competition.
| Model | Parameters | Kaggle Scores |
|---|---|---|
| LLaMA 3.2 3B | 3B | 0.766 |
| LLaMA 3.2 1B | 1B | 0.71 |
| Gemma 4B | 4B | 0.802 |
| Gemma 1B | 1B | 0.782 |
- The Gemma 4B model (fine-tuned in this repo) is optimized for grammar scoring with LoRA and 4-bit quantization.
- Hardware Used: Fine-tuning and inference were performed on an NVIDIA L4 GPU with 12GB VRAM. This setup balances memory and compute for the Gemma-3-4B model quantized to 4-bit with LoRA, and handles the specified batch size (`per_device_train_batch_size=4`) and gradient accumulation steps (`gradient_accumulation_steps=4`) efficiently.
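Those two settings combine into the effective batch size per optimizer step; a one-liner makes the arithmetic explicit:

```python
per_device_train_batch_size = 4   # from the SFT configuration described above
gradient_accumulation_steps = 4

# Gradients from 4 micro-batches are accumulated before each optimizer
# step, so each update effectively sees 4 * 4 = 16 examples per device.
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps
print(effective_batch_size)  # 16
```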
- Dataset: You must provide your own audio datasets and corresponding CSV files. The scripts assume specific column names (`filename`, `label`).
- Model Downloads: The pre-trained `unsloth/gemma-3-4b-it-unsloth-bnb-4bit` and `openai/whisper-large-v3-turbo` models are downloaded from HuggingFace during execution, requiring an active internet connection.
