This repository contains Python scripts for extracting and analysing apprenticeship data from UK Department for Education (DfE) statistical releases.
Latest: Intelligent file discovery now automatically selects the most recent data files based on academic year and quarter/month patterns. See FILE_DISCOVERY.md for details.
Refactored: Codebase refactored for improved maintainability, reduced duplication, and better code quality. See REFACTORING.md for details.
Extracts Software Developer (Level 4) apprenticeship vacancy data from DfE vacancy CSV files and presents it in various formats suitable for analysis.
Features:
- Automatic file discovery: Finds and uses the most recent vacancy data file
- Filters vacancy data specifically for Software Developer apprenticeships
- Groups data by training provider and employer
- Provides multiple output formats (table, CSV, Markdown, TSV)
- Clean company name processing (removes legal suffixes like "Ltd", "PLC")
- Separates London vs other UK locations
- Aggregates small providers for better data presentation
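
The company name cleaning mentioned in the feature list can be illustrated with a minimal sketch. The suffix list and the function name `clean_company_name` are assumptions made for this example; the project's actual helper lives in utils.py and may behave differently.

```python
import re

# Hypothetical suffix list for illustration; not taken from the project's config.
LEGAL_SUFFIXES = ("limited", "ltd", "plc", "llp")

# Match a suffix only at the end of the name, preceded by whitespace or a comma.
_SUFFIX_RE = re.compile(
    r"[\s,]+(?:" + "|".join(LEGAL_SUFFIXES) + r")\.?\s*$",
    re.IGNORECASE,
)


def clean_company_name(name: str) -> str:
    """Strip trailing legal suffixes such as "Ltd" or "PLC" from a name."""
    cleaned = name.strip()
    # Repeat until nothing changes so that stacked suffixes are all removed.
    while True:
        stripped = _SUFFIX_RE.sub("", cleaned)
        if stripped == cleaned:
            return cleaned
        cleaned = stripped
```

Anchoring the pattern to the end of the string leaves names such as "Limited Edition Games" untouched, because the suffix word is only removed when it is the final token.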
Usage:
# Automatic discovery (uses most recent file)
python3 vacancies.py [options]
# Specify a file explicitly
python3 vacancies.py [options] [input_file]
Options:
- --csv, -c: Output in CSV format (suitable for importing into databases)
- --table: Output in table format (console-friendly aligned tables)
- --tsv, -t: Output in tab-separated format (for copy-paste into spreadsheets)
- --help, -h: Show help message
Default behaviour: Markdown table format using the most recent vacancy file
Output Format: Two tables showing:
- Providers Table: Training providers with employer count and total vacancies
- Employers Table: Detailed breakdown with employer, provider, location, and positions
The script groups providers into three tiers by number of apprenticeships:
- More than 10: detailed breakdown
- 4 to 10: summary
- 3 or fewer: aggregated into a single total
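
The three provider tiers can be sketched as a simple bucketing function. This is an illustration under assumptions: the function name `categorise_providers`, the tier labels, and the local constants are hypothetical stand-ins rather than the code in vacancies.py.

```python
from typing import Dict

# Illustrative constants mirroring the tiers described above.
LARGE_THRESHOLD = 10  # more than this many apprenticeships: detailed breakdown
SMALL_MAX = 3         # this many or fewer: folded into one aggregated total


def categorise_providers(totals: Dict[str, int]) -> Dict[str, Dict[str, int]]:
    """Bucket providers into detailed, summary, and aggregated tiers."""
    tiers = {"detailed": {}, "summary": {}, "aggregated": {}}  # type: Dict[str, Dict[str, int]]
    for provider, count in totals.items():
        if count > LARGE_THRESHOLD:
            tier = "detailed"
        elif count > SMALL_MAX:
            tier = "summary"
        else:
            tier = "aggregated"
        tiers[tier][provider] = count
    return tiers
```

Note that a provider with exactly 10 apprenticeships falls into the summary tier, since the detailed tier requires more than 10.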
Examples:
python3 vacancies.py # Markdown format, latest file
python3 vacancies.py --table # Console table format
python3 vacancies.py --csv # CSV format for import
python3 vacancies.py data/file.csv # Use specific file
Extracts apprenticeship starts data for a specific standard and presents it as a league table with years as columns and providers as rows.
Features:
- Automatic file discovery: Finds and uses the most recent starts data file
- Quarterly breakdown: Most recent year is broken down into Q1, Q2, Q3, Q4 columns
- Filters data for any apprenticeship standard code (defaults to ST0116)
- Creates year-over-year comparison tables
- Shows providers with 3+ starts in most recent year separately
- Includes total row showing all starts across providers
- Automatically extracts from zip files if needed
Usage:
# Automatic discovery (uses most recent file)
python3 starts.py [options] [standard_code]
# Specify a file explicitly
python3 starts.py [options] [standard_code] [input_file]
Options:
- --csv, -c: Output in CSV format
- --table: Output in console table format
- --tsv, -t: Output in tab-separated format
- --help, -h: Show help message
Default Standard: ST0116 (Software Developer)
Output Format: League table showing:
- Total row: Combined starts across all providers by year and quarter
- Major providers: Providers with 3+ total starts in most recent year
- All other providers: Aggregated smaller providers
- Most recent year: Broken down into Q1, Q2, Q3, Q4 columns for detailed analysis
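
The league table layout can be sketched roughly as follows. This is a simplified illustration that omits the quarterly breakdown; the function name `build_league_table` and the input shape (provider mapped to starts per year) are assumptions, not the structure used in starts.py.

```python
from typing import Dict, List, Tuple

# Providers with at least this many starts in the latest year get their own row.
MIN_STARTS = 3


def build_league_table(
    starts: Dict[str, Dict[str, int]], years: List[str]
) -> List[Tuple[str, List[int]]]:
    """Return rows of (label, starts per year): total, major providers, others."""
    latest = years[-1]
    total = [0] * len(years)
    other = [0] * len(years)
    major = []  # type: List[Tuple[str, List[int]]]
    for provider, by_year in starts.items():
        row = [by_year.get(year, 0) for year in years]
        total = [t + v for t, v in zip(total, row)]
        if by_year.get(latest, 0) >= MIN_STARTS:
            major.append((provider, row))
        else:
            # Smaller providers are folded into a single aggregate row.
            other = [o + v for o, v in zip(other, row)]
    # Rank major providers by starts in the most recent year, highest first.
    major.sort(key=lambda item: item[1][-1], reverse=True)
    return [("Total", total)] + major + [("All other providers", other)]
```

Because the total row sums every provider, it always equals the major rows plus the "All other providers" row, which gives a quick sanity check on the output.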
Examples:
python3 starts.py # ST0116 (Software Developer), latest file
python3 starts.py ST0113 # ST0113, latest file
python3 starts.py ST0116 data.csv # ST0116, specific file
python3 starts.py --table ST0116 # Console table format
python3 starts.py --csv ST0113 # CSV output
Extracts monthly apprenticeship starts data for a specific standard and presents it as a table with years as columns and months as rows (in academic year order: Aug-Jul).
Features:
- Automatic file discovery: Finds and uses the most recent monthly starts file
- Filters data for any apprenticeship standard code (defaults to ST0116)
- Creates month-by-month comparison across years
- Displays months in academic year order (August to July)
- Includes total row showing annual totals
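
The academic year ordering can be sketched with a fixed month list. The function name `monthly_table` and the input shape (year mapped to starts per month) are assumptions for illustration and are not taken from monthly.py.

```python
from typing import Dict, List

# Months in academic year order, August to July.
ACADEMIC_MONTHS = [
    "Aug", "Sep", "Oct", "Nov", "Dec", "Jan",
    "Feb", "Mar", "Apr", "May", "Jun", "Jul",
]


def monthly_table(starts: Dict[str, Dict[str, int]], years: List[str]) -> List[list]:
    """Build rows of [month, starts per year...] followed by a total row."""
    rows = []
    for month in ACADEMIC_MONTHS:
        # Missing months or years count as zero starts.
        rows.append([month] + [starts.get(year, {}).get(month, 0) for year in years])
    rows.append(["Total"] + [sum(starts.get(year, {}).values()) for year in years])
    return rows
```

Iterating over a fixed list rather than sorting month names keeps the rows in August to July order regardless of the order in which the data arrives.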
Usage:
# Automatic discovery (uses most recent file)
python3 monthly.py [options] [standard_code]
# Specify a file explicitly
python3 monthly.py [options] [standard_code] [input_file]
Options:
- --csv, -c: Output in CSV format
- --table: Output in console table format
- --tsv, -t: Output in tab-separated format
- --help, -h: Show help message
Default Standard: ST0116 (Software Developer)
Examples:
python3 monthly.py # ST0116, latest file
python3 monthly.py ST0113 # ST0113, latest file
python3 monthly.py ST0116 data.csv # ST0116, specific file
python3 monthly.py --table ST0113 # ST0113, table format
All scripts automatically discover and use the most recent data files based on:
- Academic year (e.g., 2024-25 is newer than 2023-24)
- Quarter/month (e.g., Q3 is newer than Q2, Nov is newer than Mar)
Files are found in:
- Current directory
- apprenticeships_*/supporting-files/ folders
Supported filename patterns:
- Quarterly: app-underlying-data-{type}-{year}-q{1-4}.csv
  - Example: app-underlying-data-vacancies-202425-q2.csv
- Monthly: app-underlying-data-{type}-{year}-{month}.csv
  - Example: app-underlying-data-monthly-202425-mar.csv
See FILE_DISCOVERY.md for complete documentation.
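
One way to rank filenames by recency is to parse the year and period out of the name and compare them as a tuple. The sketch below is an assumption-laden illustration, not the project's implementation: `recency_key` and `pick_latest` are hypothetical names, and the month ranking uses calendar order only so that it agrees with the "Nov is newer than Mar" example above. FILE_DISCOVERY.md remains the authority on the actual ordering.

```python
import re
from typing import List, Optional, Tuple

# Assumed ranking: calendar order, so "nov" ranks above "mar".
MONTH_RANK = {
    name: rank
    for rank, name in enumerate(
        ["jan", "feb", "mar", "apr", "may", "jun",
         "jul", "aug", "sep", "oct", "nov", "dec"],
        start=1,
    )
}

# Captures the six-digit year (e.g. 202425) and a q1-q4 or three-letter month part.
_NAME_RE = re.compile(r"-(\d{6})-(q[1-4]|[a-z]{3})\.csv$")


def recency_key(filename: str) -> Optional[Tuple[int, int]]:
    """Return (year, period rank) for a data filename, or None if it does not match."""
    match = _NAME_RE.search(filename.lower())
    if match is None:
        return None
    year, period = match.groups()
    if len(period) == 2:  # quarterly form: "q1" to "q4"
        return int(year), int(period[1])
    rank = MONTH_RANK.get(period)
    return None if rank is None else (int(year), rank)


def pick_latest(filenames: List[str]) -> Optional[str]:
    """Pick the filename with the highest (year, period) key, ignoring non-matches."""
    keyed = [(recency_key(name), name) for name in filenames]
    keyed = [(key, name) for key, name in keyed if key is not None]
    return max(keyed)[1] if keyed else None
```

The key is only meaningful when comparing files of the same kind (all quarterly or all monthly), since quarter numbers and month ranks share one slot in the tuple.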
The project uses a modular architecture with shared utilities:
- vacancies.py, starts.py, monthly.py - Main analysis scripts
- utils.py - Shared utilities (name cleaning, file discovery, table formatting)
- config.py - Configuration constants (thresholds, field names, patterns)
- test_utils.py - Unit tests for utility functions
- test_file_discovery.py - Tests for file discovery logic
These scripts work with CSV files downloaded from the DfE's apprenticeship statistics releases:
Download the "Underlying data" files and place them in:
- Root directory, or
- apprenticeships_YYYY-YY/supporting-files/ folders
Scripts automatically find and use the most recent files.
All scripts support multiple output formats optimised for different use cases:
Format | Use Case | Option |
---|---|---|
Markdown | Documentation, reports, Notion inline tables | Default |
CSV | Import into databases, spreadsheets | --csv |
TSV | Copy-paste into existing tables | --tsv |
Table | Console viewing, terminal output | --table |
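
The four output formats can be sketched with one rendering function that takes headers and rows. This is a minimal illustration under assumptions: the name `render` and its `fmt` values are hypothetical, and the project's own table formatting in utils.py may differ in spacing and escaping.

```python
import csv
import io
from typing import List, Sequence


def render(headers: Sequence[str], rows: Sequence[Sequence[object]], fmt: str = "markdown") -> str:
    """Render tabular data as markdown, csv, tsv, or an aligned console table."""
    if fmt in ("csv", "tsv"):
        buffer = io.StringIO()
        delimiter = "," if fmt == "csv" else "\t"
        writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
        writer.writerow(headers)
        writer.writerows(rows)
        return buffer.getvalue()

    # Convert every cell to text once; the header is treated as the first row.
    cells = [[str(cell) for cell in row] for row in [list(headers)] + [list(r) for r in rows]]  # type: List[List[str]]
    if fmt == "table":
        # Pad each column to the width of its longest cell.
        widths = [max(len(row[i]) for row in cells) for i in range(len(headers))]
        lines = ["  ".join(c.ljust(w) for c, w in zip(row, widths)).rstrip() for row in cells]
    elif fmt == "markdown":
        lines = ["| " + " | ".join(cells[0]) + " |", "|" + "|".join("---" for _ in headers) + "|"]
        lines += ["| " + " | ".join(row) + " |" for row in cells[1:]]
    else:
        raise ValueError("unknown format: " + fmt)
    return "\n".join(lines) + "\n"
```

Using the standard library `csv` module for the CSV and TSV branches means values containing commas, tabs, or quotes are quoted correctly without extra code.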
Runtime:
- Python 3.6+
- Standard library only (no external dependencies)
Development (optional):
pip3 install -r requirements.txt
Includes:
- pytest - for running tests
- mypy - for type checking (optional)
- black - for code formatting (optional)
- flake8 - for linting (optional)
Run the test suite to verify functionality:
# Run all tests
python3 test_utils.py
python3 test_file_discovery.py
# Or with pytest (if installed)
pytest test_*.py -v
Thresholds and settings can be adjusted in config.py:
# Provider categorisation thresholds
VACANCY_LARGE_PROVIDER_THRESHOLD = 10 # Providers with >10 positions
VACANCY_MEDIUM_PROVIDER_MIN = 4 # Providers with 4-10 positions
VACANCY_SMALL_PROVIDER_MAX = 3 # Providers with ≤3 positions
# Starts analysis
STARTS_MIN_THRESHOLD = 3 # Minimum starts to show separately
# Standard codes
DEFAULT_STANDARD_CODE = 'ST0116' # Software Developer Level 4
- README.md (this file) - Overview and usage
- CLAUDE.md - Instructions for Claude Code development
- REFACTORING.md - Details of refactoring improvements
- FILE_DISCOVERY.md - Intelligent file discovery documentation
- requirements.txt - Development dependencies
# 1. Download latest DfE data files
# Place in root or apprenticeships_2024-25/supporting-files/
# 2. Run analysis scripts (automatically use latest files)
python3 vacancies.py --table
python3 starts.py ST0116 --csv
python3 monthly.py --tsv
# 3. Output can be redirected to files
python3 vacancies.py --csv > vacancies_output.csv
python3 starts.py --table ST0116 > starts_report.txt
# Software Developer (Level 4)
python3 starts.py ST0116
python3 monthly.py ST0116
# Data Analyst (Level 4)
python3 starts.py ST0118
python3 monthly.py ST0118
# Cyber Security Technologist (Level 3)
python3 starts.py ST0622
python3 monthly.py ST0622
# Use specific older file
python3 vacancies.py apprenticeships_2023-24/supporting-files/app-underlying-data-vacancies-202324-q4.csv
# Compare different quarters
python3 starts.py ST0116 app-data-starts-202324-q4.csv > q4_2023.txt
python3 starts.py ST0116 app-data-starts-202425-q2.csv > q2_2024.txt
diff q4_2023.txt q2_2024.txt
Solution:
- Ensure files are named correctly: app-underlying-data-{type}-{year}-{quarter}.csv
- Check files are in root directory or apprenticeships_*/supporting-files/
- Verify year format: 202425 not 2024-25
Debug:
from utils import find_latest_file
print(find_latest_file('app-underlying-data-vacancies-*.csv'))
Solution: Specify file explicitly:
python3 vacancies.py path/to/specific/file.csv
Check:
- Verify standard code is correct (e.g., ST0116 not ST116)
- Ensure CSV file contains data for the specified standard
- Check CSV field names match expected format
When modifying the code:
- Add configuration to config.py (not hardcoded in scripts)
- Add shared logic to utils.py
- Write tests for new functionality
- Use type hints on all functions
- Follow the coding standards in CLAUDE.md
This code is provided for analysing publicly available DfE apprenticeship statistics.
For questions about the DfE data:
For issues with these scripts:
- Review the documentation files in this repository
- Check the test files for usage examples