This repository contains work by @bennettoxford/team-rsi to survey how "variables" are used by researchers in OpenSAFELY studies.
A "variable" is a demographic or clinical feature of interest within an Electronic Health Record (EHR) that is used within an EHR study.
The work in this repository encompasses studies written using the deprecated cohort-extractor framework, and those using its replacement, ehrQL.
The main entry point for cohort-extractor related code in this repository is `main.py`, which is most conveniently run using the `just` command, `just run`.
`just run fetch` will fetch all variables from all studies in all repositories in the opensafely organisation.
This requires a suitable GitHub Personal Access Token (PAT);
more information on this is available in the Developer Documentation.
`just run notebook` will start a local Marimo notebook server,
from which the `names.py` and `definitions.py` notebooks can be accessed.
These notebooks contain the analysis of variable names and definitions performed as part of this work,
and can be used as a starting point for any desired future analysis.
We catalogue the variables defined in ehrQL datasets across public studies in
the opensafely GitHub organisation. The main script searches for dataset
pipelines declared in each study's `project.yaml`, clones the matching
repositories, evaluates the dataset definitions with a spoofed runtime
environment, and emits structured summaries of every variable that is declared.
```shell
# Run the full ehrQL extraction process
just ehrql

# Run the ehrQL extraction process with verbose logging
just ehrql --verbose

# Run the ehrQL extraction process with a specific study
just ehrql study-name
```

- Invokes the GitHub CLI (`gh`) to search for `project.yaml` files that reference `ehrql` dataset generation commands within the `opensafely` organisation.
- Clones each matching repository into `.ehrql_repo_cache/` (one shallow clone per commit SHA) and extracts the dataset authoring files listed in `project.yaml`.
- Imports each dataset module inside a controlled environment that spoofs external inputs (files, command-line arguments, parameters, and selected `ehrql` behaviours) so that variable definitions can be resolved without the original study artefacts.
- Records every discovered variable, including the dataset file, variable name, inferred `ehrql` series type, source line number (where available), and two deterministic hashes of the compiled query-model node.
- Writes the aggregated results to `data/ehrql_variables.json` and captures the hashed query-model lookup table in `ehrql_qm_dump.json`.
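The deterministic hashing mentioned above can be sketched as follows. This is an illustration only: `expr_hash` is a hypothetical helper, and the real computation in `ehrql_extractor.py` hashes the compiled `ehrql` query-model node (with code sets stripped for `expr_hash_no_codes`) and may differ in detail.

```python
import hashlib


def expr_hash(qm_repr: str) -> str:
    """Deterministic SHA-256 over a stable string rendering of a query-model node.

    (Sketch: the real extractor operates on compiled ehrql nodes, not strings.)
    """
    return hashlib.sha256(qm_repr.encode("utf-8")).hexdigest()


# Hypothetical string rendering of a query-model node.
node = "SelectColumn(name='sex', source=SelectPatientTable('patients'))"
print(expr_hash(node))
```

Because the hash is deterministic, the same variable definition always maps to the same value, which is what lets identical variables be matched across studies.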
- `ehrql_extractor.py` – end-to-end collector used to discover and execute dataset definitions.
- `spoofed_data/` – configurable fixtures used to satisfy dataset expectations (JSON, CSV/Arrow data, parameters, and CLI arguments).
- `data/ehrql_variables.json` / `ehrql_qm_dump.json` – outputs generated by the collector.
`ehrql_variables.json` captures a timestamped snapshot of every processed
project:
```json
{
  "generated_at": "YYYY-MM-DD HH:MM:SS",
  "projects": {
    "opensafely/example-study": {
      "sha": "commitsha",
      "files": {
        "analysis/dataset.py": [
          ["variable_name", "StrPatientSeries", 42, "expr_hash", "expr_hash_no_codes"]
        ]
      }
    }
  }
}
```

- `series_type` reflects the runtime type reported by `ehrql` (e.g. `StrPatientSeries`).
- `line_no` is either an integer line number where the variable is defined, or a tuple of filename and line number if the variable is defined in a different file to the dataset module.
- `expr_hash` is a deterministic SHA-256 hash of the `ehrql` query-model node; `expr_hash_no_codes` removes code sets to normalise across codelist variants.
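Consumers of this file can walk the documented structure directly. The sketch below, with a hypothetical `iter_variables` helper and a toy snapshot, shows how the five-element rows unpack:

```python
import json


def iter_variables(snapshot):
    """Yield (project, file, name, series_type) from an ehrql_variables.json snapshot."""
    for project, info in snapshot["projects"].items():
        for path, rows in info["files"].items():
            for name, series_type, line_no, expr_hash, expr_hash_no_codes in rows:
                yield project, path, name, series_type


# Toy snapshot following the format documented above.
snapshot = json.loads("""{
  "generated_at": "2025-01-01 00:00:00",
  "projects": {
    "opensafely/example-study": {
      "sha": "commitsha",
      "files": {
        "analysis/dataset.py": [
          ["age", "IntPatientSeries", 12, "h1", "h2"]
        ]
      }
    }
  }
}""")

print(list(iter_variables(snapshot)))
```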
The `data/ehrql_variables.json` file is currently copied into this repository manually
and made available via a tool deployed here.
`ehrql_qm_dump.json` maps every `expr_hash_no_codes` to the corresponding query-model
string, which might be useful for future debugging, but it is massive and so
not committed to git.
Many studies reference generated files, runtime parameters, or CLI arguments that
are not committed to their repositories. The collector fills these gaps by
providing stand-ins under `spoofed_data/`:

- `args.json` – fake command-line arguments keyed by repository. Add new entries if a study terminates early because an expected option is missing.
- `parameters.json` – values returned when `ehrql.get_parameter()` is invoked. Repository-specific overrides can be provided alongside a default.
- `json_data.json` and `csv_data.csv` – minimal payloads returned whenever datasets open JSON or CSV files (including the gzipped and Arrow derivatives generated automatically).
Update the CSV or JSON fixtures as new columns or properties are required. The `.csv.gz` and `.arrow`
files are regenerated from `csv_data.csv` automatically by the main script each time it is run.
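The gzipped derivative is simply the compressed bytes of the CSV fixture. A minimal sketch of that regeneration step (the `regenerate_gz` helper is hypothetical; the `.arrow` derivative additionally needs `pyarrow`, which is omitted here):

```python
import gzip
import shutil
from pathlib import Path


def regenerate_gz(csv_path: Path) -> Path:
    """Write <name>.csv.gz next to the source CSV fixture.

    (Sketch only; the real regeneration in the main script may differ.)
    """
    gz_path = csv_path.with_suffix(csv_path.suffix + ".gz")
    with csv_path.open("rb") as f_in, gzip.open(gz_path, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
    return gz_path
```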
A companion script, `ehrql_codelist_extractor.py`, extracts `codelist_from_csv()`
calls for each variable defined in ehrQL datasets. Unlike the main extractor,
this script uses static AST parsing only (no module execution) to find codelist
references. It also allows filtering to specific commits in repositories, so you
can extract codelists that were actually used in studies.
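The static-AST approach can be sketched as below. This is a simplified illustration, not the actual extractor: `find_codelist_calls` is a hypothetical helper, it only matches bare `codelist_from_csv(...)` names (not attribute calls like `module.codelist_from_csv`), and it ignores non-literal arguments.

```python
import ast


def find_codelist_calls(source: str):
    """Statically collect codelist_from_csv(...) calls without importing the module."""
    calls = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id == "codelist_from_csv"
        ):
            # Keep only literal arguments; dynamic expressions are skipped.
            args = [ast.literal_eval(a) for a in node.args if isinstance(a, ast.Constant)]
            kwargs = {
                kw.arg: ast.literal_eval(kw.value)
                for kw in node.keywords
                if isinstance(kw.value, ast.Constant)
            }
            calls.append((node.lineno, args, kwargs))
    return calls


example = '''
from ehrql import codelist_from_csv
asthma_codes = codelist_from_csv("codelists/asthma.csv", column="code")
'''
print(find_codelist_calls(example))
```

Because no module is executed, this works even when a study's generated inputs are missing, which is exactly why the companion script avoids the spoofed-execution machinery of the main extractor.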
```shell
# Run the full codelist extraction process for the latest commit of each ehrQL repo
just codelists

# Run with verbose logging
just codelists --verbose

# Extract codelists from specific repositories
just codelists opensafely/repo1 opensafely/repo2

# Specify a csv input file with repo/commit pairs (one per line) to run the extraction against
just codelists --csv All.jobs-data-2025-11-10.18_25_26.csv

# Specify a different output file
just codelists --output my_codelists.json

# Force the update rather than reusing the cached output
just codelists --force
```

NB: for the periodic refreshes of this data you need to do the following:
- Update the `rsi-codelists-analysis.json` file

TODO - this is now out of date given recent changes to this file - we need to update these instructions
```shell
# Assumes starting from this directory and that the opencodelists repo is checked out at the same level as this one
curr_dir=$(pwd)
ocl_dir="../opencodelists"
# Go to the opencodelists repo
cd $ocl_dir
# Check out the branch
git checkout Jongmassey/dump-rsi-json
# Get the latest backup of the prod db (can't use the sanitised one)
scp dokku3:/var/lib/dokku/data/storage/opencodelists/backup/db/latest-db.sqlite3.zst .
zstd -d latest-db.sqlite3.zst -o latest-db.sqlite3
# Run the script to dump the rsi json
. .venv/bin/activate # avoid running in the wrong python environment
DATABASE_URL=sqlite:///latest-db.sqlite3 just manage runscript dump_rsi_json
# Move the output back to this repo
cp rsi-codelists-analysis.json $curr_dir/data/rsi-codelists-analysis.json
# Clean up the opencodelists directory
rm rsi-codelists-analysis.json
rm latest-db.sqlite3
rm latest-db.sqlite3.zst
git checkout main
# Return here
cd $curr_dir
. .venv/bin/activate # back to the python environment for this repo
```

- Update the all jobs data csv
```shell
# ASSUMES YOU ARE USING THE DOCKER VERSION OF POSTGRES IN THE JOB-SERVER REPO
# IF NOT THEN CHECK OUT THE DEVELOPERS DOC IN THE JOB-SERVER REPO
# Assumes starting from this directory and that the job-server repo is checked out at the same level as this one
curr_dir=$(pwd)
job_server_dir="../job-server"
# Go to the job-server repo
cd $job_server_dir
# Restore a new copy of the job-server database dump
# (see https://github.com/opensafely-core/job-server/blob/main/DEVELOPERS.md#restoring-backups). Currently this is:
. .venv/bin/activate # avoid running in the wrong python environment
scp dokku4:/var/lib/dokku/data/storage/job-server/jobserver.dump jobserver.dump
just docker/restore-db jobserver.dump
# Execute query to get the csv data
just docker/exec db "psql -U user -d jobserver -c \"\copy (select r.url, jr.sha, to_char(date_trunc('second', jr.created_at) AT TIME ZONE 'Europe/London', 'YYYY-MM-DD HH24:MI:SS') as created_at, jr._status from jobserver_workspace w join jobserver_repo r on w.repo_id = r.id join jobserver_jobrequest jr on jr.workspace_id = w.id order by r.url, jr.created_at, jr.sha) to stdout with csv header\" > all_jobs.csv"
# Move the output back to this repo
cp docker/all_jobs.csv $curr_dir/data/
# Clean up the job-server directory
rm docker/all_jobs.csv
# Return here
cd $curr_dir
. .venv/bin/activate # back to the python environment for this repo
```
- Rerun the script
```shell
just codelists --csv data/all_jobs.csv
```

- Copy to the rsi-prototypes repo and update the tool
```shell
# Assuming you are in this directory and have the rsi-prototypes repo checked out at the same level as this one
rsi_dir="../rsi-prototypes"
cp data/ehrql_codelists.json $rsi_dir/scripts/data-processing/codelist-explorer/ehrql_codelists.json
cp data/rsi-codelists-analysis.json $rsi_dir/scripts/data-processing/codelist-explorer/rsi-codelists-analysis.json
cp data/all_jobs.csv $rsi_dir/scripts/data-processing/codelist-explorer/all_jobs.csv
# Then process the files as follows
cd $rsi_dir
. .venv/bin/activate # activate the python environment for the rsi-prototypes repo
python scripts/data-processing/codelist-explorer/preprocess.py
# Should cause codelist-explorer-data.json and .json.br to be updated
# Optionally test locally with
just run # then open localhost:5000
# Deploy by committing the changes to the rsi-prototypes repo and pushing to the main branch, which should trigger the digital ocean deployment
cd $curr_dir
. .venv/bin/activate # back to the python environment for this repo
```
If you want to specify particular commits to extract codelists from, you can
provide a CSV file with two columns: `url` and `sha`. Each row should contain the
repository URL and the specific commit SHA to extract codelists from. For example:

```csv
url,sha
https://github.com/opensafely/example-study,commitsha1
https://github.com/opensafely/another-study,commitsha2
```

The filename of the csv can be provided to the script using the `--csv` argument as shown above.
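Parsing that two-column file is straightforward; a sketch with a hypothetical `read_repo_commits` helper (the actual script's parsing may differ):

```python
import csv
import io


def read_repo_commits(text: str):
    """Parse a url,sha CSV into (repo_url, commit_sha) pairs."""
    reader = csv.DictReader(io.StringIO(text))
    return [(row["url"], row["sha"]) for row in reader]


sample = "url,sha\nhttps://github.com/opensafely/example-study,commitsha1\n"
print(read_repo_commits(sample))
```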
`ehrql_codelists.json` captures codelist references for each variable. Often, the variables and
codelists in a study don't change across multiple runs of a study across different commits. To
avoid duplicating this output we proceed as follows: the output list of variables and associated
codelists for a given study commit is hashed, and we map the commit SHA to this hash. The actual
detail of the variables and codelists is only stored once per unique hash, later on in the output file.
The precise format is as follows:
```json
{
  "generated_at": "YYYY-MM-DD HH:MM:SS",
  "projects": {
    "opensafely/example-study": {
      "commit_sha_1": "variables_hash_value_1",
      "commit_sha_2": "variables_hash_value_2",
      "commit_sha_3": "variables_hash_value_1",
      ...
    },
    ...
  },
  "signatures": {
    "variables_hash_value_1": {
      "analysis/dataset.py": {
        "_unused_codelists": [
          ["codelist_url", "file=path_to_csv_file"]
        ],
        "variable_name_1": [
          ["codelist_file.csv", "column_name=code", "another_arg=value"],
          ["another_codelist_file.csv"]
        ],
        "variable_name_2": []
      },
      "another_file.py": {
        "variable_name_3": [
          ["codelist_file.csv"]
        ]
      }
    },
    "variables_hash_value_2": {
      ...
    }
  }
}
```

The current approach finds most codelists but isn't perfect. An alternative would be to parse the files to find the ehrQL variables first, then re-parse to find codelists only for those variables. We could also use the `_qm_node` property of the parsed ehrQL dataset to find codelist references. The complication there is that while the `_qm_node` records that a codelist was used, it only retains the list of codes and not the name, so a reverse lookup would be needed to find which codelist produced those codes. This would be more complex, and might still not be complete, but could be more accurate.
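The commit-to-hash deduplication described above can be sketched as follows. `signature_hash` is a hypothetical name; the real hashing in `ehrql_codelist_extractor.py` may use a different canonicalisation.

```python
import hashlib
import json


def signature_hash(files: dict) -> str:
    """Hash the per-file variables/codelists mapping for one commit.

    Serialising with sorted keys makes the hash independent of dict ordering,
    so identical signatures across commits collapse to one stored record.
    """
    canonical = json.dumps(files, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Two commits whose variables/codelists are identical share a signature entry.
commit_a = {"analysis/dataset.py": {"variable_name_1": [["codelist_file.csv"]]}}
commit_b = {"analysis/dataset.py": {"variable_name_1": [["codelist_file.csv"]]}}
assert signature_hash(commit_a) == signature_hash(commit_b)
```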
We populate the codelist quality report with the `codelist_audit.json` file, generated as follows.

```shell
# 1. Ensure the `rsi-codelists-analysis.json` file is updated as per the instructions above.
# 2. Update the codelist dump as follows:
just codelists --output ehrql_codelists_latest.json
# 3. Update the zero count data
python codelist_zero_counts.py
# 4. Update the codelist audit data file
python codelist_audit_report.py
```

Please see the additional information.