OpenSAFELY Variable Survey

This repository contains work by @bennettoxford/team-rsi to survey how "variables" are used by researchers in OpenSAFELY studies.

A "variable" is a demographic or clinical feature of interest within an Electronic Health Record (EHR) that is used within an EHR study.

The work in this repository encompasses studies written using the deprecated cohort-extractor framework, and those using its replacement, ehrQL.

cohort-extractor studies

The main entry point for cohort-extractor-related code in this repository is main.py, which is most conveniently run via the just command: just run.

just run fetch will fetch all variables from all studies in all repositories in the opensafely organisation. This requires a suitable GitHub Personal Access Token (PAT); more information is available in the Developer Documentation.

just run notebook will start a local Marimo notebook server, from which the names.py and definitions.py notebooks can be accessed. These notebooks contain the analysis of variable names and definitions performed as part of this work, and can be used as a starting point for any future analysis.

ehrQL studies

We catalogue the variables defined in ehrql datasets across public studies in the opensafely GitHub organisation. The main script searches for dataset pipelines declared in each study's project.yaml, clones the matching repositories, evaluates the dataset definitions with a spoofed runtime environment, and emits structured summaries of every variable that is declared.

Execution

# Run the full ehrQL extraction process
just ehrql

# Run the ehrQL extraction process with verbose logging
just ehrql --verbose

# Run the ehrQL extraction process with a specific study
just ehrql study-name

How It Works

  1. Invokes the GitHub CLI (gh) to search for project.yaml files that reference ehrql dataset generation commands within the opensafely organisation.
  2. Clones each matching repository into .ehrql_repo_cache/ (one shallow clone per commit SHA) and extracts the dataset authoring files listed in project.yaml.
  3. Imports each dataset module inside a controlled environment that spoofs external inputs (files, command-line arguments, parameters, and selected ehrql behaviours) so that variable definitions can be resolved without the original study artefacts.
  4. Records every discovered variable including the dataset file, variable name, inferred ehrql series type, source line number (where available), and two deterministic hashes of the compiled query-model node.
  5. Writes the aggregated results to data/ehrql_variables.json and captures the hashed query-model lookup table in ehrql_qm_dump.json.
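
The two hashes in step 4 can be sketched as follows. This is a hypothetical illustration, not ehrql's actual hashing code: it assumes the query-model node can be serialised as a nested dict/list structure, with codelist codes stored under a "codes" key.

```python
import hashlib
import json


def stable_hash(node) -> str:
    """Deterministically hash a nested dict/list/scalar structure.

    Stand-in for hashing an ehrql query-model node: serialising with
    sorted keys means equivalent structures always hash equally.
    """
    payload = json.dumps(node, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def stable_hash_no_codes(node) -> str:
    """Hash the same structure with any "codes" lists blanked out, so
    variables that differ only in their codelists hash identically."""
    def strip(obj):
        if isinstance(obj, dict):
            return {k: ([] if k == "codes" else strip(v)) for k, v in obj.items()}
        if isinstance(obj, list):
            return [strip(v) for v in obj]
        return obj

    return stable_hash(strip(node))
```

Blanking rather than removing the "codes" key keeps the overall shape of the node intact, so two variables only collide when everything except their code sets matches.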

Key files

  • ehrql_extractor.py – end-to-end collector used to discover and execute dataset definitions.
  • spoofed_data/ – configurable fixtures used to satisfy dataset expectations (JSON, CSV/Arrow data, parameters, and CLI arguments).
  • data/ehrql_variables.json / ehrql_qm_dump.json – outputs generated by the collector.

Output Format

ehrql_variables.json captures a timestamped snapshot of every processed project:

{
  "generated_at": "YYYY-MM-DD HH:MM:SS",
  "projects": {
    "opensafely/example-study": {
      "sha": "commitsha",
      "files": {
        "analysis/dataset.py": [
          ["variable_name", "StrPatientSeries", 42, "expr_hash", "expr_hash_no_codes"]
        ]
      }
    }
  }
}
  • series_type reflects the runtime type reported by ehrql (e.g. StrPatientSeries).
  • line_no is either an integer line number where the variable is defined, or a (filename, line number) tuple if the variable is defined in a different file from the dataset module.
  • expr_hash is a deterministic SHA-256 hash of the ehrql query-model node; expr_hash_no_codes removes code sets to normalise across codelist variants.
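
A short sketch of consuming this snapshot, assuming the five-element variable entries shown above (name, series type, line number, and the two hashes); count_series_types is a hypothetical helper, not part of the repository:

```python
import json
from collections import Counter


def count_series_types(snapshot: dict) -> Counter:
    """Tally the inferred ehrql series types across every project and
    dataset file in an ehrql_variables.json snapshot."""
    counts = Counter()
    for project in snapshot.get("projects", {}).values():
        for variables in project.get("files", {}).values():
            for name, series_type, line_no, expr_hash, expr_hash_no_codes in variables:
                counts[series_type] += 1
    return counts
```

For example, count_series_types(json.load(open("data/ehrql_variables.json"))) would give the distribution of series types across all surveyed studies.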

The data/ehrql_variables.json file is currently copied into this repository manually and made available via a tool deployed here.

ehrql_qm_dump.json maps every expr_hash_no_codes to the corresponding query-model string, which may be useful for future debugging, but it is very large and so is not committed to git.

Spoofed Inputs and Adaptation

Many studies reference generated files, runtime parameters, or CLI arguments that are not committed to their repositories. The collector fills these gaps by providing stand-ins under spoofed_data/:

  • args.json – fake command-line arguments keyed by repository. Add new entries if a study terminates early because an expected option is missing.
  • parameters.json – values returned when ehrql.get_parameter() is invoked. Repository-specific overrides can be provided alongside a default.
  • json_data.json and csv_data.csv – minimal payloads returned whenever datasets open JSON or CSV files (including the gzipped and Arrow derivatives generated automatically).

Update the CSV or JSON fixtures as new columns or properties are required. The .csv.gz and .arrow files are regenerated from csv_data.csv automatically by the main script each time it is run.
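
The gzipped derivative can be regenerated with a few lines of stdlib Python. This is a sketch of that step only (the Arrow derivative needs pyarrow and is omitted), and regenerate_gzip is a hypothetical helper name:

```python
import gzip
import shutil
from pathlib import Path


def regenerate_gzip(csv_path: str) -> Path:
    """Regenerate the .csv.gz fixture next to csv_data.csv by
    compressing the file's bytes verbatim."""
    src = Path(csv_path)
    dest = src.parent / (src.name + ".gz")
    with src.open("rb") as f_in, gzip.open(dest, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
    return dest
```

Because the derivatives are regenerated on every run, only csv_data.csv itself needs editing when a study expects new columns.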

Codelist Extraction

A companion script ehrql_codelist_extractor.py extracts codelist_from_csv() calls for each variable defined in ehrQL datasets. Unlike the main extractor, this script uses static AST parsing only (no module execution) to find codelist references. It also allows filtering to specific commits in repositories so you can extract codelists that were actually used in studies.
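
The AST-only approach can be sketched as follows. find_codelist_calls is a hypothetical helper, and real call sites may pass the filename in ways this sketch does not cover (e.g. via a variable rather than a string literal):

```python
import ast


def find_codelist_calls(source: str):
    """Statically locate codelist_from_csv(...) calls in a dataset
    module without executing any code.

    Returns (line number, CSV filename) pairs; the filename is None
    when the first argument is not a plain string literal.
    """
    calls = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        name = func.attr if isinstance(func, ast.Attribute) else getattr(func, "id", None)
        if name != "codelist_from_csv":
            continue
        first = node.args[0] if node.args else None
        filename = first.value if isinstance(first, ast.Constant) else None
        calls.append((node.lineno, filename))
    return calls
```

Because nothing is executed, this works even on dataset definitions whose runtime inputs are missing, which is exactly why the codelist extractor avoids the spoofed-execution machinery.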

Execution

# Run the full codelist extraction process for the latest commit of each ehrQL repo
just codelists

# Run with verbose logging
just codelists --verbose

# Extract codelists from specific repositories
just codelists opensafely/repo1 opensafely/repo2

# Specify a csv input file with repo/commit pairs (one per line) to run the extraction against
just codelists --csv All.jobs-data-2025-11-10.18_25_26.csv

# Specify a different output file
just codelists --output my_codelists.json

# Force the update rather than reusing the cached output
just codelists --force

Refreshing the data

NB: the periodic refreshes of this data require the following steps:

  1. Update the rsi-codelists-analysis.json

TODO - this is now out of date given recent changes to this file - we need to update these instructions

# Assumes starting from this directory and that the opencodelists repo is checked out at the same level as this one
curr_dir=$(pwd)
ocl_dir="../opencodelists"
# Goto opencodelists repo
cd $ocl_dir

# Checkout branch
git checkout Jongmassey/dump-rsi-json

# Get latest backup of prod db (can't use sanitised)
scp dokku3:/var/lib/dokku/data/storage/opencodelists/backup/db/latest-db.sqlite3.zst .
zstd -d latest-db.sqlite3.zst -o latest-db.sqlite3

# Run script to dump rsi json
. .venv/bin/activate # avoid running in the wrong python environment
DATABASE_URL=sqlite:///latest-db.sqlite3 just manage runscript dump_rsi_json

# Move the output back to this repo
cp rsi-codelists-analysis.json $curr_dir/data/rsi-codelists-analysis.json

# Cleanup the ocl directory
rm rsi-codelists-analysis.json
rm latest-db.sqlite3
rm latest-db.sqlite3.zst
git checkout main

# Return here
cd $curr_dir
. .venv/bin/activate # back to the python environment for this repo
  2. Update the all jobs data csv
# ASSUMES YOU ARE USING THE DOCKER VERSION OF POSTGRES IN THE JOB-SERVER REPO
# IF NOT THEN CHECK OUT THE DEVELOPERS DOC IN THE JOB-SERVER REPO

# Assumes starting from this directory and that the job-server repo is checked out at the same level as this one
curr_dir=$(pwd)
job_server_dir="../job-server"

# Goto job-server repo
cd $job_server_dir

# [Restore](https://github.com/opensafely-core/job-server/blob/main/DEVELOPERS.md#restoring-backups) a new copy of the job-server database dump. Currently this is:
. .venv/bin/activate # avoid running in the wrong python environment
scp dokku4:/var/lib/dokku/data/storage/job-server/jobserver.dump jobserver.dump
just docker/restore-db jobserver.dump

# Execute query to get the csv data
just docker/exec db "psql -U user -d jobserver -c \"\copy (select r.url, jr.sha, to_char(date_trunc('second', jr.created_at) AT TIME ZONE 'Europe/London', 'YYYY-MM-DD HH24:MI:SS') as created_at, jr._status from jobserver_workspace w join jobserver_repo r on w.repo_id = r.id join jobserver_jobrequest jr on jr.workspace_id = w.id order by r.url, jr.created_at, jr.sha) to stdout with csv header\" > all_jobs.csv"

# Move the output back to this repo
cp docker/all_jobs.csv $curr_dir/data/

# Cleanup the job-server directory
rm docker/all_jobs.csv

# Return here
cd $curr_dir
. .venv/bin/activate # back to the python environment for this repo
  3. Rerun the script
just codelists --csv data/all_jobs.csv
  4. Copy to the rsi-prototypes repo and update the tool
# Assuming you are in this directory and have the rsi-prototypes repo checked out at the same level as this one
rsi_dir="../rsi-prototypes"
cp data/ehrql_codelists.json $rsi_dir/scripts/data-processing/codelist-explorer/ehrql_codelists.json
cp data/rsi-codelists-analysis.json $rsi_dir/scripts/data-processing/codelist-explorer/rsi-codelists-analysis.json
cp data/all_jobs.csv $rsi_dir/scripts/data-processing/codelist-explorer/all_jobs.csv

# Then process the files as follows
cd $rsi_dir
. .venv/bin/activate # activate the python environment for the rsi-prototypes repo
python scripts/data-processing/codelist-explorer/preprocess.py

# Should cause codelist-explorer-data.json and .json.br to be updated
# Optionally test locally with
just run # then open localhost:5000

# Deploy by committing the changes to the rsi-prototypes repo and pushing to the main branch, which should trigger the digital ocean deployment
cd $curr_dir
. .venv/bin/activate # back to the python environment for this repo

CSV input

If you want to specify particular commits to extract codelists from, you can provide a CSV file with two columns: url and sha. Each row should contain the repository URL and the specific commit SHA to extract codelists from. For example:

url,sha
https://github.com/opensafely/example-study,commitsha1
https://github.com/opensafely/another-study,commitsha2

The CSV filename can be provided to the script using the --csv argument, as shown above.
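
Parsing such a file is straightforward with the stdlib csv module. load_repo_commits is a hypothetical helper that also normalises the URLs to org/name form:

```python
import csv
import io


def load_repo_commits(csv_text: str):
    """Parse a url,sha CSV (as described above) into (repo, sha)
    pairs, stripping the https://github.com/ prefix."""
    pairs = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        repo = row["url"].removeprefix("https://github.com/").rstrip("/")
        pairs.append((repo, row["sha"]))
    return pairs
```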

Output Format

ehrql_codelists.json captures codelist references for each variable. Often, a study's variables and codelists don't change across runs at different commits. To avoid duplicating output, the list of variables and associated codelists for a given study commit is hashed; each commit SHA is then mapped to that hash, and the full detail is stored only once per unique hash, later in the output file.

The precise format is as follows:

{
  "generated_at": "YYYY-MM-DD HH:MM:SS",
  "projects": {
    "opensafely/example-study": {
      "commit_sha_1": "variables_hash_value_1",
      "commit_sha_2": "variables_hash_value_2",
      "commit_sha_3": "variables_hash_value_1",
      ...
    },
    ...
  },
  "signatures": {
    "variables_hash_value_1": {
      "analysis/dataset.py": {
        "_unused_codelists": [
          ["codelist_url", "file=path_to_csv_file"]
        ],
        "variable_name_1": [
          ["codelist_file.csv", "column_name=code", "another_arg=value"],
          ["another_codelist_file.csv"]
        ],
        "variable_name_2": []
      },
      "another_file.py": {
        "variable_name_3": [
          ["codelist_file.csv"]
        ]
      }
    },
    "variables_hash_value_2": {
      ...
    }
  }
}
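
The hash-based deduplication described above can be sketched as follows. add_commit is a hypothetical helper, and the real script's serialisation and hashing details may differ:

```python
import hashlib
import json


def add_commit(output: dict, project: str, sha: str, variables: dict) -> str:
    """Record one commit's codelist detail, deduplicated by signature.

    The per-commit detail is hashed deterministically; the commit maps
    to that hash under "projects", and the detail itself is stored only
    once per unique hash under "signatures".
    """
    digest = hashlib.sha256(
        json.dumps(variables, sort_keys=True).encode("utf-8")
    ).hexdigest()
    output.setdefault("projects", {}).setdefault(project, {})[sha] = digest
    output.setdefault("signatures", {}).setdefault(digest, variables)
    return digest
```

Repeated commits with identical codelist detail thus cost only one extra SHA-to-hash line each, rather than a full copy of the detail.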

Notes

The current approach finds most codelists but isn't perfect. An alternative would be to parse the files to find the ehrQL variables first, then re-parse to find codelists only for those variables. We could also use the _qm_node property of the parsed ehrQL dataset to find codelist references. The complication there is that while the _qm_node records that a codelist was used, it only retains the list of codes, not the codelist's name, so a reverse lookup would be needed to find which codelist produced those codes. This would be more complex, and might still not be complete, but could be more accurate.

Codelist audit

We populate the codelist quality report with the codelist_audit.json file, generated as follows.

# 1. Ensure the `rsi-codelists-analysis.json` file is updated as per the instructions above.
# 2. Update the codelist dump as follows:
just codelists --output ehrql_codelists_latest.json

# 3. Update the zero count data
python codelist_zero_counts.py

# 4. Update the codelist audit data file
python codelist_audit_report.py

Developer docs

Please see the additional information in the developer documentation.
