Update db export script to generate PII scrubbed copy of the Job Server production database#5437

Draft
nishtha-kalra wants to merge 38 commits into main from nk/export-script

Conversation


@nishtha-kalra nishtha-kalra commented Nov 17, 2025

Description

Issue - https://github.com/opensafely-core/security/issues/33

  • Updated the dump_db script to generate a sanitised backup of the database with PII scrubbed.
  • Updated the relevant documentation accordingly.
  • PR for ADR - ADR for PII scrubbed copy of Job Server database for local development #5461
  • Tested this locally by changing the DATABASE_URL in .env to point to my local database, which held a copy of a production dump.
  • Slack thread related to this can be found here for anyone interested

Manual testing on new setup:

  • clone and switch to branch nk/export-script
  • switch to virtual env
  • copy dotenv-sample and rename copy to .env
scripts/dev-env.sh .env
  • Start docker and install requirements
just docker/db
just devenv
  • In another terminal, download and restore the latest production database dump into a separate database. We need this so that the sanitised db script can use a production-like database when creating a sanitised dump.
scp dokku4:/var/lib/dokku/data/storage/job-server/jobserver.dump jobserver.dump
createdb -h localhost -p 6543 -U user jobserver_prod 
pg_restore -h localhost -p 6543 -U user -d jobserver_prod --no-owner --no-acl jobserver.dump
  • Update DATABASE_URL in .env to point to this db
DATABASE_URL=postgres://user:pass@localhost:6543/jobserver_prod 
  • Build assets
just assets-rebuild
  • Load the changes in .env and run the sanitised db job to create a sanitised dump.
    Make sure the OUTPUT_PATH mentioned in the script exists on your system. I temporarily changed the path to ./sanitised_jobserver.dump.
set -a; source .env; set +a
just manage runjob dump_sanitised_db
  • Restore this sanitised db
just docker/restore-db sanitised_jobserver.dump
  • Update DATABASE_URL in .env to point back to the local jobserver database
DATABASE_URL=postgres://user:pass@localhost:6543/jobserver
  • Load changes in .env and run
set -a; source .env; set +a 
just manage migrate
just run
  • Go to Account -> Staff area -> Users -> select any user and check that the email is now fake
  • Screenshot attached for reference: Screenshot from 2026-01-05 15-42-53

@nishtha-kalra changed the title from "Update db export script" to "Update db export script to remove sensitive data" on Nov 18, 2025
@nishtha-kalra changed the title from "Update db export script to remove sensitive data" to "Update db export script to generate PII scrubbed copy of the Job Server production database" on Dec 2, 2025
@nishtha-kalra marked this pull request as ready for review December 2, 2025 10:19

@mikerkelly left a comment


Overall

This script seems like a well-structured solution to a very knotty problem that was complicated by our single-node infrastructure. I think the overall approach is good, given our constraints, and appreciate how you've broken down the main function into several smaller functions for structure. However, I do think there are several issues that must be resolved before we can merge this, some of them quite significant. I have some comments and questions below and inline. I appreciate this is quite a lot of feedback all at once, on a big complicated topic, so I've tried to keep it high level rather than too many line-by-line code comments.

Included/excluded fields

I think you decided which fields to include or exclude during this work. This was an area mentioned to be explored in the issue. How did you decide on what to include or exclude? It is not easy to analyse what is excluded from the allow list. There are many more inclusions than exclusions but the exclusions are the most significant information, and are only implicit in the allow list. They will also change over time as fields are added. How will someone looking at this in future understand what they are? Is there anything that can be documented about your thought process? If you used any temporary tooling for this, can it be preserved somewhere in case it is useful if we need to revisit this?

What will happen if a field is renamed, removed, or changes data type; or a developer forgets to add a new or changed one to the allow list? Will the script fail? Will the dump be valid?

I am concerned that _fake_expression may lead to errors if used on a table with a uniqueness constraint. I also think it assumes every table contains an id field when that may not be the case in future if we defined a different primary key field on a model, so I wonder if we should be robust to that? I also think we may want to end the program with an error if used on an unhandled data type, not return NULL, as I think this is unexpected and ought not continue silently. This could cause the script to fail if some of the changes mentioned above are not adjusted for. Probably it's better if the script fails and we hear about it.

I think Peter's suggestion from the Slack thread you linked may address some of those issues; did you discard it? Testing and alerting may also mitigate them.

Testing

This script contributes to an IG requirement, the approach and details involve some complexity, and it makes potentially destructive changes directly in the production database. So I think strong testing here is important. This is also important so if we need to maintain or change the script we can do so confidently and efficiently.

Tested this locally by changing the DATABASE_URL in .env to point to my local db instead which had copy of dump from production.

We can possibly rely only on manual testing if automated testing is unfeasible, but we should record and execute a very thorough test plan for assurance in that case. That seems particularly important if there are no automated tests.

I think it will be good to have automated testing if possible. I was imagining some level of fairly detailed unit and some integration testing would be needed. I think that is broadly in line with the approach we discussed in Slack towards the start of the work. What do you think?

Deployment

    @monitor(
        monitor_slug="dump_sanitised_db", monitor_config=monitor_config("0 14 * * *")
    )

I think this adjusts the Sentry monitor, not the execution schedule. That's controlled by app.json.

I am uncomfortable running this at an arbitrary time during the day after we first deploy it. As it runs directly in the production database, even with very thorough testing there is the risk that if something went wrong we could suffer material irretrievable data loss since the last backup. Instead, I suggest we initially set the schedule of this job to yearly, as that isn't run by the dokku cron process configured in app.json. So it won't run automatically.

We can test how the script behaves in production by taking a manual backup then running this script either via dokku or directly by the runjob management command to manually generate a sanitised backup. Once we're comfortable, we can set a schedule and Sentry monitor, but only once we're ready to run it on a schedule instead of dump_db.

What happens if we need a raw unsanitised copy of prod data? There are valid reasons why we might want that locally, temporarily, including debugging this script or other (potentially urgent) production issues. Maybe we should keep both scripts but dump_db is only run on demand?

Documentation

I think the script would benefit significantly from in-source documentation such as comments and docstrings, and perhaps more explanation in the git history. This script represents a significant amount of work and is implementing some complicated behaviour. So, significant commentary is probably needed for this to be easily understandable and therefore maintainable.

At one point you had 'section header' comments in the execute function like "Ensure output directory exists" that I think were a really useful starting point for immediately understanding the structure of the main execute function code blocks. I think the private functions are fairly well named and marked up with type hints, but I think what some of them are doing could be clearer if the code was restructured to be less compact, the more complicated logic merits some detailed commentary, and most of the private functions would benefit from a docstring.

Other points on the script

Have you thought about how we can best defend against accidental SQL injection? I wonder if we should be using something like quote_ident to wrap the expressions rather than doing it 'manually' in many F-strings. Is the level of escaping sufficient? Is it easy to get it right 'manually' if it needs to change? What do you think?

I am not sure I understand why the script doesn't terminate if there is no allowlist, and this behaviour is not documented that I can see. I am not sure we need _run_pg_dump_schema_only at all. I don't think this is part of the requirements. Maybe you used this while developing or testing? If you agree it's not necessary, can we simplify by removing this? Or is it still useful?

Further on _fake_expression, I don't think it needs to handle arbitrary schemas, or data types we don't currently exclude, and some of those replacements are hard to reason about, so could you reduce it down to just what we need at the moment as a starting point?

Use of git

I found it quite hard to follow the git commits and their messages. In general, please ensure your commits and commit messages follow the advice at https://cbea.ms/git-commit/. Keep your commits atomic, separate each functional change, and don't mix functional changes with restructures or refactors. In this case, as you are adding a new script, it may be easier to leave the commits as they are while developing and, before merging, rebase and edit them down to one commit that adds the new script, with a high-level explanation in the message and most of the commentary in the script. In future we won't need the detailed history of your development process, though it can be useful during review.

@nishtha-kalra marked this pull request as draft December 18, 2025 15:12
Introduce a per-table/column allowlist used by the dump_db job to build sanitized database dumps for local development. The allowlist excludes sensitive fields such as emails, phone numbers, and tokens while preserving schema and non-sensitive data.
- prevent sensitive data from being copied
- created an allowlist.json which provides all the columns that are
  safe to copy
Making them null so that the code doesn't break
Replaced it with json instead
- add fake data for non-allowlisted columns
- remove debug and print statements
- change script to run daily instead of hourly as this database copy is
  used by developers for local development
- improve readability
- updated script to make it more readable
- changed it to DailyJob instead of Hourly
Added a description so that someone who encounters it accidentally knows
what the schema is for
Added information about
- sanitised dump
- temporary schema
- allow list
Put back the original hourly script and renamed the daily script so that
we can test that the new script works end to end while the hourly script
keeps running as is.
Will replace the hourly script with the daily one once testing is done
End to end testing led to the discovery that, after restoring the
sanitised backup, the UI was not being rendered.

There were two issues:

1. The sanitised dump job copies data into temp_scrubbed_schema and
   dumps only that schema. Therefore DATABASE_URL needs the suffix
   ?options=-csearch_path=temp_scrubbed_schema,public so that the
   search path makes the app automatically read from the scrubbed data

2. django_session needed session_key and expire_date in the allow list
   because the job treats an empty list as “delete all rows.” Earlier,
   the table was listed with an empty array, so the sanitiser deleted
   every session entry. After restoring the dump, Django couldn’t log
   anyone in because the table had neither rows nor allowed columns.
   By allow-listing session_key and expire_date (while still redacting
   session_data), the session table remains usable: Django can create
   sessions using its normal schema, but sensitive payloads are still
   blanked out.
After initial set of review, got some suggestions to improve
_fake_expression function.

1. Removed the dependency on id, as not every table necessarily contains
   an id field. Replaced it with a randomly generated number instead.
2. Instead of returning NULL when an unhandled data type is encountered,
   throw an error so that we can either handle it or allow-list that
   column.
3. Only handling basic data types for now.
4. Fixed the column name trace_context in allowlist.json.
Addressing a comment on the PR that we would want to terminate the
script if allowlist is not present.
Removing _run_pg_dump_schema_only as we would not want to produce a
schema-only dump when the allowlist is not present.
Drop any leftover `temp_scrubbed_schema` before starting a dump, and
ensure `self._drop_temp_schema()` runs even if
`_create_safe_schema_and_copy` or `pg_dump` fail. This prevents
orphaned scratch schemas and avoids duplicate-key errors on reruns
after a KeyboardInterrupt.

Wrap the entire dump in `NamedTemporaryFile(..., delete=True)`. We now
write the dump into a temp file, copy only after everything succeeds,
and let the context manager delete the temp automatically.
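The cleanup pattern described in this commit can be sketched roughly as follows (illustrative only; `run_pg_dump`, `drop_temp_schema`, and the argument names are stand-ins for the real methods):

```python
import shutil
import tempfile


def dump_with_cleanup(run_pg_dump, drop_temp_schema, output_path):
    """Write the dump to a temp file, copying it into place only on success.

    The context manager deletes the temp file automatically, and the
    finally block drops the scratch schema even if pg_dump raises.
    """
    try:
        with tempfile.NamedTemporaryFile(delete=True) as tmp:
            run_pg_dump(tmp.name)  # may raise; temp file is still removed
            shutil.copy(tmp.name, output_path)  # only reached on success
    finally:
        drop_temp_schema()  # always runs, even after an error or interrupt
```

This way a failed run leaves neither a partial dump at the output path nor an orphaned scratch schema.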
1. Added some basic unit tests to make sure the script works as expected
2. Updated the error message in the script when allowlist.json file is
   not found
Initially queried and added is_nullable to the column metadata, as I was
planning to consider it while copying data, but decided to ignore this
property when creating fake values.
Added more unit tests to cover functions that were not covered before.

Fixed the get_column_metadata test: is_nullable was removed from the
script but I forgot to remove it from the test.
1. Removed some tables from the allowlist that were added when I created
   the allowlist using the production db as a reference. After testing,
   I realised these were legacy tables now replaced by jobserver.User.

2. Created a new file for integration tests and added some tests to
   verify that every allow-listed table and column exists in the
   database. Added another test to verify that non-allow-listed data is
   faked when copying tables.
Updated the _fake_expression function to append ROW_NUMBER() instead of
a randomly generated number. With a random number it is possible to
generate the same fake value twice and hit a unique constraint;
ROW_NUMBER() assigns a sequential number to each row, so each generated
fake value is unique.
Noticed that the execution schedule is controlled by app.json; @monitor
only adjusts the Sentry monitor. Initially setting the schedule of this
job to yearly, as that isn't run by the dokku cron process configured
in app.json, so it won't run automatically. We can manually run it and
verify that the sanitised dump works in this way. Thanks to @mikerkelly
for this suggestion.

Changes covered in this commit:
- Deleted the existing daily dump_sanitised_db.py script and
  allowlist.json and replaced it with a new yearly version.

- Updated the job class to inherit from YearlyJob instead of DailyJob.

- Adjusted the cron schedule for the job to run annually.

- Modified integration and unit tests to reference the new yearly job.

- Fixed 2 unit tests for _fake_expression as they were still expecting
  random values rather than ROW_NUMBER
- Refactored SQL execution to use psycopg's sql module for safer queries
  in order to prevent sql injection.
- Updated test test_create_safe_table_builds_sql to support this change.
Added a detailed document for the script explaining what it does, its
algorithm, how the allow list is created, and other notes and decisions
taken.
- Updated _build_select_expressions to throw an error if the allowlist
  references a column that is missing from the db. Added this change so
  that the script fails if the db changes but the allowlist does not.
- Added a unit test to cover this change.
- Added FAQ section in document to explain this change and other related
  questions.
- The restore-db recipe now checks whether the restored dump contains
  temp_scrubbed_schema. If it does, it drops public and renames the
  scrubbed schema to public. If not, meaning it’s an unsanitised dump,
  it leaves public untouched so the restore behaves exactly like it
  used to. This way restore-db jobserver.dump still restores the raw
  dump, and restore-db sanitised_jobserver.dump handles the schema
  rename automatically. Developers can keep using the old flow until we
  roll out the sanitised dump.

- Removed changes to DATABASE_URL in dotenv-sample as we don't need to
  manually point to temp schema with the changes in justfile.
Removed the mention of the sanitised dump script from the developers doc
because we are not rolling this script out to everyone yet. Once this
script is merged, we will test it manually first; once we are satisfied,
we will change it from a yearly to a daily dump and add its
documentation to the developers doc.
@nishtha-kalra marked this pull request as ready for review January 12, 2026 11:03
- Added a script to get a list of columns that are not present in
  allowlist.
- Added all_columns.txt and excluded_columns.txt in .gitignore so that
  they are not committed.
Changed the location of the script from jobserver/jobs/yearly/ to
scripts/, as otherwise it is treated as another Job module. Moving it
into the scripts/ folder is better as this is just a helper script.
- Improved docstring for the script and added markdown document to the
  docstring.
- Added manual testing steps to markdown doc and added context for
  clarity.
- Added information about the get_excluded_columns script
- Added a flowchart diagram for explanation

@mikerkelly left a comment


Thanks a lot, Nishtha. This is much improved. The overview documentation is excellent, explaining the "why" as well as the "what". The script to get the denied columns is very helpful.

There is a fair amount of new content since my first review, sorry it has taken a while to look at. There are a number of mostly minor inline comments, and I think we need to consider the points below. We are getting there!

Future work

We have discussed here and in Slack some possible future work. Do you have a list of those issues, will you create them?

SQL safety

I am concerned that the script may still be mixing safe and unsafe string interpolation and construction for SQL statements. I think that it is important to get this right as the script may be used on arbitrary data in future, depending how the database schema evolves.

From reading the psycopg2 documentation, I think that it is risky to construct SQL via f-string formatting, str concatenation, % formatting, or by composing any strings generated in these ways. It's just too hard to get the quoting and escaping right and consider all the edge cases. I would much rather fully follow the approach intended by that module and offload all the validation and safety to it, rather than partially roll our own validation with _valid_ident and mix strings validated through that with the other approach, it is hard to reason about whether that has been applied sufficiently and correctly in all instances to produce safe statements. Here is the approach described:

Usually you should express the template of your query as an SQL instance with {}-style placeholders and use format() to merge the variable parts into them, all of which must be Composable subclasses.

I think _get_column_metadata, _build_select_expressions, _fake_expression, and test_create_safe_schema_scrubs_disallowed_columns all interpolate unsafely or include unsafe fragments. I think _create_safe_schema_and_copy and _drop_temp_schema use the right approach. I am not sure we need _valid_ident or _normalize_table_name if we construct everything through the sql module.
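To illustrate why hand-rolled interpolation is hard to get right: Postgres identifier quoting must wrap the name in double quotes and double any embedded double quote, which a plain f-string template does not do. A toy sketch of just that one rule (real code should use psycopg's sql.Identifier, which implements this plus encoding validation and other edge cases):

```python
def quote_ident(name: str) -> str:
    """Postgres-style identifier quoting: wrap in double quotes and
    double any embedded double quote. Toy illustration only; psycopg's
    sql.Identifier handles this and more edge cases."""
    return '"' + name.replace('"', '""') + '"'


# An innocent-looking name is fine either way...
assert quote_ident("email") == '"email"'
# ...but a hostile one would break out of a naive f'"{name}"' template:
assert quote_ident('x"; DROP TABLE t; --') == '"x""; DROP TABLE t; --"'
```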

Requirements / correctness

The script executes each line of SQL as it constructs it. I think that this risks the copied tables being inconsistent with related tables if they are modified by another process between lines being written. So should we do all our change inside a transaction and/or take a write lock so nothing can vary while we create the dump?

The script can be triggered manually, so I think we should at least think about what happens if it is run twice at the same time and both edit the non-public schema. I think that would not have a good result. We could add a lockfile or other mechanism so that can not happen, but maybe the approach from the previous paragraph is sufficient (database locking or transactions)?
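One way to rule out concurrent runs is an atomic lockfile; a sketch (the lock path is hypothetical, and a Postgres advisory lock such as pg_try_advisory_lock would be an alternative that lives in the database itself):

```python
import os


def acquire_lock(path="/tmp/dump_sanitised_db.lock"):  # hypothetical path
    """Create the lockfile atomically. O_CREAT | O_EXCL guarantees only
    one process can succeed, so a second concurrent run fails fast."""
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        raise RuntimeError(f"another dump appears to be running (lock: {path})")
    os.close(fd)
    return path


def release_lock(path):
    # Remove the lockfile so the next run can acquire it.
    os.remove(path)
```

The caller would wrap the dump in try/finally so the lock is always released, even on error.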

I wondered if it is worth recording row counts for every table in the source and destination schemas and comparing them, to check that at least those are consistent. What do you think?

Performance/observability

I have not looked at performance, have you? Other than noting that it runs in some 10s of seconds locally. I think that is enough until we have evidence it is a problem. For observability, can we add an otel span around the whole script and show that on some suitable dashboard or a new one to make it easy to keep an eye on?

There is no logging at all at the moment. Can you please use structlog to record at least start/end of the main function and key params? Logging relevant details in the event of error would also be useful.

Testing

safety: I think each function that can produce SQL statements should have a test with a bad/malicious expression, verifying that the result is properly escaped SQL. I find it hard to be confident we are doing the right thing without that.

observation, no action required: A fuller integration test seems tricky; not sure if we would create a new database just for that one test, or still do a lot of database and file access patching. Just testing _create_safe_schema_and_copy with one database row and a short allowlist as you have done seems like a reasonable compromise.

optional: It would be good to have at least one test of the main execute function, with some appropriate level of patching/mocking. That code is currently untested, I think. Not essential though, manual testing can compensate, and I suppose any future development would want to do manual testing also.

docker compose run -v "$path:/tmp/jobserver.dump" --entrypoint bash --rm dev -c "pg_restore --clean --if-exists --no-acl --no-owner -d \$DATABASE_URL /tmp/jobserver.dump"
docker compose run -v "$path:/tmp/jobserver.dump" --entrypoint bash --rm dev -c '
set -eux
pg_restore --clean --if-exists --no-acl --no-owner -d "$DATABASE_URL" /tmp/jobserver.dump
'


minor, documentation: Could use a comment here explaining that if restoring the sanitised dump we need to alter the scratch schema name to public, and a reference to the overview documentation.

just devenv
```

4. In another terminal, download and restore the latest production database dump to a separate database. We need to do this so that the sanitised db script can use this production like db when creating a sanitised dump. In real case scenario, we would not need to do this as we will be pointing the db on dokku server.


minor, wording:

Suggested change
4. In another terminal, download and restore the latest production database dump to a separate database. We need to do this so that the sanitised db script can use this production like db when creating a sanitised dump. In real case scenario, we would not need to do this as we will be pointing the db on dokku server.
4. In another terminal, download and restore the latest production database dump to a separate database so we can run the script against it without changing the local development `jobserver` database. In production, the script runs directly against the production `jobserver` database.

6. Build assets

```
just assets-rebuild


Question: is this step necessary?

just manage runjob dump_sanitised_db
```

8. Restore this sanitised db


minor, wording:

Suggested change
8. Restore this sanitised db
8. Restore the sanitised database dump into your local `jobserver` database, deleting all existing data.

## Manually testing the script

We can manually test this script by following these steps:



minor: can we warn that this will overwrite all their local data and suggest taking a backup first if required?

Comment on lines +161 to +169
def _create_safe_table(self, cur, src_table, dst_table) -> None:
    """Create a destination table identical to the source."""
    create_sql = sql.SQL(
        "CREATE TABLE IF NOT EXISTS {dst} (LIKE {src} INCLUDING ALL);"
    ).format(dst=dst_table, src=src_table)
    try:
        cur.execute(create_sql)
    except Exception as exc:
        raise RuntimeError(f"Failed to create table {dst_table}: {exc}")


naming: why is this "safe"?
documentation: I think it is clearer to say it is "an empty destination table with the same structure as the src".
SQL safety: I think this is only safe if the parameters are sql.Identifier. At the moment they are, but that would be easy to get wrong in future. At least we should provide a type hint and/or mention it in the docstring. I think it would be safer to check types and raise ValueError if not as required. I am also tempted to say we could maybe drop this function entirely, as it is only used once; I am not sure the indirection adds much value.

from jobserver.jobs.yearly import dump_sanitised_db


@pytest.mark.django_db

@mikerkelly Jan 18, 2026


taste, optional: I am not sure why this is an integration test; could you explain? I think it might live happily in its own file in the root of tests.

Comment on lines +55 to +69
with connection.cursor() as cur:
    cur.execute(
        """
        SELECT id, username, email, password
        FROM "temp_scrubbed_schema"."jobserver_user"
        WHERE id = %s
        """,
        [user.id],
    )
    row = cur.fetchone()
assert row is not None
assert row[0] == user.id
assert row[1] == user.username
assert row[2].startswith("fake_jobserver_user_email_")
assert row[3].startswith("fake_jobserver_user_password_")


This is brittle to the detail of the faking. Can we assert that columns 2 and 3 are not equal to the original values, instead?

from jobserver.jobs.yearly import dump_sanitised_db


class StubCursor:

@mikerkelly Jan 18, 2026


optional: The sql parameter names in these classes shadow the imported sql. I think it is prudent to use a different name like query.

return dump_sanitised_db.Job()


def test_load_allowlist_success(tmp_path, job):

@mikerkelly Jan 18, 2026


optional, performance: I think your allow list tests are hitting the real file system, which is slow and unnecessary. Can you mock the file access instead?

Issue:
Encountered a duplicate key value error when trying to add a member to
a project or run a test job.

Reason:
The sanitised job copies data into temp_scrubbed_schema but never
advanced the per-table sequences, so every *_id_seq remained at 1. When
we restored sanitised_jobserver.dump, Postgres recreated the tables with
real rows but left the sequences at 1, meaning the very next insert
reused id=1 and failed with duplicate key value violates …_pkey.

Fix:
Added a setval() step after each table copy so the scratch schema's
sequence is bumped to MAX(id) (only for tables with an integer id).
Because pg_dump captures that setval, the restored dump now carries
correct sequence values, just like the raw production dump.

Testing:
After this fix, I was able to perform the following actions
successfully:
- Editing a user's full name
- Adding a user to an organisation
- Giving the user access to a backend
- Adding a member to a project
- Reaching the run-jobs page and running a job
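For each copied table with an integer id, the sequence fix described above amounts to emitting a statement of roughly this shape (a sketch using plain string formatting for readability only; the real script would compose identifiers safely via psycopg's sql module, and the sequence and table names here are hypothetical examples):

```python
def build_setval_sql(sequence: str, table: str) -> str:
    # Bump the sequence to MAX(id) so pg_dump captures a setval() call and
    # inserts after restore don't reuse existing ids. COALESCE leaves the
    # sequence at 1 for empty tables.
    return (
        f"SELECT setval('{sequence}', "
        f"COALESCE((SELECT MAX(id) FROM {table}), 1));"
    )


stmt = build_setval_sql(
    "temp_scrubbed_schema.jobserver_user_id_seq",
    "temp_scrubbed_schema.jobserver_user",
)
```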
@nishtha-kalra marked this pull request as draft January 27, 2026 22:58