Update db export script to generate PII scrubbed copy of the Job Server production database #5437
nishtha-kalra wants to merge 38 commits into `main`
Conversation
Overall
This script seems like a well-structured solution to a very knotty problem that was complicated by our single-node infrastructure. I think the overall approach is good, given our constraints, and appreciate how you've broken down the main function into several smaller functions for structure. However, I do think there are several issues that must be resolved before we can merge this, some of them quite significant. I have some comments and questions below and inline. I appreciate this is quite a lot of feedback all at once, on a big complicated topic, so I've tried to keep it high level rather than too many line-by-line code comments.
Included/excluded fields
I think you decided which fields to include or exclude during this work. This was an area mentioned to be explored in the issue. How did you decide on what to include or exclude? It is not easy to analyse what is excluded from the allow list. There are many more inclusions than exclusions but the exclusions are the most significant information, and are only implicit in the allow list. They will also change over time as fields are added. How will someone looking at this in future understand what they are? Is there anything that can be documented about your thought process? If you used any temporary tooling for this, can it be preserved somewhere in case it is useful if we need to revisit this?
What will happen if a field is renamed, removed, or changes data type; or a developer forgets to add a new or changed one to the allow list? Will the script fail? Will the dump be valid?
I am concerned that _fake_expression may lead to errors if used on a table with a uniqueness constraint. I also think it assumes every table contains an id field, which may not hold in future if we define a different primary key field on a model, so I wonder if we should be robust to that? I also think we should end the program with an error if it is used on an unhandled data type, rather than return NULL, as that case is unexpected and ought not to continue silently. This could cause the script to fail if some of the changes mentioned above are not adjusted for, but it is probably better that the script fails and we hear about it.
I think Peter's suggestion from the Slack thread you link may address some of those issues, did you discard it? Also testing and alerting may mitigate it.
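To make the concern concrete, here is a minimal sketch of the fail-loudly behaviour being suggested. All names, the type list, and the exact fake-value format are illustrative assumptions, not the script's actual code; the point is that unique values come from ROW_NUMBER() rather than an assumed `id` column, and unknown types raise instead of silently becoming NULL.

```python
# Hypothetical sketch only: names and the handled type list are assumptions.

class UnhandledDataTypeError(Exception):
    """Raised when a column's data type has no faking rule."""


def fake_expression(table: str, column: str, data_type: str) -> str:
    """Return a SQL expression producing a unique fake value, or raise.

    ROW_NUMBER() OVER () gives a distinct value per row, so uniqueness
    constraints cannot be violated and no `id` column is assumed.
    """
    prefix = f"fake_{table}_{column}_"
    if data_type in ("text", "character varying"):
        return f"'{prefix}' || ROW_NUMBER() OVER ()"
    if data_type in ("integer", "bigint"):
        return "ROW_NUMBER() OVER ()"
    # Fail loudly instead of writing NULL, so schema changes that
    # introduce new types are noticed rather than silently ignored.
    raise UnhandledDataTypeError(
        f"No faking rule for {table}.{column} of type {data_type}"
    )
```

A caller would then either add a rule for the new type or allow-list the column explicitly.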
Testing
This script contributes to an IG requirement, the approach and details involve some complexity, and it makes potentially destructive changes directly in the production database. So I think strong testing here is important. This is also important so if we need to maintain or change the script we can do so confidently and efficiently.
> Tested this locally by changing the `DATABASE_URL` in `.env` to point to my local db instead, which had a copy of a dump from production.
We can possibly rely only on manual testing if automated testing is unfeasible, but we should record and execute a very thorough test plan for assurance in that case. That seems particularly important if there are no automated tests.
I think it will be good to have automated testing if possible. I was imagining some level of fairly detailed unit and some integration testing would be needed. I think that is broadly in line with the approach we discussed in Slack towards the start of the work. What do you think?
Deployment
```python
@monitor(
    monitor_slug="dump_sanitised_db", monitor_config=monitor_config("0 14 * * *")
)
```
I think this adjusts the Sentry monitor, not the execution schedule. That's controlled by `app.json`.
I am uncomfortable running this at an arbitrary time during the day after we first deploy it. As it runs directly in the production database, even with very thorough testing there is the risk that if something went wrong we could suffer material irretrievable data loss since the last backup. Instead, I suggest we initially set the schedule of this job to yearly, as that isn't run by the dokku cron process configured in app.json. So it won't run automatically.
We can test how the script behaves in production by taking a manual backup then running this script either via dokku or directly by the runjob management command to manually generate a sanitised backup. Once we're comfortable, we can set a schedule and Sentry monitor, but only once we're ready to run it on a schedule instead of dump_db.
What happens if we need a raw unsanitised copy of prod data? There are valid reasons why we might want that locally, temporarily, including debugging this script or other (potentially urgent) production issues. Maybe we should keep both scripts but dump_db is only run on demand?
Documentation
I think the script would benefit significantly from in-source documentation such as comments and docstrings, and perhaps more explanation in the git history. This script represents a significant amount of work and is implementing some complicated behaviour. So, significant commentary is probably needed for this to be easily understandable and therefore maintainable.
At one point you had 'section header' comments in the execute function like "Ensure output directory exists" that I think were a really useful starting point for immediately understanding the structure of the main execute function code blocks. I think the private functions are fairly well named and marked up with type hints, but I think what some of them are doing could be clearer if the code was restructured to be less compact, the more complicated logic merits some detailed commentary, and most of the private functions would benefit from a docstring.
Other points on the script
Have you thought about how we can best defend against accidental SQL injection? I wonder if we should be using something like quote_ident to wrap the expressions rather than doing it 'manually' in many F-strings. Is the level of escaping sufficient? Is it easy to get it right 'manually' if it needs to change? What do you think?
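To illustrate what quote_ident defends against: the escaping rule for PostgreSQL identifiers is to wrap the name in double quotes and double any embedded double quotes. This toy helper only demonstrates the rule; in real code the psycopg `sql` module (`sql.Identifier`) should do this, not hand-rolled string handling.

```python
# Illustration of the identifier-escaping rule applied by PostgreSQL's
# quote_ident and psycopg's sql.Identifier. Do NOT hand-roll this in
# production code; prefer the sql module, which also handles edge cases.

def quote_identifier(name: str) -> str:
    """Quote a SQL identifier, neutralising embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'
```

A malicious or unusual column name is then treated as one opaque identifier rather than becoming part of the statement's structure.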
I am not sure I understand why the script doesn't terminate if there is no allowlist, and this behaviour is not documented that I can see. I am also not sure we need _run_pg_dump_schema_only at all; I don't think it is part of the requirements. Maybe you used it while developing or testing? If you agree it's not necessary, can we simplify by removing it? Or is it still useful?
Further on _fake_expression, I don't think it needs to handle arbitrary schemas, or data types we don't currently exclude, and some of those replacements are hard to reason about, so could you reduce it down to just what we need at the moment as a starting point?
Use of git
I found it quite hard to follow the git commits and their messages. In general, please ensure your commits and commit messages follow the advice at https://cbea.ms/git-commit/; keep your commits atomic, separate each functional change, and don't mix those with restructures or refactors. In this case, as you are adding a new script, it may be easier to leave the commits as they are while developing, then rebase before merging and edit down to a single commit that adds the new script, with a high-level explanation in the commit message and most of the commentary in the script. In future we won't need the detailed history of your development process, but it can be useful during review.
Introduce a per-table/column allowlist used by the dump_db job to build sanitized database dumps for local development. The allowlist excludes sensitive fields such as emails, phone numbers, and tokens while preserving schema and non-sensitive data.
- remove sensitive data from getting copied
- created an allowlist.json which provides all the columns that are safe to copy
Making them null so that the code doesn't break
Replaced it with json instead
- add fake data for non-allowlisted columns
- remove debug and print statements
- change script to run daily instead of hourly, as this database copy is used by developers for local development
- improve readability
- updated script to make it more readable
- changed it to DailyJob instead of HourlyJob
Added a description so that someone who encounters it accidentally knows what the schema is for
Added information about:
- sanitised dump
- temporary schema
- allow list
Put back the original hourly script and renamed the daily script so that we can test that the new script works end to end while the hourly script keeps running as is. Will replace the hourly script with the daily one once testing is done.
End-to-end testing led to the discovery that, after restoring the sanitised backup, the UI was not rendered. There were two issues:

1. The sanitised dump job copies data into `temp_scrubbed_schema` and dumps only that schema, so `DATABASE_URL` needs the suffix `?options=-csearch_path=temp_scrubbed_schema,public` so that the search path makes the app automatically read from the scrubbed data.
2. `django_session` needed `session_key` and `expire_date` in the allow list, because the job treats an empty list as "delete all rows". Earlier, the table was listed with an empty array, so the sanitiser deleted every session entry, and after restoring the dump Django couldn't log anyone in because the table had neither rows nor allowed columns. By allow-listing `session_key` and `expire_date` (while still redacting `session_data`), the session table remains usable: Django can create sessions using its normal schema, but sensitive payloads are still blanked out.
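For reference, the `DATABASE_URL` suffix for the search-path fix looks like this (a sketch; the credentials, host, and database name are placeholders, not real values):

```
DATABASE_URL=postgres://user:password@localhost:5432/jobserver?options=-csearch_path=temp_scrubbed_schema,public
```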
After the initial round of review, I got some suggestions to improve the _fake_expression function:

1. Removed the dependency on id, as not every table may contain an id field; replaced it with a randomly generated number instead.
2. Instead of returning NULL when an unhandled data type is encountered, throw an error so that we can either handle it or allow-list that column.
3. Only handling basic data types for now.
4. Fixed the column name trace_context in allowlist.json.
Addressing a comment on the PR that we would want to terminate the script if the allowlist is not present. Removing _run_pg_dump_schema_only as we would not want to dump the schema only when the allowlist is not present.
Drop any leftover `temp_scrubbed_schema` before starting a dump, and ensure `self._drop_temp_schema()` runs even if `_create_safe_schema_and_copy` or `pg_dump` fail. This prevents orphaned scratch schemas and avoids duplicate-key errors on reruns after a KeyboardInterrupt. Wrap the entire dump in `NamedTemporaryFile(..., delete=True)`. We now write the dump into a temp file, copy only after everything succeeds, and let the context manager delete the temp automatically.
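The cleanup pattern this commit describes can be sketched as follows. The names are hypothetical: simple callbacks stand in for the pg_dump invocation and the `DROP SCHEMA` step so that the shape of the control flow is visible.

```python
# Sketch only: write_dump and drop_temp_schema stand in for the real
# pg_dump call and the DROP SCHEMA statement.
import shutil
import tempfile


def dump_with_cleanup(write_dump, drop_temp_schema, final_path):
    """Write the dump to a temp file, promote it only on success,
    and always drop the scratch schema, even on failure."""
    try:
        with tempfile.NamedTemporaryFile(delete=True) as tmp:
            write_dump(tmp.name)  # may raise; temp file is auto-deleted either way
            shutil.copy(tmp.name, final_path)  # only reached on success
    finally:
        # Runs on success, failure, or KeyboardInterrupt, preventing
        # orphaned scratch schemas and duplicate-key errors on rerun.
        drop_temp_schema()
```

The `finally` block is what guarantees no leftover `temp_scrubbed_schema`, and the temp file means a partial dump never overwrites a good one.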
1. Added some basic unit tests to make sure the script works as expected.
2. Updated the error message in the script when the allowlist.json file is not found.
Initially I queried and added is_nullable in the column metadata, as I was planning to consider it while copying data, but decided to ignore this property when creating fake values.
Added more unit tests to cover functions that were not covered before. Fixed a test of get_column_metadata: I removed is_nullable from the script but forgot to remove it from the test.
1. Removed some tables from the allowlist that were added when I created the allowlist using the production db as reference. After testing, I realised these were legacy tables that are now replaced by jobserver.User.
2. Created a new file for integration tests and added tests to verify that every allow-listed table and column exists in the database. Added another test to verify that non-allow-listed data is faked when copying tables.
Updated the _fake_expression function to append ROW_NUMBER() instead of a randomly generated number. The reason is that with a randomly generated number it is possible to generate the same fake value twice and hit a unique constraint. ROW_NUMBER() assigns a sequential number to each row, so each generated fake value will be unique.
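As a sketch, the resulting copy statement then looks roughly like this (the table and column names are illustrative, not the script's actual output):

```sql
INSERT INTO temp_scrubbed_schema.jobserver_user (id, username, email)
SELECT
    id,
    username,
    -- ROW_NUMBER() guarantees a distinct fake value per row, so a
    -- unique constraint on the scrubbed column cannot be violated
    'fake_jobserver_user_email_' || ROW_NUMBER() OVER ()
FROM public.jobserver_user;
```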
Noticed that the execution schedule is controlled by app.json; @monitor only adjusts the Sentry monitor. Initially setting the schedule of this job to yearly, as that isn't run by the dokku cron process configured in app.json, so it won't run automatically. We can manually run it and verify that the sanitised dump works this way. Thanks to @mikerkelly for this suggestion. Changes covered in this commit:
- Deleted the existing daily dump_sanitised_db.py script and allowlist.json and replaced them with a new yearly version.
- Updated the job class to inherit from YearlyJob instead of DailyJob.
- Adjusted the cron schedule for the job to run annually.
- Modified integration and unit tests to reference the new yearly job.
- Fixed 2 unit tests for _fake_expression that were still expecting random values rather than ROW_NUMBER.
- Refactored SQL execution to use psycopg's sql module for safer queries, in order to prevent SQL injection.
- Updated the test test_create_safe_table_builds_sql to support this change.
Added a detailed document for the script explaining what it does, its algorithm, how the allow list is created, and other notes and decisions taken.
- Updated _build_select_expressions to throw an error if the allowlist references a column that is missing from the db. Added this change so that the script fails if there is a change in the db but not in the allowlist.
- Added a unit test to cover this change.
- Added an FAQ section in the document to explain this change and other related questions.
- The restore-db recipe now checks whether the restored dump contains temp_scrubbed_schema. If it does, it drops public and renames the scrubbed schema to public. If not, meaning it is an unsanitised dump, it leaves public untouched so the restore behaves exactly as it used to. This way `restore-db jobserver.dump` still restores the raw dump, and `restore-db sanitised_jobserver.dump` handles the schema rename automatically. Developers can keep using the old flow until we roll out the sanitised dump.
- Removed the changes to DATABASE_URL in dotenv-sample, as with the justfile changes we no longer need to manually point to the temp schema.
Removed the mention of the sanitised dump script from the developers doc because we are not rolling this script out to everyone yet. Once it is merged, we will test it manually first and, once we are satisfied, change it from a yearly to a daily dump and add its documentation to the developers doc.
- Added a script to get a list of columns that are not present in the allowlist.
- Added all_columns.txt and excluded_columns.txt to .gitignore so that they are not committed.
Changed the location of the script from jobserver/jobs/yearly/ to scripts/, as otherwise the script is treated as another Job module. Moving it into the scripts/ folder is better as this is just a helper script.
- Improved the docstring for the script and added a markdown document referenced from the docstring.
- Added manual testing steps to the markdown doc and added context for clarity.
- Add information about the get_excluded_columns script
- Add a flowchart diagram to explain the process
mikerkelly left a comment
Thanks a lot, Nishtha. This is much improved. The overview documentation is excellent, explaining the "why" as well as the "what". The script to get the denied columns is very helpful.
There is a fair amount of new content since my first review, sorry it has taken a while to look at. There are a number of mostly minor inline comments, and I think we need to consider the points below. We are getting there!
Future work
We have discussed here and in Slack some possible future work. Do you have a list of those issues, will you create them?
SQL safety
I am concerned that the script may still be mixing safe and unsafe string interpolation and construction for SQL statements. I think that it is important to get this right as the script may be used on arbitrary data in future, depending how the database schema evolves.
From reading the psycopg2 documentation, I think it is risky to construct SQL via f-string formatting, str concatenation, % formatting, or by composing any strings generated in these ways. It is just too hard to get the quoting and escaping right and consider all the edge cases. I would much rather fully follow the approach intended by that module and offload all the validation and safety to it, rather than partially roll our own validation with _valid_ident and mix strings validated that way with the other approach; it is hard to reason about whether that has been applied sufficiently and correctly in all instances to produce safe statements. Here is the approach described:
> Usually you should express the template of your query as an `SQL` instance with `{}`-style placeholders and use `format()` to merge the variable parts into them, all of which must be `Composable` subclasses.
I think _get_column_metadata, _build_select_expressions, _fake_expression, and test_create_safe_schema_scrubs_disallowed_columns all interpolate unsafely or include unsafe fragments. I think _create_safe_schema_and_copy and _drop_temp_schema use the right approach. I am not sure we need _valid_ident or _normalize_table_name if we construct everything through the sql module.
Requirements / correctness
The script executes each line of SQL as it constructs it. I think this risks the copied tables being inconsistent with related tables if they are modified by another process between lines being written. So should we do all our changes inside a transaction and/or take a write lock so nothing can vary while we create the dump?
The script can be triggered manually, so I think we should at least think about what happens if it is run twice at the same time and both edit the non-public schema. I think that would not have a good result. We could add a lockfile or other mechanism so that can not happen, but maybe the approach from the previous paragraph is sufficient (database locking or transactions)?
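One possible shape for preventing concurrent runs, as a sketch: take a Postgres session-level advisory lock keyed on the job, so a second simultaneous run fails fast instead of both editing the same scratch schema. The lock key and function names here are illustrative, and real code should compose SQL via psycopg's sql module rather than f-strings.

```python
# Sketch: serialise runs of the dump job with a Postgres advisory lock.
# DUMP_LOCK_KEY is an arbitrary stable integer assumed unique to this job.
DUMP_LOCK_KEY = 5437


def acquire_lock_sql(key: int = DUMP_LOCK_KEY) -> str:
    """SQL returning true if we got the lock, false if another run holds it."""
    return f"SELECT pg_try_advisory_lock({key});"


def release_lock_sql(key: int = DUMP_LOCK_KEY) -> str:
    """SQL releasing the lock once the dump has finished (or failed)."""
    return f"SELECT pg_advisory_unlock({key});"
```

The caller would execute the acquire statement first, abort with a clear error if it returns false, and release the lock in a finally block.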
I wondered if it is worth recording row counts for every table in the source and destination schemas and comparing them, to check that at least those are consistent. What do you think?
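A sketch of that consistency check (a hypothetical helper; the real script would populate the dicts from a `COUNT(*)` per table in each schema):

```python
# Sketch only: compares per-table row counts between the source (public)
# and destination (scrubbed) schemas and reports any mismatches.

def compare_row_counts(source: dict, dest: dict) -> list:
    """Return human-readable descriptions of tables whose counts differ."""
    problems = []
    for table, n_src in sorted(source.items()):
        n_dst = dest.get(table)
        if n_dst != n_src:
            problems.append(f"{table}: source={n_src}, dest={n_dst}")
    return problems
```

An empty result means every table copied the expected number of rows; anything else could fail the job before the dump is written.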
Performance/observability
I have not looked at performance, have you? Other than noting that it runs in some 10s of seconds locally. I think that is enough until we have evidence it is a problem. For observability, can we add an otel span around the whole script and show that on some suitable dashboard or a new one to make it easy to keep an eye on?
There is no logging at all at the moment. Can you please use structlog to record at least start/end of the main function and key params? Logging relevant details in the event of error would also be useful.
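A sketch of the start/end logging pattern being asked for. The codebase would use structlog; stdlib logging is shown here only to keep the example self-contained, and all names are illustrative.

```python
# Sketch: log the start and end of the main function plus key params,
# and log failures with context. structlog's bound loggers would carry
# output_path on every event; stdlib logging shown for self-containment.
import logging

logger = logging.getLogger("dump_sanitised_db")


def execute(output_path: str) -> None:
    logger.info("sanitised dump starting: output_path=%s", output_path)
    try:
        pass  # ... create scratch schema, copy tables, run pg_dump ...
    except Exception:
        logger.exception("sanitised dump failed")
        raise
    logger.info("sanitised dump finished")
```

On failure, `logger.exception` records the traceback, so a Sentry or log search has enough context without rerunning the job.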
Testing
safety: I think each function that can produce SQL statements should have a test with a bad/malicious expression, testing that the result is properly escaped SQL. I find it hard to be confident we are doing the right thing without that.
observation, no action required: A fuller integration test seems tricky; I am not sure if we would create a new database just in that one test, or still do a lot of database and file access patching. Just testing _create_safe_schema_and_copy with one database row and a short allowlist, as you have done, seems like a reasonable compromise.
optional: It would be good to have at least one test of the main execute function, with some appropriate level of patching/mocking. That code is currently untested, I think. Not essential though, manual testing can compensate, and I suppose any future development would want to do manual testing also.
```diff
-docker compose run -v "$path:/tmp/jobserver.dump" --entrypoint bash --rm dev -c "pg_restore --clean --if-exists --no-acl --no-owner -d \$DATABASE_URL /tmp/jobserver.dump"
+docker compose run -v "$path:/tmp/jobserver.dump" --entrypoint bash --rm dev -c '
+    set -eux
+    pg_restore --clean --if-exists --no-acl --no-owner -d "$DATABASE_URL" /tmp/jobserver.dump
```
minor, documentation: Could use a comment here explaining that if restoring the sanitised dump we need to alter the scratch schema name to public, and a reference to the overview documentation.
```
just devenv
```
> 4. In another terminal, download and restore the latest production database dump to a separate database. We need to do this so that the sanitised db script can use this production like db when creating a sanitised dump. In real case scenario, we would not need to do this as we will be pointing the db on dokku server.
minor, wording:
```diff
-4. In another terminal, download and restore the latest production database dump to a separate database. We need to do this so that the sanitised db script can use this production like db when creating a sanitised dump. In real case scenario, we would not need to do this as we will be pointing the db on dokku server.
+4. In another terminal, download and restore the latest production database dump to a separate database so we can run the script against it without changing the local development `jobserver` database. In production, the script runs directly against the production `jobserver` database.
```
> 6. Build assets

```
just assets-rebuild
```
Question: is this step necessary?
```
just manage runjob dump_sanitised_db
```

> 8. Restore this sanitised db
minor, wording:
```diff
-8. Restore this sanitised db
+8. Restore the sanitised database dump into your local `jobserver` database, deleting all existing data.
```
> ## Manually testing the script
>
> We can manually test this script by following these steps:
minor: can we warn this will overwrite all their local data and suggest taking a backup before if required?
```python
def _create_safe_table(self, cur, src_table, dst_table) -> None:
    """Create a destination table identical to the source."""
    create_sql = sql.SQL(
        "CREATE TABLE IF NOT EXISTS {dst} (LIKE {src} INCLUDING ALL);"
    ).format(dst=dst_table, src=src_table)
    try:
        cur.execute(create_sql)
    except Exception as exc:
        raise RuntimeError(f"Failed to create table {dst_table}: {exc}")
```
naming: why is this "safe"?
documentation: I think it is clearer to say it is "an empty destination table with the same structure as the src".
SQL safety: I think this is only safe if the parameters are sql.Identifier. At the moment they are, but that would be easy to get wrong in future. At least we should provide a type hint and/or mention it in the docstring. I think it would be safer to check types and raise ValueError if not as required. I am also tempted to say we could maybe not have this function at all as it is only used once, I am not sure the indirection adds that much value.
```python
from jobserver.jobs.yearly import dump_sanitised_db


@pytest.mark.django_db
```
taste, optional: I am not sure why this is an integration test, could you explain? I think it might live happily in its own file in the root of tests.
```python
with connection.cursor() as cur:
    cur.execute(
        """
        SELECT id, username, email, password
        FROM "temp_scrubbed_schema"."jobserver_user"
        WHERE id = %s
        """,
        [user.id],
    )
    row = cur.fetchone()
assert row is not None
assert row[0] == user.id
assert row[1] == user.username
assert row[2].startswith("fake_jobserver_user_email_")
assert row[3].startswith("fake_jobserver_user_password_")
```
This is brittle to the detail of the faking. Can we assert that row[2] and row[3] are not equal to the original values, instead?
```python
from jobserver.jobs.yearly import dump_sanitised_db


class StubCursor:
```
optional: The sql parameter names in these classes shadow the imported sql. I think it is prudent to use a different name like query.
```python
    return dump_sanitised_db.Job()


def test_load_allowlist_success(tmp_path, job):
```
optional, performance: I think your allow list tests are hitting the real file system, which is slow and unnecessary. Can you mock the file access instead?
Issue: Encountered a duplicate key value error when trying to add a member to a project or run a test job.

Reason: The sanitised job copies data into temp_scrubbed_schema but never advanced the per-table sequences, so every *_id_seq remained at 1. When we restored sanitised_jobserver.dump, postgres recreated the tables with real rows but left the sequences at 1, meaning the very next insert reused id=1 and failed with `duplicate key value violates …_pkey`.

Fix: Added a setval() step after each table copy so the scratch schema's sequence is bumped to MAX(id) (only for tables with an integer id). Because pg_dump captures that setval, the restored dump now carries correct sequence values just like the raw production dump.

Testing: After this fix, I was able to test the following actions successfully:
- Editing a user's full name
- Adding a user to an organisation
- Giving the user access to a backend
- Adding a member to a project
- Reaching the run jobs page with the ability to run a job
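The setval step described in this commit can be sketched as follows. The helper name is hypothetical, it assumes Django's default `<table>_id_seq` sequence naming and an integer id column, and real code should compose statements via psycopg's sql module rather than f-strings.

```python
# Sketch only: builds the sequence-advancing statement described above.

def bump_sequence_sql(schema: str, table: str) -> str:
    """SQL advancing <table>_id_seq to MAX(id), so pg_dump captures the
    sequence position alongside the copied rows (COALESCE handles empty
    tables by falling back to 1)."""
    return (
        f"SELECT setval('{schema}.{table}_id_seq', "
        f"COALESCE((SELECT MAX(id) FROM {schema}.{table}), 1));"
    )
```

Because the statement runs inside the scratch schema before pg_dump, the restored database's next insert allocates MAX(id)+1 instead of reusing id=1.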
Description
Issue - https://github.com/opensafely-core/security/issues/33
Tested this locally by changing the `DATABASE_URL` in `.env` to point to my local db instead, which had a copy of a dump from production.

Manual testing on new setup:

Make sure the `OUTPUT_PATH` mentioned in the script is available on your system. I temporarily changed the path to `./sanitised_jobserver.dump`.