Update db export script to generate PII scrubbed copy of the Job Server production database #5437
nishtha-kalra wants to merge 38 commits into `main`
Conversation
Overall
This script seems like a well-structured solution to a very knotty problem that was complicated by our single-node infrastructure. I think the overall approach is good, given our constraints, and appreciate how you've broken down the main function into several smaller functions for structure. However, I do think there are several issues that must be resolved before we can merge this, some of them quite significant. I have some comments and questions below and inline. I appreciate this is quite a lot of feedback all at once, on a big complicated topic, so I've tried to keep it high level rather than too many line-by-line code comments.
Included/excluded fields
I think you decided which fields to include or exclude during this work. This was an area mentioned to be explored in the issue. How did you decide on what to include or exclude? It is not easy to analyse what is excluded from the allow list. There are many more inclusions than exclusions but the exclusions are the most significant information, and are only implicit in the allow list. They will also change over time as fields are added. How will someone looking at this in future understand what they are? Is there anything that can be documented about your thought process? If you used any temporary tooling for this, can it be preserved somewhere in case it is useful if we need to revisit this?
What will happen if a field is renamed, removed, or changes data type; or a developer forgets to add a new or changed one to the allow list? Will the script fail? Will the dump be valid?
I am concerned that _fake_expression may lead to errors if used on a table with a uniqueness constraint. I also think it assumes every table contains an id field, which may not hold in future if we define a different primary key field on a model, so I wonder if we should be robust to that? I also think we should end the program with an error if it is used on an unhandled data type, rather than return NULL, as that case is unexpected and ought not to continue silently. This could cause the script to fail if some of the changes mentioned above are not adjusted for, but it is probably better that the script fails and we hear about it.
I think Peter's suggestion from the Slack thread you link may address some of those issues, did you discard it? Also testing and alerting may mitigate it.
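To make the concern concrete, here is a minimal sketch of the fail-loudly behaviour being suggested. All names, the type list, and the exact fake-value format are illustrative assumptions, not the script's actual code; the point is that unique values come from ROW_NUMBER() rather than an assumed `id` column, and unknown types raise instead of silently becoming NULL.

```python
# Hypothetical sketch only: names and the handled type list are assumptions.

class UnhandledDataTypeError(Exception):
    """Raised when a column's data type has no faking rule."""


def fake_expression(table: str, column: str, data_type: str) -> str:
    """Return a SQL expression producing a unique fake value, or raise.

    ROW_NUMBER() OVER () gives a distinct value per row, so uniqueness
    constraints cannot be violated and no `id` column is assumed.
    """
    prefix = f"fake_{table}_{column}_"
    if data_type in ("text", "character varying"):
        return f"'{prefix}' || ROW_NUMBER() OVER ()"
    if data_type in ("integer", "bigint"):
        return "ROW_NUMBER() OVER ()"
    # Fail loudly instead of writing NULL, so schema changes that
    # introduce new types are noticed rather than silently ignored.
    raise UnhandledDataTypeError(
        f"No faking rule for {table}.{column} of type {data_type}"
    )
```

A caller would then either add a rule for the new type or allow-list the column explicitly.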
Testing
This script contributes to an IG requirement, the approach and details involve some complexity, and it makes potentially destructive changes directly in the production database. So I think strong testing here is important. This is also important so if we need to maintain or change the script we can do so confidently and efficiently.
> Tested this locally by changing the `DATABASE_URL` in `.env` to point to my local db instead, which had a copy of a dump from production.
We can possibly rely only on manual testing if automated testing is unfeasible, but we should record and execute a very thorough test plan for assurance in that case. That seems particularly important if there are no automated tests.
I think it will be good to have automated testing if possible. I was imagining some level of fairly detailed unit and some integration testing would be needed. I think that is broadly in line with the approach we discussed in Slack towards the start of the work. What do you think?
Deployment
```python
@monitor(
    monitor_slug="dump_sanitised_db", monitor_config=monitor_config("0 14 * * *")
)
```
I think this adjusts the Sentry monitor, not the execution schedule. That's controlled by `app.json`.
I am uncomfortable running this at an arbitrary time during the day after we first deploy it. As it runs directly in the production database, even with very thorough testing there is the risk that if something went wrong we could suffer material irretrievable data loss since the last backup. Instead, I suggest we initially set the schedule of this job to yearly, as that isn't run by the dokku cron process configured in app.json. So it won't run automatically.
We can test how the script behaves in production by taking a manual backup then running this script either via dokku or directly by the runjob management command to manually generate a sanitised backup. Once we're comfortable, we can set a schedule and Sentry monitor, but only once we're ready to run it on a schedule instead of dump_db.
What happens if we need a raw unsanitised copy of prod data? There are valid reasons why we might want that locally, temporarily, including debugging this script or other (potentially urgent) production issues. Maybe we should keep both scripts but dump_db is only run on demand?
Documentation
I think the script would benefit significantly from in-source documentation such as comments and docstrings, and perhaps more explanation in the git history. This script represents a significant amount of work and is implementing some complicated behaviour. So, significant commentary is probably needed for this to be easily understandable and therefore maintainable.
At one point you had 'section header' comments in the execute function like "Ensure output directory exists" that I think were a really useful starting point for immediately understanding the structure of the main execute function code blocks. I think the private functions are fairly well named and marked up with type hints, but I think what some of them are doing could be clearer if the code was restructured to be less compact, the more complicated logic merits some detailed commentary, and most of the private functions would benefit from a docstring.
Other points on the script
Have you thought about how we can best defend against accidental SQL injection? I wonder if we should be using something like quote_ident to wrap the expressions rather than doing it 'manually' in many F-strings. Is the level of escaping sufficient? Is it easy to get it right 'manually' if it needs to change? What do you think?
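To illustrate what quote_ident defends against: the escaping rule for PostgreSQL identifiers is to wrap the name in double quotes and double any embedded double quotes. This toy helper only demonstrates the rule; in real code the psycopg `sql` module (`sql.Identifier`) should do this, not hand-rolled string handling.

```python
# Illustration of the identifier-escaping rule applied by PostgreSQL's
# quote_ident and psycopg's sql.Identifier. Do NOT hand-roll this in
# production code; prefer the sql module, which also handles edge cases.

def quote_identifier(name: str) -> str:
    """Quote a SQL identifier, neutralising embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'
```

A malicious or unusual column name is then treated as one opaque identifier rather than becoming part of the statement's structure.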
I am not sure I understand why the script doesn't terminate if there is no allowlist, and this behaviour is not documented that I can see. I am also not sure we need _run_pg_dump_schema_only at all; I don't think it is part of the requirements. Maybe you used it while developing or testing? If you agree it's not necessary, can we simplify by removing it? Or is it still useful?
Further on _fake_expression, I don't think it needs to handle arbitrary schemas, or data types we don't currently exclude, and some of those replacements are hard to reason about, so could you reduce it down to just what we need at the moment as a starting point?
Use of git
I found it quite hard to follow the git commits and their messages. In general, please ensure your commits and commit messages follow the advice at https://cbea.ms/git-commit/; keep your commits atomic, separate each functional change, and don't mix those with restructures or refactors. In this case, as you are adding a new script, it may be easier to leave the commits as they are while developing, then rebase before merging and edit down to a single commit that adds the new script, with a high-level explanation in the commit message and most of the commentary in the script. In future we won't need the detailed history of your development process, but it can be useful during review.
Introduce a per-table/column allowlist used by the dump_db job to build sanitized database dumps for local development. The allowlist excludes sensitive fields such as emails, phone numbers, and tokens while preserving schema and non-sensitive data.
- remove sensitive data from getting copied
- created an allowlist.json which provides all the columns that are safe to copy
Making them null so that the code doesn't break
Replaced it with json instead
- add fake data for non-allowlisted columns
- remove debug and print statements
- change script to run daily instead of hourly, as this database copy is used by developers for local development
- improve readability
- updated script to make it more readable
- changed it to DailyJob instead of HourlyJob
Added a description so that someone who encounters it accidentally knows what the schema is for
Added information about:
- sanitised dump
- temporary schema
- allow list
Put back the original hourly script and renamed the daily script so that we can test that the new script works end to end while the hourly script keeps running as is. Will replace the hourly script with the daily one once testing is done.
End-to-end testing led to the discovery that, after restoring the sanitised backup, the UI was not rendered. There were two issues:

1. The sanitised dump job copies data into `temp_scrubbed_schema` and dumps only that schema, so `DATABASE_URL` needs the suffix `?options=-csearch_path=temp_scrubbed_schema,public` so that the search path makes the app automatically read from the scrubbed data.
2. `django_session` needed `session_key` and `expire_date` in the allow list, because the job treats an empty list as "delete all rows". Earlier, the table was listed with an empty array, so the sanitiser deleted every session entry, and after restoring the dump Django couldn't log anyone in because the table had neither rows nor allowed columns. By allow-listing `session_key` and `expire_date` (while still redacting `session_data`), the session table remains usable: Django can create sessions using its normal schema, but sensitive payloads are still blanked out.
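For reference, the `DATABASE_URL` suffix for the search-path fix looks like this (a sketch; the credentials, host, and database name are placeholders, not real values):

```
DATABASE_URL=postgres://user:password@localhost:5432/jobserver?options=-csearch_path=temp_scrubbed_schema,public
```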
After the initial round of review, I got some suggestions to improve the _fake_expression function:

1. Removed the dependency on id, as not every table may contain an id field; replaced it with a randomly generated number instead.
2. Instead of returning NULL when an unhandled data type is encountered, throw an error so that we can either handle it or allow-list that column.
3. Only handling basic data types for now.
4. Fixed the column name trace_context in allowlist.json.
Addressing a comment on the PR that we would want to terminate the script if the allowlist is not present. Removing _run_pg_dump_schema_only as we would not want to dump the schema only when the allowlist is not present.
Drop any leftover `temp_scrubbed_schema` before starting a dump, and ensure `self._drop_temp_schema()` runs even if `_create_safe_schema_and_copy` or `pg_dump` fail. This prevents orphaned scratch schemas and avoids duplicate-key errors on reruns after a KeyboardInterrupt. Wrap the entire dump in `NamedTemporaryFile(..., delete=True)`. We now write the dump into a temp file, copy only after everything succeeds, and let the context manager delete the temp automatically.
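The cleanup pattern this commit describes can be sketched as follows. The names are hypothetical: simple callbacks stand in for the pg_dump invocation and the `DROP SCHEMA` step so that the shape of the control flow is visible.

```python
# Sketch only: write_dump and drop_temp_schema stand in for the real
# pg_dump call and the DROP SCHEMA statement.
import shutil
import tempfile


def dump_with_cleanup(write_dump, drop_temp_schema, final_path):
    """Write the dump to a temp file, promote it only on success,
    and always drop the scratch schema, even on failure."""
    try:
        with tempfile.NamedTemporaryFile(delete=True) as tmp:
            write_dump(tmp.name)  # may raise; temp file is auto-deleted either way
            shutil.copy(tmp.name, final_path)  # only reached on success
    finally:
        # Runs on success, failure, or KeyboardInterrupt, preventing
        # orphaned scratch schemas and duplicate-key errors on rerun.
        drop_temp_schema()
```

The `finally` block is what guarantees no leftover `temp_scrubbed_schema`, and the temp file means a partial dump never overwrites a good one.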
1. Added some basic unit tests to make sure the script works as expected.
2. Updated the error message in the script when the allowlist.json file is not found.
Initially I queried and added is_nullable in the column metadata, as I was planning to consider it while copying data, but decided to ignore this property when creating fake values.
Added more unit tests to cover functions that were not covered before. Fixed a test of get_column_metadata: I removed is_nullable from the script but forgot to remove it from the test.
1. Removed some tables from the allowlist that were added when I created the allowlist using the production db as reference. After testing, I realised these were legacy tables that are now replaced by jobserver.User.
2. Created a new file for integration tests and added tests to verify that every allow-listed table and column exists in the database. Added another test to verify that non-allow-listed data is faked when copying tables.
Updated the _fake_expression function to append ROW_NUMBER() instead of a randomly generated number. The reason is that with a randomly generated number it is possible to generate the same fake value twice and hit a unique constraint. ROW_NUMBER() assigns a sequential number to each row, so each generated fake value will be unique.
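As a sketch, the resulting copy statement then looks roughly like this (the table and column names are illustrative, not the script's actual output):

```sql
INSERT INTO temp_scrubbed_schema.jobserver_user (id, username, email)
SELECT
    id,
    username,
    -- ROW_NUMBER() guarantees a distinct fake value per row, so a
    -- unique constraint on the scrubbed column cannot be violated
    'fake_jobserver_user_email_' || ROW_NUMBER() OVER ()
FROM public.jobserver_user;
```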
Noticed that the execution schedule is controlled by app.json; @monitor only adjusts the Sentry monitor. Initially setting the schedule of this job to yearly, as that isn't run by the dokku cron process configured in app.json, so it won't run automatically. We can manually run it and verify that the sanitised dump works this way. Thanks to @mikerkelly for this suggestion. Changes covered in this commit:
- Deleted the existing daily dump_sanitised_db.py script and allowlist.json and replaced them with a new yearly version.
- Updated the job class to inherit from YearlyJob instead of DailyJob.
- Adjusted the cron schedule for the job to run annually.
- Modified integration and unit tests to reference the new yearly job.
- Fixed 2 unit tests for _fake_expression that were still expecting random values rather than ROW_NUMBER.
- Refactored SQL execution to use psycopg's sql module for safer queries, in order to prevent SQL injection.
- Updated the test test_create_safe_table_builds_sql to support this change.
Added a detailed document for the script explaining what it does, its algorithm, how the allow list is created, and other notes and decisions taken.
- Updated _build_select_expressions to throw an error if the allowlist references a column that is missing from the db. Added this change so that the script fails if there is a change in the db but not in the allowlist.
- Added a unit test to cover this change.
- Added an FAQ section in the document to explain this change and other related questions.
- The restore-db recipe now checks whether the restored dump contains temp_scrubbed_schema. If it does, it drops public and renames the scrubbed schema to public. If not, meaning it is an unsanitised dump, it leaves public untouched so the restore behaves exactly as it used to. This way `restore-db jobserver.dump` still restores the raw dump, and `restore-db sanitised_jobserver.dump` handles the schema rename automatically. Developers can keep using the old flow until we roll out the sanitised dump.
- Removed the changes to DATABASE_URL in dotenv-sample, as with the justfile changes we no longer need to manually point to the temp schema.
Removed the mention of the sanitised dump script from the developers doc because we are not rolling this script out to everyone yet. Once it is merged, we will test it manually first and, once we are satisfied, change it from a yearly to a daily dump and add its documentation to the developers doc.
- Added a script to get a list of columns that are not present in the allowlist.
- Added all_columns.txt and excluded_columns.txt to .gitignore so that they are not committed.
Changed the location of the script from jobserver/jobs/yearly/ to scripts/, as otherwise the script is treated as another Job module. Moving it into the scripts/ folder is better as this is just a helper script.
- Improved the docstring for the script and added a markdown document referenced from the docstring.
- Added manual testing steps to the markdown doc and added context for clarity.
- Add information about the get_excluded_columns script
- Add a flowchart diagram to explain the process
mikerkelly left a comment
Thanks a lot, Nishtha. This is much improved. The overview documentation is excellent, explaining the "why" as well as the "what". The script to get the denied columns is very helpful.
There is a fair amount of new content since my first review, sorry it has taken a while to look at. There are a number of mostly minor inline comments, and I think we need to consider the points below. We are getting there!
Future work
We have discussed here and in Slack some possible future work. Do you have a list of those issues, will you create them?
SQL safety
I am concerned that the script may still be mixing safe and unsafe string interpolation and construction for SQL statements. I think that it is important to get this right as the script may be used on arbitrary data in future, depending how the database schema evolves.
From reading the psycopg2 documentation, I think it is risky to construct SQL via f-string formatting, str concatenation, % formatting, or by composing any strings generated in these ways. It is just too hard to get the quoting and escaping right and consider all the edge cases. I would much rather fully follow the approach intended by that module and offload all the validation and safety to it, rather than partially roll our own validation with _valid_ident and mix strings validated that way with the other approach; it is hard to reason about whether that has been applied sufficiently and correctly in all instances to produce safe statements. Here is the approach described:
> Usually you should express the template of your query as an `SQL` instance with `{}`-style placeholders and use `format()` to merge the variable parts into them, all of which must be `Composable` subclasses.
I think _get_column_metadata, _build_select_expressions, _fake_expression, and test_create_safe_schema_scrubs_disallowed_columns all interpolate unsafely or include unsafe fragments. I think _create_safe_schema_and_copy and _drop_temp_schema use the right approach. I am not sure we need _valid_ident or _normalize_table_name if we construct everything through the sql module.
Requirements / correctness
The script executes each line of SQL as it constructs it. I think this risks the copied tables being inconsistent with related tables if they are modified by another process between lines being written. So should we do all our changes inside a transaction and/or take a write lock so nothing can vary while we create the dump?
The script can be triggered manually, so I think we should at least think about what happens if it is run twice at the same time and both edit the non-public schema. I think that would not have a good result. We could add a lockfile or other mechanism so that can not happen, but maybe the approach from the previous paragraph is sufficient (database locking or transactions)?
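One possible shape for preventing concurrent runs, as a sketch: take a Postgres session-level advisory lock keyed on the job, so a second simultaneous run fails fast instead of both editing the same scratch schema. The lock key and function names here are illustrative, and real code should compose SQL via psycopg's sql module rather than f-strings.

```python
# Sketch: serialise runs of the dump job with a Postgres advisory lock.
# DUMP_LOCK_KEY is an arbitrary stable integer assumed unique to this job.
DUMP_LOCK_KEY = 5437


def acquire_lock_sql(key: int = DUMP_LOCK_KEY) -> str:
    """SQL returning true if we got the lock, false if another run holds it."""
    return f"SELECT pg_try_advisory_lock({key});"


def release_lock_sql(key: int = DUMP_LOCK_KEY) -> str:
    """SQL releasing the lock once the dump has finished (or failed)."""
    return f"SELECT pg_advisory_unlock({key});"
```

The caller would execute the acquire statement first, abort with a clear error if it returns false, and release the lock in a finally block.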
I wondered if it is worth recording row counts for every table in the source and destination schemas and comparing them, to check that at least those are consistent. What do you think?
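A sketch of that consistency check (a hypothetical helper; the real script would populate the dicts from a `COUNT(*)` per table in each schema):

```python
# Sketch only: compares per-table row counts between the source (public)
# and destination (scrubbed) schemas and reports any mismatches.

def compare_row_counts(source: dict, dest: dict) -> list:
    """Return human-readable descriptions of tables whose counts differ."""
    problems = []
    for table, n_src in sorted(source.items()):
        n_dst = dest.get(table)
        if n_dst != n_src:
            problems.append(f"{table}: source={n_src}, dest={n_dst}")
    return problems
```

An empty result means every table copied the expected number of rows; anything else could fail the job before the dump is written.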
Performance/observability
I have not looked at performance, have you? Other than noting that it runs in some 10s of seconds locally. I think that is enough until we have evidence it is a problem. For observability, can we add an otel span around the whole script and show that on some suitable dashboard or a new one to make it easy to keep an eye on?
There is no logging at all at the moment. Can you please use structlog to record at least start/end of the main function and key params? Logging relevant details in the event of error would also be useful.
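A sketch of the start/end logging pattern being asked for. The codebase would use structlog; stdlib logging is shown here only to keep the example self-contained, and all names are illustrative.

```python
# Sketch: log the start and end of the main function plus key params,
# and log failures with context. structlog's bound loggers would carry
# output_path on every event; stdlib logging shown for self-containment.
import logging

logger = logging.getLogger("dump_sanitised_db")


def execute(output_path: str) -> None:
    logger.info("sanitised dump starting: output_path=%s", output_path)
    try:
        pass  # ... create scratch schema, copy tables, run pg_dump ...
    except Exception:
        logger.exception("sanitised dump failed")
        raise
    logger.info("sanitised dump finished")
```

On failure, `logger.exception` records the traceback, so a Sentry or log search has enough context without rerunning the job.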
Testing
safety: I think each function that can produce SQL statements should have a test with a bad/malicious expression, testing that the result is properly escaped SQL. I find it hard to be confident we are doing the right thing without that.
observation, no action required: A fuller integration test seems tricky; I am not sure if we would create a new database just in that one test, or still do a lot of database and file access patching. Just testing _create_safe_schema_and_copy with one database row and a short allowlist, as you have done, seems like a reasonable compromise.
optional: It would be good to have at least one test of the main execute function, with some appropriate level of patching/mocking. That code is currently untested, I think. Not essential though, manual testing can compensate, and I suppose any future development would want to do manual testing also.
```diff
-docker compose run -v "$path:/tmp/jobserver.dump" --entrypoint bash --rm dev -c "pg_restore --clean --if-exists --no-acl --no-owner -d \$DATABASE_URL /tmp/jobserver.dump"
+docker compose run -v "$path:/tmp/jobserver.dump" --entrypoint bash --rm dev -c '
+    set -eux
+    pg_restore --clean --if-exists --no-acl --no-owner -d "$DATABASE_URL" /tmp/jobserver.dump
```
minor, documentation: Could use a comment here explaining that if restoring the sanitised dump we need to alter the scratch schema name to public, and a reference to the overview documentation.
```
just devenv
```
> 4. In another terminal, download and restore the latest production database dump to a separate database. We need to do this so that the sanitised db script can use this production like db when creating a sanitised dump. In real case scenario, we would not need to do this as we will be pointing the db on dokku server.
minor, wording:
```diff
-4. In another terminal, download and restore the latest production database dump to a separate database. We need to do this so that the sanitised db script can use this production like db when creating a sanitised dump. In real case scenario, we would not need to do this as we will be pointing the db on dokku server.
+4. In another terminal, download and restore the latest production database dump to a separate database so we can run the script against it without changing the local development `jobserver` database. In production, the script runs directly against the production `jobserver` database.
```
> 6. Build assets

```
just assets-rebuild
```
Question: is this step necessary?
```
just manage runjob dump_sanitised_db
```

> 8. Restore this sanitised db
minor, wording:
```diff
-8. Restore this sanitised db
+8. Restore the sanitised database dump into your local `jobserver` database, deleting all existing data.
```
> ## Manually testing the script
>
> We can manually test this script by following these steps:
minor: can we warn this will overwrite all their local data and suggest taking a backup before if required?
```python
def _create_safe_table(self, cur, src_table, dst_table) -> None:
    """Create a destination table identical to the source."""
    create_sql = sql.SQL(
        "CREATE TABLE IF NOT EXISTS {dst} (LIKE {src} INCLUDING ALL);"
    ).format(dst=dst_table, src=src_table)
    try:
        cur.execute(create_sql)
    except Exception as exc:
        raise RuntimeError(f"Failed to create table {dst_table}: {exc}")
```
naming: why is this "safe"?
documentation: I think it is clearer to say it is "an empty destination table with the same structure as the src".
SQL safety: I think this is only safe if the parameters are sql.Identifier. At the moment they are, but that would be easy to get wrong in future. At least we should provide a type hint and/or mention it in the docstring. I think it would be safer to check types and raise ValueError if not as required. I am also tempted to say we could maybe not have this function at all as it is only used once, I am not sure the indirection adds that much value.
```python
from jobserver.jobs.yearly import dump_sanitised_db


@pytest.mark.django_db
```
taste, optional: I am not sure why this is an integration test, could you explain? I think it might live happily in its own file in the root of tests.
```python
with connection.cursor() as cur:
    cur.execute(
        """
        SELECT id, username, email, password
        FROM "temp_scrubbed_schema"."jobserver_user"
        WHERE id = %s
        """,
        [user.id],
    )
    row = cur.fetchone()
assert row is not None
assert row[0] == user.id
assert row[1] == user.username
assert row[2].startswith("fake_jobserver_user_email_")
assert row[3].startswith("fake_jobserver_user_password_")
```
This is brittle to the detail of the faking. Can we assert that row[2] and row[3] are not equal to the original values, instead?
```python
from jobserver.jobs.yearly import dump_sanitised_db


class StubCursor:
```
optional: The sql parameter names in these classes shadow the imported sql. I think it is prudent to use a different name like query.
```python
    return dump_sanitised_db.Job()


def test_load_allowlist_success(tmp_path, job):
```
optional, performance: I think your allow list tests are hitting the real file system, which is slow and unnecessary. Can you mock the file access instead?
Issue: Encountered a duplicate key value error when trying to add a member to a project or run a test job.

Reason: The sanitised job copies data into temp_scrubbed_schema but never advanced the per-table sequences, so every *_id_seq remained at 1. When we restored sanitised_jobserver.dump, postgres recreated the tables with real rows but left the sequences at 1, meaning the very next insert reused id=1 and failed with `duplicate key value violates …_pkey`.

Fix: Added a setval() step after each table copy so the scratch schema's sequence is bumped to MAX(id) (only for tables with an integer id). Because pg_dump captures that setval, the restored dump now carries correct sequence values just like the raw production dump.

Testing: After this fix, I was able to test the following actions successfully:
- Editing a user's full name
- Adding a user to an organisation
- Giving the user access to a backend
- Adding a member to a project
- Reaching the run jobs page with the ability to run a job
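The setval step described in this commit can be sketched as follows. The helper name is hypothetical, it assumes Django's default `<table>_id_seq` sequence naming and an integer id column, and real code should compose statements via psycopg's sql module rather than f-strings.

```python
# Sketch only: builds the sequence-advancing statement described above.

def bump_sequence_sql(schema: str, table: str) -> str:
    """SQL advancing <table>_id_seq to MAX(id), so pg_dump captures the
    sequence position alongside the copied rows (COALESCE handles empty
    tables by falling back to 1)."""
    return (
        f"SELECT setval('{schema}.{table}_id_seq', "
        f"COALESCE((SELECT MAX(id) FROM {schema}.{table}), 1));"
    )
```

Because the statement runs inside the scratch schema before pg_dump, the restored database's next insert allocates MAX(id)+1 instead of reusing id=1.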
Description
Issue - https://github.com/opensafely-core/security/issues/33
Tested this locally by changing the `DATABASE_URL` in `.env` to point to my local db instead, which had a copy of a dump from production.

Manual testing on new setup:

Make sure the `OUTPUT_PATH` mentioned in the script is available on your system. I temporarily changed the path to `./sanitised_jobserver.dump`.