A different idea for managing `schema.sql` #433

jkeifer · 2023-04-19T20:43:58Z

Hello! 👋

First off, I want to say that I have really liked using dbmate and think it is the top tool I have used in this space. I appreciate all the work that has gone into it and the fact that it has been available for the benefit of the community. So thank you. 🙏

After using it on a certain project, I did end up with an idea about a different way to manage schema.sql, which I wanted to share and get feedback on. First, let me start by outlining some pain points I've encountered with the current dbmate behavior of managing the schema.sql as an auto-generated schema-only dump:

- Extensions bloat the schema

  Sometimes this effect can be significant, and it essentially hides the difference between what a project's migrations directly manage in the database and state that is external to the project and managed by extension dependencies. Said another way, I don't want to add 2000 lines to the schema from an extension when the migrations I manage to create the database total 200 lines, only one of which is CREATE EXTENSION <some_extension>;. I would consider those 2000 schema lines outside the concern of my project, even if my project depends on them, as my project is not maintaining them.

- Time-based schema changes can occur in the absence of changes to the migrations

  Think of things like partitioned tables. If someone comes and runs dbmate up in a dev database to run some tests for unrelated application code, the schema.sql can have changes despite the fact that nothing actually changed with regard to the migrations that were run.

- "Schema data" is omitted from schema-only dumps

  A not-uncommon pattern when needing a fixed set of values to refer to in other tables is to create a table and populate it with a fixed set of rows in a migration file. Sure, enums can sometimes be used for this purpose instead, and would be part of a schema-only dump. But enums have significant limitations and downsides, and are not always adequate for all use cases. So it is not uncommon to need such a table where the data in the table is effectively part of the schema.

  Unfortunately, schema-only dumps would never contain these rows, and therefore the schema.sql in its current form is useless for auditing such "schema data."
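To make the "schema data" point concrete, here is a small self-contained sketch. It uses Python's built-in sqlite3 module purely as a stand-in for Postgres so that it runs anywhere; that substitution is an assumption of the example, and the `order_status` table and its rows are made up. The DDL stored in `sqlite_master` plays the role of a schema-only dump, and `iterdump()` plays the role of a full dump.

```python
import sqlite3

# Hypothetical migration: a lookup table whose rows are effectively part of the schema.
MIGRATION = """
CREATE TABLE order_status (name TEXT PRIMARY KEY);
INSERT INTO order_status (name) VALUES ('pending'), ('shipped'), ('delivered');
"""


def schema_only_dump(conn: sqlite3.Connection) -> str:
    """Return only the DDL, analogous to a schema-only dump."""
    rows = conn.execute(
        "SELECT sql FROM sqlite_master WHERE sql IS NOT NULL ORDER BY name"
    )
    return "\n".join(sql + ";" for (sql,) in rows)


def full_dump(conn: sqlite3.Connection) -> str:
    """Return DDL and data, analogous to a full dump."""
    return "\n".join(conn.iterdump())


conn = sqlite3.connect(":memory:")
conn.executescript(MIGRATION)

schema_dump = schema_only_dump(conn)
data_dump = full_dump(conn)

# The table definition appears in both dumps, but the fixed rows
# only appear in the full dump.
print("table in schema-only dump:", "order_status" in schema_dump)
print("rows in schema-only dump: ", "'shipped'" in schema_dump)
print("rows in full dump:        ", "'shipped'" in data_dump)
```

Anyone auditing only the schema-only output would have no way to tell which status values the migrations are supposed to guarantee.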

I think the point of schema.sql is to provide some usable reference as to what the sum total of the migrations would produce when applied together, which is quite desirable. But for the reasons above, I am finding it difficult to use schema.sql for this purpose.

To make schema.sql more usable for my team, we came to the conclusion that it needs to be a manually-managed file which is, effectively, a single migration that condenses all the current migrations into a single file. Then, when making changes, we can edit schema.sql to declare what the desired state of the database should be, and then we can create a migration that performs whatever steps need to happen to move from the previous migration state to the new desired state.[0]

At the end, with both the schema.sql changes and the new migration file in place, we can:

- create the database from the reference schema.sql
  - we also inject all migrations into the schema_migrations table to make it look like they have been applied
- dump the database (data and all, not just schema-only)
- drop the database
- create the database and apply all migrations
- dump the database (again, data and all, not just schema-only)
- drop the database
- diff the dumps

If we get no differences, then the result of all the migrations matches the desired schema. If we do have differences, then we get some feedback on what is out-of-sync between the migrations and the reference schema.
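The verification loop above can be sketched in a few lines. This is not the implementation from the gist: it is a minimal analog that swaps in Python's built-in sqlite3 and difflib modules for dbmate, Postgres, and the dump tooling, so that it is runnable without a database server. `REFERENCE_SCHEMA`, `MIGRATIONS`, `dump_after`, and `verify` are all hypothetical names.

```python
import difflib
import sqlite3

# Hypothetical hand-maintained reference schema (the desired end state).
REFERENCE_SCHEMA = """
CREATE TABLE order_status (name TEXT PRIMARY KEY);
INSERT INTO order_status (name) VALUES ('pending');
INSERT INTO order_status (name) VALUES ('shipped');
"""

# Hypothetical migrations, applied in order.
MIGRATIONS = [
    "CREATE TABLE order_status (name TEXT PRIMARY KEY);"
    " INSERT INTO order_status (name) VALUES ('pending');",
    "INSERT INTO order_status (name) VALUES ('shipped');",
]


def dump_after(scripts: list[str]) -> list[str]:
    """Build a throwaway database from the scripts and return a full dump."""
    conn = sqlite3.connect(":memory:")
    try:
        for script in scripts:
            conn.executescript(script)
        return list(conn.iterdump())  # schema and data, not schema-only
    finally:
        conn.close()  # the in-memory database is discarded ("drop")


def verify(reference_schema: str, migrations: list[str]) -> list[str]:
    """Diff a database built from the reference schema against one built from migrations."""
    from_schema = dump_after([reference_schema])
    from_migrations = dump_after(migrations)
    return list(
        difflib.unified_diff(
            from_schema,
            from_migrations,
            fromfile="schema.sql",
            tofile="migrations",
            lineterm="",
        )
    )


diff = verify(REFERENCE_SCHEMA, MIGRATIONS)
print("in sync" if not diff else "\n".join(diff))
```

An empty diff means the migrations reproduce the reference schema. In the real workflow the schema_migrations table is part of both dumps too, which is why the migration versions are injected after loading schema.sql.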

By making schema.sql this manually-managed file, we address all of the above concerns of that file being generated as a schema-only dump, and end up with a single reference point for all state our project is concerned with enforcing in the database.

We are actually using this workflow now via a hacky solution I implemented by wrapping dbmate in a bash script: https://gist.github.com/jkeifer/f75c65213c6a327229cf85ffa47e1efe. Effectively, the changes to dbmate ended up as follows:

- Automatic schema-only dumps are disabled by DBMATE_NO_DUMP_SCHEMA=true
- dbmate create is overridden to apply the reference schema.sql file and add the extant migration versions after creating the database (I suppose these actions could be split into separate commands, but for convenience with this workflow in mind I kept them together)
- A new verify command does the verification steps outlined above to diff a database from the schema with a database from the migrations
- dbmate dump is overridden to perform full dumps (as used by the verify command) and to not write to schema.sql by default
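The "adding the extant migration versions" step amounts to inserting one row per migration file into the schema_migrations table. Below is a rough sketch of generating that SQL. It assumes the default table name schema_migrations with a version column, and that a migration's version is the leading run of digits in its filename. The helper names and example filenames are hypothetical, and this is not taken from the gist.

```python
import re

# Assumption: the version is the leading digits of the migration filename.
VERSION_RE = re.compile(r"^(\d+).*\.sql$")


def migration_versions(filenames: list[str]) -> list[str]:
    """Extract sorted, de-duplicated version strings from migration filenames."""
    versions = set()
    for name in filenames:
        match = VERSION_RE.match(name)
        if match:  # skip files that do not look like migrations
            versions.add(match.group(1))
    return sorted(versions)


def seed_sql(versions: list[str], table: str = "schema_migrations") -> str:
    """Build an INSERT that marks every version as already applied."""
    if not versions:
        return ""
    # Versions are digits only (enforced by VERSION_RE), so quoting them inline is safe here.
    values = ", ".join(f"('{v}')" for v in versions)
    return f"INSERT INTO {table} (version) VALUES {values};"


files = [
    "20230419204358_create_users.sql",
    "20230421131637_add_status.sql",
    "README.md",
]
print(seed_sql(migration_versions(files)))
```

Running the generated statement right after loading schema.sql makes a later dbmate up see every existing migration as applied.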

Really, I think these changes would be rather minor to implement within dbmate. That said, I recognize they would likely require non-backwards-compatible changes. And these changes could be significant for anyone using dbmate in a way that is: 1) dependent on its current behavior around schema.sql, and 2) not running into any of the issues I mentioned. I also recognize that this idea revolves around a perhaps opinionated way of how schema.sql should be managed, and that opinion may not be shared by all.

So, after saying all that, I am looking for feedback on the above idea, both generally (do people like it?) and on whether it has a place in dbmate.

I would be willing to put together a PR if the backward-compatibility concern was not a barrier or if any way to implement this behavior in a backward-compatible way were to be offered. In the event that dbmate turns out not to be the right place for such changes, I am also considering creating a parallel project to implement a cli for this workflow on top of dbmate as the backing library.

Thanks for the consideration and feedback on this idea! 😄

[0] This is, in fact, not our idea. I have to credit @mwblakley as the source.

gregwebs · 2023-04-20T16:59:35Z

You might be interested in diff-based migration tools:

- pgadmin-schema-diff: The downside of pgadmin is that its SQL is not guaranteed to work as is due to ordering issues; these must be fixed manually.
- Sqldiff: uses the Postgres parser but then has to translate all those nodes, understand them, and compare them to what is in the metadata. It is missing support for some things; it still errors out on functions.

These support generating the migration from maintaining the schema file.
I would like to start using these. First though, I needed to get dbmate working well. The issue being that these other tools can't be used for all types of migrations, for example a data migration or specifying operational instructions for how something is created like CREATE INDEX CONCURRENTLY. So I still need dbmate.

1 reply

jkeifer Apr 21, 2023
Author

So to be clear, I am not looking to generate the migration files. I think that could be a possible optimization (when creating a new migration, start with the changes in schema.sql, then generate a new migration file containing the probable required schema changes) but it is not something I am concerned with right now.

But the tools you linked to suffer from the problems you mention: they don't cover all change types in all cases, and they have no provision for data migrations.

> So I still need dbmate.

I reach the same conclusion as you. I still need dbmate and am not proposing getting rid of it, I'd just like to manage the schema.sql as I suggest above to solve the problems I have with the automated dbmate schema-only dumps.

In any case, thanks for the links. I had seen sqldiff, but I was unaware of the work supabase did with the pgadmin diff tooling. I may look more into that, in case that could be an effective mechanism to add on to "optimize" creating new migration files when the schema.sql has been modified.

gregwebs · 2023-04-21T13:16:37Z

Are you checkpointing this process? That is, many old migrations can be rolled up into a single first migration.

2 replies

gregwebs Apr 21, 2023

That first migration would look like (an older) schema.sql

jkeifer Apr 21, 2023
Author

I also thought about that idea and how this workflow enables it. But to be honest we are just starting a new project, so we don't have many migrations to roll up. So to answer your question, no, but it does seem like a cool "feature".

gregwebs · 2023-04-21T13:23:35Z

> I am also considering creating a parallel project to implement a cli for this workflow on top of dbmate as the backing library.

The shell script does this now? Or is it lacking in some way? Or do you want a single Go binary?

1 reply

jkeifer Apr 21, 2023
Author

The shell script does this now, but it isn't ideal. It really is a bit of a hack.

Specifically, the feedback I've received about the shell script points at a variety of concerns:

- Maintenance
  - Brittle things like handling help messages on behalf of the real dbmate
- Distribution
  - Everything is just a copy of the gist
  - No great way to get updates if something changes upstream; every copy has to be modified
  - The single versioned binary dbmate provides is a great solution to this problem
- Different .env handling than dbmate
  - It's possible dbmate might resolve different connection parameters than the simple sourcing of .env that's happening in the wrapper
- User confusion
  - We're using dbmate, but it's not really dbmate
  - No single source of documentation; we want to either use dbmate or some other stand-alone tool with its own documentation (with the "parallel project" idea, this would be a Go tool backed by dbmate as a library)

On the plus side, the shell script is extensible. For example, on our project I added a custom test command to run pg_prove with the same db connection parameters loaded from .env to run our database tests. But that's not a huge deal, I will just move this to a standalone script when the wrapper is replaced.

I would say from my perspective, the ideal would be an implementation of these changes in dbmate itself. As I mentioned though, I am not sure if there would be a way to make a sane cli for this in a way that would be backwards compatible with the current dbmate behavior. This is the sole reason I would consider creating a new project to implement a cli on top of the dbmate library.

gregwebs · 2023-04-21T13:35:31Z

I like what you are doing here; I have been thinking about similar things. To add to what this enables: it becomes possible to check out any commit and migrate the database to what is required at that checkout. Checking in the schema-only file gets close, but as stated here the "schema data" is missing.

I like the pattern of using a foreign key to a small separate table as an enum. If that key is a string, one gets readability and some flexibility to be able to change the enum data with DML instead of DDL.
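A minimal sketch of that pattern, using Python's built-in sqlite3 as a stand-in database so it is runnable as-is (the table names, column names, and status values are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.executescript("""
CREATE TABLE order_status (name TEXT PRIMARY KEY);
INSERT INTO order_status (name) VALUES ('pending'), ('shipped');

CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    status TEXT NOT NULL REFERENCES order_status (name)
);
""")


def try_insert(status: str) -> bool:
    """Insert an order with the given status; report whether the database accepted it."""
    try:
        with conn:  # commit on success, roll back on error
            conn.execute("INSERT INTO orders (status) VALUES (?)", (status,))
        return True
    except sqlite3.IntegrityError:
        return False


print(try_insert("shipped"))    # 'shipped' is a known status
print(try_insert("cancelled"))  # not in order_status, so the foreign key rejects it

# Extending the "enum" is plain DML: one INSERT, no type alteration.
with conn:
    conn.execute("INSERT INTO order_status (name) VALUES ('cancelled')")
print(try_insert("cancelled"))  # accepted once the lookup row exists
```

Because the key is a readable string, rows in orders stay legible without a join, and the set of allowed values lives in data that only a full dump captures.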

0 replies

gregwebs · 2023-04-21T13:43:23Z

> Extensions bloat the schema

I am wondering if this can be solved by having a separate schema file for the extensions?

As a project grows, even without extensions the schema can get quite large, so there could be some benefit to spreading the schema across multiple files. However, that gets into issues with dependencies.

1 reply

jkeifer Apr 21, 2023
Author

I guess one point of this idea is that extensions don't bloat the schema. You have the CREATE EXTENSION <extension_name>; statement in your schema.sql and that's it.

I get that maybe some people might want to see all their extension schemas in their schema.sql, and thus this is not desirable to them. My perspective is that an extension is like an import of an external library in any other language: you don't typically have the source code from external libraries in your project repo, so why have the "source code" of your extensions in your schema?

Handling a schema split across multiple files was something I also thought about, but I got hung up on the dependency issue too. It could be interesting to work out a solution for that, but I felt that could be a later optimization and is orthogonal to the problems I wanted to solve here.

gregwebs · 2023-04-21T17:37:44Z

> dbmate dump is overridden to perform full dumps (as used by the verify command) and to not by default write to schema.sql

I think it would be pretty easy to add a command line option to not do a schema-only dump and implement it in the Postgres driver and at least return an unimplemented error elsewhere. There is already a --schema-file option.

0 replies

gregwebs · 2023-04-21T17:38:57Z

> dbmate create is overridden to applying the reference schema.sql file and adding the extant migration versions after creating the database (I suppose these actions could be split into separate commands, but for convenience with this workflow in mind I kept them together)

Sorry, I think I am not understanding what you are saying here. It seems like this is either using a dump that is not schema-only or using dbmate up.

0 replies

gregwebs · 2023-04-21T17:44:59Z

I think you can get all the functionality you need out of dbmate by adding to it in a backwards-compatible way. However, you want to have users use dbmate with a particular workflow.

We have dbmate in a directory with a .env file. In that directory we also have the schema.sql and a directory for the migrations. We drive it with a justfile which provides a nice UI to the workflows we want to use, and it makes sure things are configured the way we want it.

Not sure if that would work for you, at least in the short-term. In the long-term dbmate could more natively support your workflows, but in the short-term just adding a few flags in a backwards compatible way seems like a good first step.

0 replies
