Add documentation for hash partitioning and the related migration ste…

…ps (#5302) Co-authored-by: Simon Dumas <[email protected]>
BlueBrain · Mar 5, 2025 · 134382b · 134382b
1 parent d5a0a05
commit 134382b
Show file tree

Hide file tree

Showing 7 changed files with 226 additions and 45 deletions.
diff --git a/delta/app/src/main/resources/app.conf b/delta/app/src/main/resources/app.conf
@@ -28,8 +28,13 @@ app {
     # Access to database for streaming access (indexing / SSEs)
     streaming = ${app.defaults.database.access}
 
+    # Partition strategy for the scoped_events and scoped_states tables
     partition-strategy {
+        # List: a partition is created for every project
         type = list
+        # Uncomment to use a hash strategy where the number of partitions will be fixed.
+        #type = hash
+        # modulo = 5
     }
 
     name     = ${app.defaults.database.name}

diff --git a/delta/sourcing-psql/src/main/resources/scripts/postgres/init/V1_12_M01__ship_drop_tables.ddl b/delta/sourcing-psql/src/main/resources/scripts/postgres/init/V1_12_M01__ship_drop_tables.ddl
diff --git a/docs/src/main/paradox/docs/releases/index.md b/docs/src/main/paradox/docs/releases/index.md
@@ -50,6 +50,7 @@ Please test this new version carefully and address any issues on the client side
 
 ### New features / enhancements
 
+- Support for hash partitioning in PostgreSQL
 - Full support of JSON-LD 1.1 via [Apache Jena 5.x](https://jena.apache.org/) and [Titanium JSON-LD](https://github.com/filip26/titanium-json-ld)
 - The default indexing in Elasticsearch has been rewritten to push data in a single Elasticsearch index in order to reduce shard consumption.
 - Projects, resolvers, schemas, storages and views are not indexed anymore to prepare further changes on how they are handled

diff --git a/docs/src/main/paradox/docs/releases/v1.11-to-v1.12-migration.md b/docs/src/main/paradox/docs/releases/v1.11-to-v1.12-migration.md
@@ -26,18 +26,142 @@ and [following the Elastic recommendations about sharding](https://www.elastic.c
 * You can optionally allocate more resources to Delta, PostgreSQL, Blazegraph and Elasticsearch to accommodate for the load related to the indexing
 * You may also want to reindex in Elasticsearch first and then in Blazegraph if the load is too high
 * Run the following SQL script to delete the former default elasticsearch views and their progress:
+
 ```sql
 DELETE FROM scoped_events WHERE type = 'elasticsearch' AND id = 'https://bluebrain.github.io/nexus/vocabulary/defaultElasticSearchIndex';
 DELETE FROM scoped_states WHERE type = 'elasticsearch' AND id = 'https://bluebrain.github.io/nexus/vocabulary/defaultElasticSearchIndex';
 DELETE FROM projection_offsets WHERE module = 'elasticsearch' AND resource_id = 'https://bluebrain.github.io/nexus/vocabulary/defaultElasticSearchIndex';
 DELETE FROM projection_offsets WHERE name = 'event-metrics';
 ```
 * Run the following SQL script to delete the progress of Blazegraph indexing:
+
 ```sql
 DELETE FROM projection_offsets WHERE module = 'blazegraph';
 ```
 * Start Nexus Delta 
 * Monitor the load of the different components and the indexing process by running the following query
+
 ```sql
 SELECT * FROM projection_offsets ORDER BY updated_at DESC;
 ```
+
+# Adopt hash partitioning
+
+* Make a backup of your PostgreSQL instance
+* Make sure to accommodate your deployment so that it can accomodate twice your data during the migration
+* Scale down Delta
+* Run the [following script](https://github.com/BlueBrain/nexus/blob/$git.branch$/delta/sourcing-psql/src/main/resources/scripts/postgres/init/common/V1_12_M02__partition_config.ddl) to init the partition_config table
+* Run the following script to rename the partitioning tables and their indices:
+
+```sql
+-- Renaming scoped_events table and its indices
+ALTER TABLE scoped_events RENAME TO scoped_events_list;
+ALTER INDEX scoped_events_type_idx RENAME TO scoped_events_type_idx_list ;
+ALTER INDEX scoped_events_ordering_idx RENAME TO scoped_events_ordering_idx_list ;
+
+-- Renaming scoped_states table and its indices
+ALTER TABLE scoped_states RENAME TO scoped_states_list;
+ALTER INDEX scoped_states_type_idx RENAME TO scoped_states_type_idx_list ;
+ALTER INDEX scoped_states_ordering_idx RENAME TO scoped_states_ordering_idx_list ;
+ALTER INDEX project_uuid_idx RENAME TO project_uuid_idx_list ;
+ALTER INDEX state_value_types_idx RENAME TO state_value_types_idx_list ;
+```
+* Run the following script to create the new tables:
+
+```sql
+-- scoped_events
+CREATE TABLE IF NOT EXISTS public.scoped_events(
+    ordering bigint       NOT NULL DEFAULT nextval('event_offset'),
+    type     text         NOT NULL,
+    org      text         NOT NULL,
+    project  text         NOT NULL,
+    id       text         NOT NULL,
+    rev      integer      NOT NULL,
+    value    JSONB        NOT NULL,
+    instant  timestamptz  NOT NULL,
+    PRIMARY KEY(org, project, id, rev)
+) PARTITION BY HASH (org, project);
+
+CREATE INDEX IF NOT EXISTS scoped_events_type_idx ON public.scoped_events(type);
+CREATE INDEX IF NOT EXISTS scoped_events_ordering_idx ON public.scoped_events (ordering);
+
+CREATE TABLE public.scoped_states_0000 PARTITION OF public.scoped_states FOR VALUES WITH (MODULUS 2, REMAINDER 0);
+CREATE TABLE public.scoped_states_0001 PARTITION OF public.scoped_states FOR VALUES WITH (MODULUS 2, REMAINDER 1);
+
+-- scoped_states
+CREATE TABLE IF NOT EXISTS public.scoped_states(
+    ordering   bigint       NOT NULL DEFAULT nextval('state_offset'),
+    type       text         NOT NULL,
+    org        text         NOT NULL,
+    project    text         NOT NULL,
+    id         text         NOT NULL,
+    tag        text         NOT NULL,
+    rev        integer      NOT NULL,
+    value      JSONB        NOT NULL,
+    deprecated boolean      NOT NULL,
+    instant    timestamptz  NOT NULL,
+    PRIMARY KEY(org, project, tag, id)
+) PARTITION BY HASH (org, project);
+
+CREATE INDEX IF NOT EXISTS scoped_states_type_idx ON public.scoped_states(type);
+CREATE INDEX IF NOT EXISTS scoped_states_ordering_idx ON public.scoped_states (ordering);
+CREATE INDEX IF NOT EXISTS project_uuid_idx ON public.scoped_states((value->>'uuid')) WHERE type = 'project';
+CREATE INDEX IF NOT EXISTS state_value_types_idx ON public.scoped_states USING GIN ((value->'types'));
+```
+* Start Nexus with a configuration with the hash partition strategy without making it available to users.
+
+The modulo value should remain low(at most a few hundreds) and should be set to fit your number of projects and resources
+
+```hocon
+app {
+  database {
+    partition-strategy {
+     type = hash
+     # Adapt the modulo value to the number of partitions you expect
+     modulo = 5
+    }
+  }
+}  
+```
+* Stop Nexus again once this log appears:
+
+```
+2025-03-05 14:05:25 INFO  c.e.b.n.d.s.p.DatabasePartitioner - The partition strategy has not been set yet, initializing...
+```
+* The following queries should return the different partitions
+
+```sql
+SELECT table_name 
+FROM information_schema.tables WHERE table_name LIKE 'scoped_events_0%';
+```
+
+```sql
+SELECT table_name 
+FROM information_schema.tables WHERE table_name LIKE 'scoped_states_0%';
+```
+* Running a query against the table partition_config should also return a row matching your configuration:
+
+```sql
+SELECT * FROM partition_config;
+```
+* Run the following scripts to copy data to the new tables
+
+```sql
+INSERT INTO scoped_events (SELECT * FROM scoped_events_list);
+INSERT INTO scoped_states (SELECT * FROM scoped_states_list);
+```
+* Run a count against the old tables and the new ones to check they have the same number of rows
+
+```sql
+SELECT count(*) from scoped_events_list;
+SELECT count(*) from scoped_events;
+SELECT count(*) from scoped_states_list;
+SELECT count(*) from scoped_states;
+```
+* Delete the previous tables:
+
+```sql
+DROP TABLE IF EXISTS public.scoped_events_list;
+DROP TABLE IF EXISTS public.scoped_states_list;
+```
+* Start Delta again
diff --git a/docs/src/main/paradox/docs/releases/v1.12-release-notes.md b/docs/src/main/paradox/docs/releases/v1.12-release-notes.md
@@ -19,4 +19,14 @@ This change allow Nexus to provide a full support of JSON-LD 1.1 including featu
 
 ### Default indexing
 
-The default Elasticsearch view now points to a single index to avoid the oversharding problems when a Nexus instance contains a lot of projects.
+The default Elasticsearch view now points to a single index to avoid the oversharding problems when a Nexus instance contains a lot of projects.
+
+### Support of hash partitioning in PostgreSQL
+
+Since 1.8 and the introduction of PostgreSQL as the primary store, Nexus only permitted list partitioning where each project was assigned
+to a PostgreSQL partition.
+
+While this approach stands for deployments where there are few projects which can be quite large, it is not adapted to the scenario where there are
+a lot of small projects. The new hash strategy allows to address this case.
+
+@ref:[See database config for more details](../running-nexus/configuration/index.md#postgres-configuration)
diff --git a/docs/src/main/paradox/docs/running-nexus/configuration/index.md b/docs/src/main/paradox/docs/running-nexus/configuration/index.md
@@ -22,6 +22,7 @@ The configuration flag `akka.http.server.parsing.max-content-length` can be used
 
 ## Postgres configuration
 
+### Database pool
 @link:[The `database` section](https://github.com/BlueBrain/nexus/blob/$git.branch$/delta/app/src/main/resources/app.conf#L23){ open=new } of the configuration defines the postgres specific configuration. As Nexus Delta uses three separate pools ('read', 'write', 'streaming'), it is recommended to set the host, port, database name, username, and password via the `app.defaults.database` field, as it will apply to all pools. It is however possible to accommodate more advanced setups by configuring each pool separately by changing its respective `app.database.{read|write|streaming}` fields. 
 
 The pool size can be set using the `app.defaults.database.access.pool-size` setting for all pools, or individually for each pool (`app.database.{read|write|streaming}.access.pool-size`).
@@ -32,8 +33,22 @@ A default Postgres deployment will limit the number of connections to 100, unles
 
 @@@
 
+### Partitioning
+
+Nexus supports currently two types of partitioning list and hash.
+
+This can be set via the  `app.database.partition-strategy`.
+
+Note that it is an essential setting which can not be changed automatically once it is set, so configure it carefully according to your needs.
+
+More information about partitioning is available @ref:[here](../postgresql.md).
+
+### Init scripts
+
 Before running Nexus Delta, the @link:[init scripts](https://github.com/BlueBrain/nexus/tree/$git.branch$/delta/sourcing-psql/src/main/resources/scripts/postgres/init){ open=new } should be run in the lexicographical order.
 
+The scripts in the subdirectory matching your partitioning strategy should be run first, followed by those in the common folder.
+
 It is possible to let Nexus Delta automatically create them using the following configuration parameters: `app.database.tables-autocreate=true`.
 
 @@@ note { .warning }

diff --git a/docs/src/main/paradox/docs/running-nexus/postgresql.md b/docs/src/main/paradox/docs/running-nexus/postgresql.md
@@ -7,6 +7,76 @@ Since this is the _primary store_ it is the most important system to be
 that Nexus uses in other stores can be recomputed from the one stored in PostgreSQL as the other stores are used as
 mere indexing systems.
 
+### PostgreSQL partitioning
+
+Nexus Delta takes advantage of PostgreSQL's @link:[Table Partitioning](https://www.postgresql.org/docs/current/ddl-partitioning.html) feature. This allows for improved query performance, and facilitates loading, deleting, or transferring data.
+
+It currently supports 2 partition strategies offered by PostgreSQL:
+
+* List partitioning
+* Hash partitioning
+
+@@@ note { .warning }
+
+List partitioning is the default strategy.
+
+Please be careful when configuring partitioning as Delta does not allow out-of-the box to adopt another one.
+@@@
+
+#### List partitioning
+
+**This strategy is best when you have a small number of projects which can contain a lot of resources.**
+
+The `public.scoped_events` and `public.scoped_states` are partitioned by organization, which is itself partitioned by the projects it contains; this follows the natural hierarchy that can be found in Nexus Delta.
+
+Nexus Delta takes care of handling the creation and deletion of the partitions.
+
+* If the created project is the first one of a given organization, both the organization partition and the project subpartition will be created.
+* If the organization partition already exist, then only the project subpartition will be created upon project creation.
+
+The naming scheme of the (sub)partitions is as follows:
+
+`{table_name}_{MD5_org_hash}` for organization partitions
+
+`{table_name}_{MD5_project_hash}` for project partition
+
+where
+
+* `{table_name}` is either `scoped_events` or `scoped_states`
+* `{MD5_org_hash}` is the MD5 hash of the organization name
+* `{MD5_project_hash}` is the MD5 hash of the project reference (i.e. has the form `{org_name}/{project_name}`)
+
+MD5 hashing is used in order to guarantee a constant partition name length (PostgreSQL table names are limited to 63 character by default), as well as to avoid any special characters that might be allowed in project names but not in PostgreSQL table names (such as `-`).
+
+Example:
+
+You create the organization called `myorg`, inside of which you create the `myproject` project. When the project is created, Nexus Delta will have created the following partitions:
+
+* `scoped_events_B665280652D01C4679777AFD9861170C`, the partition of events from the `myorg` organization
+  * `scoped_events_7922DA7049D5E38C83053EE145B27596`, the subpartition of the events from the `myorg/myproject` project
+* `scoped_states_B665280652D01C4679777AFD9861170C`, the partition of states from the `myorg` organization
+  * `scoped_states_7922DA7049D5E38C83053EE145B27596`, the subpartition of the states from the `myorg/myproject` project
+
+#### Hash partitioning
+
+Since 1.12, Nexus allows hash partitioning. This strategy is best when you have a lot of projects containing a small amount of resources.
+
+The `public.scoped_events` and `public.scoped_states` are partitioned with a fixed number of partitions where all events and states from a given project 
+will end up in the same partition.
+
+Nexus takes care of creating the different partitions the first time it starts with this configuration.
+
+The naming scheme of the partitions is as follows:
+
+- `{table_name}_{(%04d)}` (ex: `scoped_events_0001`)
+
+#### Advanced subpartitioning
+
+While Nexus Delta provides table partitioning out-of-the-box, it is primarily addressing the case where the data is more or less uniformly spread out across multiple projects. If however there is one or more project that are very large,
+it is possible to add further subpartitions according to a custom rule. This custom subpartitioning must be decided on a case-by-case basis using your knowledge of the given project; the idea is to create uniform partitions of your project.
+
+Please refer to the @link:[PostgreSQL Table Partitioning documentation](https://www.postgresql.org/docs/current/ddl-partitioning.html).
+
 ## Tables
 
 ### Initializing the schema
@@ -90,47 +160,6 @@ Stores the last_update and last_ordering properties for the different projects.
 
 This table allows to power the passivation strategy for the different views.
 
-### PostgreSQL partitioning
-
-Nexus Delta takes advantage of PostgreSQL's @link:[Table Partitioning](https://www.postgresql.org/docs/current/ddl-partitioning.html) feature. This allows for improved query performance, and facilitates loading, deleting, or transferring data.
-
-The `public.scoped_events` and `public.scoped_states` are partitioned by organization, which is itself partitioned by the projects it contains; this follows the natural hierarchy that can be found in Nexus Delta.
-
-Nexus Delta takes care of handling the creation and deletion of the partitions.
-
-* If the created project is the first one of a given organization, both the organization partition and the project subpartition will be created.
-* If the organization partition already exist, then only the project subpartition will be created upon project creation.
-
-The naming scheme of the (sub)partitions is as follows:
-
-`{table_name}_{MD5_org_hash}` for organization partitions
-
-`{table_name}_{MD5_project_hash}` for project partition
-
-where
-
-* `{table_name}` is either `scoped_events` or `scoped_states`
-* `{MD5_org_hash}` is the MD5 hash of the organization name
-* `{MD5_project_hash}` is the MD5 hash of the project reference (i.e. has the form `{org_name}/{project_name}`)
-
-MD5 hashing is used in order to guarantee a constant partition name length (PostgreSQL table names are limited to 63 character by default), as well as to avoid any special characters that might be allowed in project names but not in PostgreSQL table names (such as `-`).
-
-Example:
-
-You create the organization called `myorg`, inside of which you create the `myproject` project. When the project is created, Nexus Delta will have created the following partitions:
-
-* `scoped_events_B665280652D01C4679777AFD9861170C`, the partition of events from the `myorg` organization
-    * `scoped_events_7922DA7049D5E38C83053EE145B27596`, the subpartition of the events from the `myorg/myproject` project
-* `scoped_states_B665280652D01C4679777AFD9861170C`, the partition of states from the `myorg` organization
-    * `scoped_states_7922DA7049D5E38C83053EE145B27596`, the subpartition of the states from the `myorg/myproject` project
-
-#### Advanced subpartitioning
-
-While Nexus Delta provides table partitioning out-of-the-box, it is primarily addressing the case where the data is more or less uniformly spread out across multiple projects. If however there is one or more project that are very large,
-it is possible to add further subpartitions according to a custom rule. This custom subpartitioning must be decided on a case-by-case basis using your knowledge of the given project; the idea is to create uniform partitions of your project.
-
-Please refer to the @link:[PostgreSQL Table Partitioning documentation](https://www.postgresql.org/docs/current/ddl-partitioning.html).
-
 ## Running and monitoring
 
 **CPU:**