import { Tabs } from 'nextra/components'

# Bulk Importing Relationships

## Overview

When setting up a SpiceDB cluster for the first time, there's often a data ingest process required to
seed the initial set of relationships.
This can be done by calling [`WriteRelationships`](https://buf.build/authzed/api/docs/main:authzed.api.v1#authzed.api.v1.PermissionsService.WriteRelationships) in a loop, but that approach can only create 1,000 relationships (by default) per request, and each transaction creates a new revision, which incurs a bit of overhead.
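
As a rough sketch of that loop (shown here in Python; `client` and `all_relationships_to_write` are assumed to already exist, as in the example further below), it might look something like this:

```python
from itertools import batched

from authzed.api.v1 import RelationshipUpdate, WriteRelationshipsRequest

# Write in chunks that stay under the default 1,000-relationship limit.
# Each WriteRelationships call commits its own transaction and creates a new revision.
for chunk in batched(all_relationships_to_write, 1_000):
    client.WriteRelationships(
        WriteRelationshipsRequest(
            updates=[
                RelationshipUpdate(
                    operation=RelationshipUpdate.Operation.OPERATION_CREATE,
                    relationship=rel,
                )
                for rel in chunk
            ]
        )
    )
```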

For faster ingest, we provide an [`ImportBulkRelationships`](https://buf.build/authzed/api/docs/main:authzed.api.v1#authzed.api.v1.PermissionsService.ImportBulkRelationships) call, which takes advantage of client-side gRPC streaming to accelerate the process and removes the cap on the number of relationships that can be written at once.

## Batching

There are two batch sizes to consider: the number of relationships in each chunk written to the stream, and the total number of relationships sent over the lifetime of the request.
Breaking the request into chunks is a network optimization that makes it faster to push relationships from the client to the cluster.

The total number of relationships in a request should reflect how many rows your datastore can easily write in a single transaction.
Note that you probably **don't** want to push all of your relationships through in a single request, as this could cause the transaction to time out in your datastore.

## Example

We'll use the [authzed-dotnet](https://github.com/authzed/authzed-dotnet) client for this example, with an equivalent shown in the Python tab.
Other client libraries will have different syntax and structures around their streaming and iteration,
but this should demonstrate the two different levels of chunking done in the process.

<Tabs items={["Dotnet", "Python"]}>
  <Tabs.Tab>
    ```csharp
    var TOTAL_RELATIONSHIPS_TO_WRITE = 1000;
    var RELATIONSHIPS_PER_TRANSACTION = 100;
    var RELATIONSHIPS_PER_REQUEST_CHUNK = 10;

    // Start by breaking the full list into a sequence of chunks where each chunk fits easily
    // into a datastore transaction.
    var transactionChunks = allRelationshipsToWrite.Chunk(RELATIONSHIPS_PER_TRANSACTION);

    foreach (var relationshipsForRequest in transactionChunks) {
        // For each of those transaction chunks, break it down further into chunks that
        // optimize for network throughput.
        var requestChunks = relationshipsForRequest.Chunk(RELATIONSHIPS_PER_REQUEST_CHUNK);
        // Open up a client stream to the server for this transaction chunk.
        using var importCall = permissionsService.ImportBulkRelationships();
        foreach (var requestChunk in requestChunks) {
            // For each network chunk, write to the client stream.
            // NOTE: this makes the calls sequentially rather than concurrently; this could be
            // optimized further by using tasks.
            await importCall.RequestStream.WriteAsync(new ImportBulkRelationshipsRequest{
                Relationships = { requestChunk }
            });
        }
        // When we're done with the transaction chunk, complete the call and process the response.
        await importCall.RequestStream.CompleteAsync();
        var importResponse = await importCall;
        Console.WriteLine("request successful");
        Console.WriteLine(importResponse.NumLoaded);
        // Repeat!
    }
    ```
  </Tabs.Tab>
  <Tabs.Tab>
    ```python
    from itertools import batched

    TOTAL_RELATIONSHIPS_TO_WRITE = 1_000

    RELATIONSHIPS_PER_TRANSACTION = 100
    RELATIONSHIPS_PER_REQUEST_CHUNK = 10

    # NOTE: batched takes a larger iterator and makes an iterator of smaller chunks out of it.
    # We iterate over chunks of size RELATIONSHIPS_PER_TRANSACTION, and then we break each request into
    # chunks of size RELATIONSHIPS_PER_REQUEST_CHUNK.
    transaction_chunks = batched(
        all_relationships_to_write, RELATIONSHIPS_PER_TRANSACTION
    )
    for relationships_for_request in transaction_chunks:
        request_chunks = batched(relationships_for_request, RELATIONSHIPS_PER_REQUEST_CHUNK)
        response = client.ImportBulkRelationships(
            (
                ImportBulkRelationshipsRequest(relationships=relationships_chunk)
                for relationships_chunk in request_chunks
            )
        )
        print("request successful")
        print(response.num_loaded)
    ```
  </Tabs.Tab>
</Tabs>

The code for this example is [available here](https://github.com/authzed/authzed-dotnet/blob/main/examples/bulk-import/BulkImport/Program.cs).

## Retrying and Resuming

`ImportBulkRelationships`'s semantics only allow the creation of relationships.
If an imported relationship already exists in the database, the call will fail with an error.
This can be frustrating when populating an instance: if the process fails partway through with a retryable error, such as one caused
by transient network conditions, retrying the import will then error on relationships that were already written.
The [authzed-go](https://github.com/authzed/authzed-go) client offers a [`RetryableClient`](https://github.com/authzed/authzed-go/blob/main/v1/retryable_client.go)
with retry logic built into its `ImportBulkRelationships` implementation.

This client is used internally by [zed](https://github.com/authzed/zed) and is exposed by the `authzed-go` library. It works by
either skipping the offending batch (the `Skip` strategy) or falling back to `WriteRelationships` with touch
semantics (the `Touch` strategy).
Similar logic can be implemented using the other client libraries.
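
For example, with the Python client, a minimal sketch of a `Touch`-style fallback might look like the following; the `import_with_touch_fallback` helper and its parameters are hypothetical, and the fallback assumes the batch fits within the `WriteRelationships` update limit:

```python
from itertools import batched

import grpc
from authzed.api.v1 import (
    ImportBulkRelationshipsRequest,
    RelationshipUpdate,
    WriteRelationshipsRequest,
)

def import_with_touch_fallback(client, relationships, chunk_size=10):
    """Attempt a bulk import; if a relationship already exists, fall back to
    WriteRelationships with touch semantics so the batch is applied idempotently.

    `relationships` should be a materialized list, since it may be iterated twice.
    """
    try:
        client.ImportBulkRelationships(
            ImportBulkRelationshipsRequest(relationships=chunk)
            for chunk in batched(relationships, chunk_size)
        )
    except grpc.RpcError as err:
        # Only fall back when the failure is a duplicate-relationship error.
        if err.code() != grpc.StatusCode.ALREADY_EXISTS:
            raise
        # Touch upserts relationships instead of failing on duplicates.
        client.WriteRelationships(
            WriteRelationshipsRequest(
                updates=[
                    RelationshipUpdate(
                        operation=RelationshipUpdate.Operation.OPERATION_TOUCH,
                        relationship=rel,
                    )
                    for rel in relationships
                ]
            )
        )
```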

## Why does it work this way?

SpiceDB's `ImportBulkRelationships` call uses [gRPC client streaming] as a network optimization.
It **does not** commit relationships to your datastore as it receives them, but rather opens a database transaction
at the start of the call and commits that transaction when the client ends the stream.

We take this approach because there isn't a good way to handle server-side errors with commit-as-you-go:
if each chunk sent over the network were committed as it arrived, the semantics of server-side errors would be ambiguous.
For example, you might receive an error that closes the stream, but that doesn't necessarily mean
that the last chunk you sent is where the error happened.
The error source could be sent as error context, but error handling and resumption would still be difficult and cumbersome.

A [gRPC bidirectional streaming](https://grpc.io/docs/what-is-grpc/core-concepts/#bidirectional-streaming-rpc) approach could
address this by ACKing each chunk individually, but it would also require a good amount of bookkeeping on the client to ensure
that every chunk written by the client has been acknowledged by the server.
Requiring multiple client-streaming requests instead means that you can use normal language error-handling flows
and know exactly what's been written to the server.

[gRPC client streaming]: https://grpc.io/docs/what-is-grpc/core-concepts/#client-streaming-rpc