Epic: Implement GitHub integration sync #1368

# Problem

We need to reliably and concurrently sync tasks and comments to and from GitHub in a non-blocking way.

By providing a timestamp-based concurrency control system we can use a known algorithm to make our GitHub integration more robust.

More importantly, we will be able to unblock our other objectives. We cannot proceed with onboarding projects or volunteers unless GitHub sync is stable, since our overall strategy depends on us connecting volunteers to tasks.

## Tasks

### In scope

- [ ] Clean up and decouple existing modules. The goal is to flatten them out as much as possible, to make it easier to introduce a queue system
- [ ] Add `TaskSyncOperation` model
- [ ] Add `CommentSyncOperation` model
- [ ] Create a `TaskSyncOperation` when issue webhook is received
- [ ] Create a `TaskSyncOperation` when pull request webhook is received
- [ ] Create a `TaskSyncOperation` when the task is created/updated from the client
- [ ] Create a `CommentSyncOperation` when issue comment webhook is received
- [ ] Create a `CommentSyncOperation` when the comment is created/updated from the client
- [ ] Consider timestamps from GitHub to be the latest - i.e. don’t be pessimistic (due to second-level granularity) https://platform.github.community/t/timestamp-granularity/4663
- [ ] Define proposal for the queuing system
- [ ] Add an admin dashboard for the operations

### Out of scope

- [ ] Add back pressure for rate limits
- [ ] Respond to 304 not modified for both GET and PATCH
- [ ] find_or_create vs create_or_update (we should probably change to find_or_create) → XLinker
- [ ] Add fetch step after receiving the webhook
- [ ] Provide queue feedback to the user for the task
- [ ] Provide queue feedback to the user for the comment
- [ ] Figure out if users are only seeing what they’re allowed to see (primary concern is installations)
- [ ] Double-check timestamp when processing
- [ ] Figure out if an atomic step system is feasible, where we would not need operations and instead have each record update be something that’s ok to be executed individually.
- [ ] Think about breaking apart sync steps into their own “operations” vs Ecto.Multi transactions

## Outline

We would have a sync operation for each type of internal record we want to sync. For example:

- `TaskSyncOperation`
- `CommentSyncOperation`

Every sync operation record, regardless of type, would have a:

- `direction` - `:inbound | :outbound`
- `github_app_installation_id` - the `id` of the app installation for this sync
- `github_updated_at` - the last updated at timestamp for the resource on GitHub
- `canceled_by` - the `id` of the `SyncOperation` that canceled this one
- `duplicate_of` - the `id` of the `SyncOperation` that this is a duplicate of
- `dropped_for` - the `id` of the `SyncOperation` that this was dropped in favor of
- `state`
  - `:queued` - waiting to be processed
  - `:processing` - currently being processed; limited to one per instance of the synced record, e.g. `comment_id`
  - `:completed` - successfully synced
  - `:errored` - should be paired with a reason for the error
  - `:canceled` - another operation supersedes this one, so we should not process it
  - `:dropped` - this operation was outdated and was dropped
  - `:duplicate` - another operation already existed that matched the timestamp for this one
  - `:disabled` - we received the operation but cannot sync it because the repo no longer syncs to a project
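
As a rough sketch of the shared fields above, the states, the directions, and the reference field that each "superseded" state pairs with could be captured in a small plain-Elixir module. The module name `SyncOperationState` and its function names are hypothetical; only the atoms and field names come from the list above.

```elixir
defmodule SyncOperationState do
  # Hypothetical helper module; only the atoms and field names come from the outline.
  @states [:queued, :processing, :completed, :errored, :canceled, :dropped, :duplicate, :disabled]
  @directions [:inbound, :outbound]

  def states, do: @states
  def directions, do: @directions

  def valid_state?(state), do: state in @states
  def valid_direction?(direction), do: direction in @directions

  # Field that records which other operation caused this state.
  # Returns nil for states that do not point at another operation.
  def reference_field(:canceled), do: :canceled_by
  def reference_field(:dropped), do: :dropped_for
  def reference_field(:duplicate), do: :duplicate_of
  def reference_field(state) when state in @states, do: nil
end
```

Keeping the state-to-field pairing in one place would make it harder to, say, mark an operation `:dropped` while forgetting to fill in `dropped_for`.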

Then each type would have type-specific fields, e.g. a `CommentSyncOperation` would have:

- `comment_id` - the `id` of our `comment` record
- `github_comment_id` - the `id` of our cached record for the external resource
- `github_comment_external_id` - the `id` of the resource from the external provider (GitHub)
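
Putting the shared and type-specific fields together, a `CommentSyncOperation` might look roughly like the struct below. This is a plain `defstruct` for illustration rather than an Ecto schema. The `id` and `error_reason` fields and the `:queued` default are assumptions that are not spelled out in the outline.

```elixir
defmodule CommentSyncOperation do
  # Plain struct for illustration; an Ecto schema would declare the same fields.
  defstruct [
    # Assumed primary key, referenced by canceled_by / duplicate_of / dropped_for.
    :id,
    # Shared fields from the outline.
    :direction,
    :github_app_installation_id,
    :github_updated_at,
    :canceled_by,
    :duplicate_of,
    :dropped_for,
    # Assumed name for the reason that accompanies the :errored state.
    :error_reason,
    # Comment-specific fields from the outline.
    :comment_id,
    :github_comment_id,
    :github_comment_external_id,
    # Assumed default; the outline does not say which state a new operation starts in.
    state: :queued
  ]
end
```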

If the event is due to the resource being created, there will not be a conflict. If the resource was created from our own clients, then there is no external GitHub ID yet. The same is true of events coming in from external providers, where there is no internal record yet. I'm not yet clear as to whether we should conduct any conflict checking on these event types, but my guess is no. These operations should likely jump straight to `:processing`.
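
If we do go with that guess, the shortcut for creation events could be sketched as below. The module `SyncOperationIntake`, its function names, and the `:check_conflicts` atom are hypothetical. The sketch also assumes that a `nil` id is how we would detect a missing counterpart record.

```elixir
defmodule SyncOperationIntake do
  # Hypothetical helper: decides whether an operation needs conflict checks.
  # Assumes a nil id means the counterpart record does not exist yet.

  # Created from our own client: there is no external GitHub ID yet.
  def creation?(%{direction: :outbound, github_comment_external_id: nil}), do: true
  # Created on GitHub: there is no internal record yet.
  def creation?(%{direction: :inbound, comment_id: nil}), do: true
  def creation?(_operation), do: false

  # Creation events skip the conflict checks and go straight to :processing.
  def initial_step(operation) do
    if creation?(operation), do: :processing, else: :check_conflicts
  end
end
```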

When an event comes in from GitHub we should (using a `github_comment` as our example):

- delegate to the proper sync operation table for the particular resource (in our example this would be `comment_sync_operations`)
- check if there are _any_ operations for the `github_comment_external_id` where:
  - the `github_updated_at` is _after_ our operation's last updated timestamp (limit 1)
    - if yes, set state to `:dropped` and stop processing, set `dropped_for` to the `id` of the operation in the `limit 1` query
  - the `github_updated_at` timestamp for the relevant `record_` is _equal to_ our operation's last updated timestamp (limit 1)
    - if yes, set state to `:duplicate` and stop processing, set `duplicate_of` to the `id` of the operation in the `limit 1`
  - the `modified_at` timestamp for the relevant `record_` is _after_ our operation's last updated timestamp
    - if yes, set state to `:dropped` and stop processing, set `dropped_for` to the `id` of the operation in the `limit 1` query
- check if there are any _:queued_ operations for the `integration_external_id` where:
  - `github_updated_at` is _before_ our operation's last updated timestamp
    - if yes, set state of those operations to `:canceled` and set `canceled_by` to the `id` of this event
- check if there is any other `:queued` operation or `:processing` operation for the `integration_external_id`
  - if yes, set state to `:queued`
- when `:processing`, check again to see if we can proceed, then create or update the `comment` through the relationship on the record for `comment_id`
- when `:completed`, kick off process to look for next `:queued` item where the `github_updated_at` timestamp is the oldest
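
To make the ordering of the operation-level checks concrete, here is a minimal sketch over an in-memory list of prior operations for one `github_comment_external_id`. The module `InboundConflictCheck` and the shape of its return value are hypothetical, and a real version would run these as `limit 1` queries rather than over a list. It assumes every `github_updated_at` is a `DateTime`, and it leaves out the check against the record's own timestamp.

```elixir
defmodule InboundConflictCheck do
  # Hypothetical module. `existing` holds prior operations, in any state,
  # for the same github_comment_external_id.

  # Returns {state, changes_for_incoming, operations_to_cancel}.
  def resolve(incoming, existing) do
    ts = incoming.github_updated_at
    newer = Enum.find(existing, &(DateTime.compare(&1.github_updated_at, ts) == :gt))
    same = Enum.find(existing, &(DateTime.compare(&1.github_updated_at, ts) == :eq))

    cond do
      # A newer operation already exists, so the incoming one is outdated.
      newer ->
        {:dropped, %{dropped_for: newer.id}, []}

      # An operation with the same timestamp already exists.
      same ->
        {:duplicate, %{duplicate_of: same.id}, []}

      true ->
        # Every existing operation is older at this point, so all queued
        # ones are superseded by the incoming operation.
        to_cancel =
          for op <- existing, op.state == :queued do
            Map.merge(op, %{state: :canceled, canceled_by: incoming.id})
          end

        # With the queued ones canceled, only a :processing operation
        # can force the incoming one to wait.
        busy? = Enum.any?(existing, &(&1.state == :processing))
        state = if busy?, do: :queued, else: :processing
        {state, %{}, to_cancel}
    end
  end
end
```

One thing the sketch surfaces: once the older `:queued` operations have been canceled, the only thing left that can put the incoming operation into `:queued` is an operation that is already `:processing`.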

Within the logic for updating the given record, we would also need to check whether the record's updated timestamp is after the operation's timestamp. If it is, then we need to bubble up the changeset validation error and mark the operation as `:dropped` per the above.
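
A minimal sketch of that last-moment guard, using plain maps and a tagged tuple in place of a real changeset error. The module `StaleRecordGuard`, the function `update_record/3`, and the `:stale` reason are hypothetical. It assumes the record exposes `updated_at` and that both timestamps are `DateTime` values.

```elixir
defmodule StaleRecordGuard do
  # Hypothetical guard run right before the record is written.
  def update_record(record, operation, attrs) do
    case DateTime.compare(record.updated_at, operation.github_updated_at) do
      :gt ->
        # The record changed after the GitHub event was produced:
        # refuse the write and mark the operation as dropped.
        {:error, :stale, Map.put(operation, :state, :dropped)}

      _ ->
        # The record is not newer than the event, so the update is safe to apply.
        {:ok, Map.merge(record, attrs), Map.put(operation, :state, :completed)}
    end
  end
end
```

In the real implementation the `:gt` branch would presumably come from a changeset validation, with the error bubbling up to whatever code owns the operation's state.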

Some upsides of the approaches above that I wanted to document, in no particular order:

- The tracking above generates some implicit audit trails that will be helpful for debugging.
- Any unique-per-record queued operations can be run in parallel without issue, i.e. we can run operations for `%Comment{id: 1}` and `%Comment{id: 2}` without any conflict.
- We can avoid "thundering herd" problems when the system receives back pressure by having control over precisely how the queue is processed.
- We can use this in conjunction with rate limiting to only process the number of events we have in the queue for the given rate limit and defer further processing until after the rate limit has expired.
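
The last three points can be sketched together as one scheduling step: pick the oldest `:queued` operation per record, skip records that already have something `:processing`, and take no more than the rate limit allows. The module `QueueDrain` and the `budget` argument (the number of requests left in the current rate-limit window) are hypothetical. The `DateTime`-based sorting assumes Elixir 1.10 or later.

```elixir
defmodule QueueDrain do
  # Hypothetical scheduler step over an in-memory list of comment operations.
  # `budget` is the assumed number of GitHub requests we may still make.
  def next_batch(operations, budget) do
    # Records that already have an operation in flight (one per record at most).
    busy_records =
      for %{state: :processing, comment_id: id} <- operations, into: MapSet.new(), do: id

    operations
    |> Enum.filter(&(&1.state == :queued and (&1.comment_id not in busy_records)))
    |> Enum.group_by(& &1.comment_id)
    # Oldest queued operation for each record.
    |> Enum.map(fn {_id, ops} -> Enum.min_by(ops, & &1.github_updated_at, DateTime) end)
    # Oldest first across records, capped by the remaining budget.
    |> Enum.sort_by(& &1.github_updated_at, DateTime)
    |> Enum.take(budget)
  end
end
```

Everything in the returned batch belongs to a different record, so the batch can be processed in parallel. Whatever does not fit in the budget simply stays `:queued` until the next window.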
