
✨Collaboration long polling fallback #517

Closed

wants to merge 10 commits into main from feature/collab-long-polling
Conversation

AntoLC
Collaborator

@AntoLC AntoLC commented Dec 18, 2024

Purpose

Some users have WebSockets blocked, so they cannot collaborate.
If they are connected at the same time as other collaborators, this creates constant conflicts in the document.

Proposal

We have managed to provide an experience almost as good as with WebSockets.

  • We use an HTTP fallback when the WebSocket is not able to connect (see the client-side sketch after this list).
  • We still use the Hocus Pocus mechanism, so push and pull are triggered by the Hocus Pocus provider and server.
  • By using the Hocus Pocus mechanism, we still rely on y-protocols/sync, which keeps our requests very light (a few bytes).
  • We use SSE (https://developer.mozilla.org/en-US/docs/Web/API/Server-sent_events/Using_server-sent_events)
    to pull data, to minimize the number of requests and keep the documents in sync with each other as much as possible.
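A minimal client-side sketch of this fallback, assuming hypothetical endpoint paths and a plain base64-encoded Yjs update as payload; the real implementation goes through the Hocus Pocus provider and the y-protocols/sync messages rather than raw document updates:

```ts
import * as Y from 'yjs';

// Hypothetical helpers; the PR moves a similar toBase64 into "doc-management".
const toBase64 = (data: Uint8Array) => btoa(String.fromCharCode(...data));
const fromBase64 = (data: string) =>
  Uint8Array.from(atob(data), (c) => c.charCodeAt(0));

export function startHttpFallback(doc: Y.Doc, room: string) {
  // Push: POST every local update to the collaboration server.
  doc.on('update', (update: Uint8Array, origin: unknown) => {
    if (origin === 'sse-fallback') return; // do not echo remote updates back
    void fetch(`/collaboration/api/${room}/message/`, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ message: toBase64(update) }),
    });
  });

  // Pull: an EventSource (SSE) receives updates dispatched by the server.
  const source = new EventSource(`/collaboration/api/${room}/sse/`);
  source.onmessage = (event) => {
    const { message } = JSON.parse(event.data) as { message: string };
    Y.applyUpdate(doc, fromBase64(message), 'sse-fallback');
  };

  return () => source.close();
}
```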

Cases we solved:

  • connect users together even when WebSockets are blocked
  • keep rights (can edit / can view) by using the same mechanism as with the WS
  • keep the awareness (cursor), sync and doc updates
  • keep our requests light
  • add an nginx auth cache system - the backend is queried only once every 30 seconds
  • test what I could

Architecture

flowchart TD
    title1[WebSocket Success]-->Client1(Client)<--->|WebSocket Success|WS1(WebSocket) --> Nginx1(Nginx) <--> Auth1("Auth Sub Request (Django)") --->|With the proper rights|YServer1("Hocus Pocus Server")
  YServer1 --> WS1
  YServer1 <--> clients(Dispatch to clients)
  title2[WebSocket Fails - Push data]-->Client2(Client)---|WebSocket fails|HTTP2(HTTP) --> Nginx2(Nginx) <--> Auth2("Auth Sub Request (Django)")--->|With the proper rights|Express2(Express) --> YServer2("Hocus Pocus Server") --> clients(Dispatch to clients)
  title3[WebSocket Fails - Pull data]-->Client3(Client)<--->|WebSocket fails|SSE(SSE) --> Nginx3(Nginx) <--> Auth3("Auth Sub Request (Django)") --->|With the proper rights|Express3(Express) --> YServer3("Listen Hocus Pocus Server")
  YServer3("Listen Hocus Pocus Server") --> SSE
  YServer3("Listen Hocus Pocus Server") <--> clients(Data from clients)
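On the server side, the two fallback branches of the flowchart roughly translate to the following hypothetical Express sketch (paths, payload shape and the in-memory document store are assumptions; in the PR the Express layer sits behind the nginx auth sub-request and talks to the Hocus Pocus server instead of holding documents itself):

```ts
import express, { Request, Response } from 'express';
import * as Y from 'yjs';

const app = express();
app.use(express.json());

const docs = new Map<string, Y.Doc>();               // one shared doc per room
const listeners = new Map<string, Set<Response>>();  // open SSE connections

const getDoc = (room: string): Y.Doc => {
  if (!docs.has(room)) docs.set(room, new Y.Doc());
  return docs.get(room)!;
};

// Push endpoint: apply the client update, then fan it out to SSE listeners.
app.post('/collaboration/api/:room/message/', (req: Request, res: Response) => {
  const { room } = req.params;
  const update = Buffer.from(req.body.message as string, 'base64');
  Y.applyUpdate(getDoc(room), update);
  for (const client of listeners.get(room) ?? []) {
    client.write(`data: ${JSON.stringify({ message: req.body.message })}\n\n`);
  }
  res.status(200).json({ ok: true });
});

// Pull endpoint: keep the response open and stream Server-Sent Events.
app.get('/collaboration/api/:room/sse/', (req: Request, res: Response) => {
  res.writeHead(200, {
    'Content-Type': 'text/event-stream',
    'Cache-Control': 'no-cache',
    Connection: 'keep-alive',
  });
  const { room } = req.params;
  if (!listeners.has(room)) listeners.set(room, new Set());
  listeners.get(room)!.add(res);
  req.on('close', () => listeners.get(room)!.delete(res));
});

app.listen(4444);
```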

@AntoLC AntoLC self-assigned this Dec 18, 2024
@AntoLC AntoLC changed the title ✨Collab long polling ✨Collaboration long polling fallback Dec 18, 2024
@AntoLC AntoLC force-pushed the feature/collab-long-polling branch 3 times, most recently from 1360973 to 55238a7 Compare December 23, 2024 16:18
@AntoLC AntoLC changed the base branch from main to refacto/collaboration-process December 23, 2024 16:19
@AntoLC AntoLC mentioned this pull request Dec 23, 2024
4 tasks
Base automatically changed from refacto/collaboration-process to main December 24, 2024 11:29
@AntoLC AntoLC force-pushed the feature/collab-long-polling branch 4 times, most recently from 137c5b1 to ff343ca Compare December 24, 2024 15:21
@AntoLC AntoLC marked this pull request as ready for review December 24, 2024 15:21
@AntoLC AntoLC requested a review from YousefED December 24, 2024 15:25
@AntoLC AntoLC force-pushed the feature/collab-long-polling branch from ff343ca to d95e892 Compare December 24, 2024 15:32
@virgile-dev
Collaborator

Great job @AntoLC !

Collaborator

@YousefED YousefED left a comment


@AntoLC Nice you got this working. If I see it correctly, you created a new endpoint over which you're always syncing the entire Y.Doc to and from the server.

If I'm not mistaken, normally the Y.js sync protocol is more efficient than this and syncs the exact updates required. What's the reason you went for this approach (new endpoint, syncing entire doc) instead of the proxy approach? I think the proxy approach has some potential advantages:

  • We can keep the same sync protocol, but just switch to a different transport (more efficient and awareness would still work)
  • The HocusPocus side can stay the same; our "fix" would be isolated to a separate layer (less code complexity and a smaller chance of bugs or security issues)

I might be missing some advantages of your current approach, but my concern is mainly that it adds more "custom code" that's another surface we need to test, maintain and secure. The proxy approach would isolate / limit this more, I think.
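To make the suggestion concrete, here is a rough, hypothetical sketch of "same sync protocol, different transport": the standard y-protocols/sync messages are produced as usual but carried over a POST request instead of a WebSocket frame (the endpoint and the server behaviour are assumptions, not what this PR implements):

```ts
import * as Y from 'yjs';
import * as syncProtocol from 'y-protocols/sync';
import * as encoding from 'lib0/encoding';
import * as decoding from 'lib0/decoding';

export async function syncOnceOverHttp(doc: Y.Doc, room: string) {
  // Sync step 1: describe what we already have (only a few bytes).
  const encoder = encoding.createEncoder();
  syncProtocol.writeSyncStep1(encoder, doc);

  const response = await fetch(`/collaboration/api/${room}/sync/`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/octet-stream' },
    body: encoding.toUint8Array(encoder),
  });

  // The server answers with sync step 2: only the updates we are missing.
  const reply = new Uint8Array(await response.arrayBuffer());
  const decoder = decoding.createDecoder(reply);
  const replyEncoder = encoding.createEncoder();
  syncProtocol.readSyncMessage(decoder, replyEncoder, doc, 'http-fallback');
}
```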

@AntoLC AntoLC force-pushed the feature/collab-long-polling branch 4 times, most recently from b24b01c to 3eb9f69 Compare January 21, 2025 14:37
@AntoLC AntoLC force-pushed the feature/collab-long-polling branch 4 times, most recently from b8ff4ad to c64f1f2 Compare February 14, 2025 15:58
@AntoLC AntoLC force-pushed the feature/collab-long-polling branch from c64f1f2 to d26da26 Compare February 14, 2025 16:08
@AntoLC
Collaborator Author

AntoLC commented Feb 14, 2025

You can test this PR before it is merged on https://docs-ia.beta.numerique.gouv.fr/.
To deactivate the WebSocket, add the query param withoutWS=true.

Example public doc: https://docs-ia.beta.numerique.gouv.fr/docs/481a9933-3514-4aeb-9877-c21be1388877/?withoutWS=true
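A minimal sketch of how the client can read that flag, assuming it is taken straight from the current URL (the actual wiring into the provider may differ):

```ts
// Read the query parameter that disables the WebSocket transport,
// so the HTTP/SSE fallback can be exercised during testing.
export function isWebSocketDisabled(): boolean {
  const params = new URLSearchParams(window.location.search);
  return params.get('withoutWS') === 'true';
}
```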

@AntoLC AntoLC force-pushed the feature/collab-long-polling branch 4 times, most recently from c096f35 to 56f9a00 Compare February 14, 2025 19:45
@AntoLC AntoLC requested review from lunika and YousefED February 14, 2025 19:46
@@ -34,6 +41,10 @@ server {
}

location /collaboration-auth {
proxy_cache auth_cache;
proxy_cache_key "$http_authorization";
Member


Maybe add something more specific to avoid sharing the same cache key later with another location.

Suggested change
proxy_cache_key "$http_authorization";
proxy_cache_key "$http_authorization$request_uri";

Collaborator Author


Yes, you're totally right.

@@ -82,7 +82,11 @@ ingressCollaborationWS:
## @param ingressCollaborationWS.annotations.nginx.ingress.kubernetes.io/proxy-send-timeout
## @param ingressCollaborationWS.annotations.nginx.ingress.kubernetes.io/upstream-hash-by
annotations:
nginx.ingress.kubernetes.io/auth-cache-key: "$http_authorization"
Member


Same here? "$http_authorization$request_uri";

@AntoLC AntoLC force-pushed the feature/collab-long-polling branch from 56f9a00 to 8a27a29 Compare February 20, 2025 10:18
The environment was missing in the Sentry configuration.
This commit adds the environment to the Sentry configuration.

We can now interact with the collaboration server using HTTP requests.
It will be used as a fallback when the WebSocket is not working.
Two kinds of requests:
 - to send messages to the server we use POST requests
 - to get messages from the server we use a GET request using SSE (Server-Sent Events)

We will need toBase64 in different features, better to move it to "doc-management".

Create the CollaborationProvider class.
This class inherits from the HocuspocusProvider class.
It integrates a fallback mechanism to handle the case where the user cannot connect with WebSockets.
It uses POST requests to send data to the collaboration server and an EventSource to receive data from the collaboration server.

We adapt the nginx configuration to work with HTTP requests on the collaboration routes.
Requests are light but quite network intensive, so we add a cache system above "collaboration-auth".
It means the backend will be called only once every 30 seconds after a 200 response.

Firefox with websocket
Other without

Documentation to describe the collaboration architecture in the project.
@AntoLC AntoLC force-pushed the feature/collab-long-polling branch 2 times, most recently from 4730321 to f716c49 Compare February 20, 2025 10:21
} as MessageEvent);
}

if (updatedDoc64) {
Collaborator


Why do we have our own logic to apply the message? Isn't it enough (and easier) to let the onMessage function handle this for all cases? Or is there a reason that doesn't work?


Zooming out, I'm a little concerned by all the manual Yjs operations you have to do. I was hoping you could just use the existing sync protocol (and the code that handles it), but only over a different transport layer. My guess is you ran into issues with this and came up with some workarounds? That does make the code a little more difficult to review (especially when I don't have the context of why which workarounds are necessary).

/**
* Sync the document with the server.
*
* In some rare cases, the document may be out of sync.
Collaborator


Similar to above, could you also call the existing HP forceSync and let the Yjs sync protocol handle this? Passing Yjs documents and updates around seems a little dangerous to me (at least it's difficult for me to verify whether this is correct or not).

* Sent to the server the message to
* be sent to the other users
*/
public async onPollOutgoingMessage({ message }: onOutgoingMessageParameters) {
Collaborator


I got confused at some parts when checking the code; I think it might be helpful to explain or improve the naming a little bit.

Technically this method is not "polling" anymore, right? When you're polling, you retrieve an update from somewhere, but here we're just sending a message, like a regular POST request.


Also, you're not using Long-Polling at all, because long-polling means (afaik) that you send a request to the server, which keeps the connection open until the server has something interesting to send to you. I don't think this is the case, as you're using SSE for events from server -> client.

Which is fine, but maybe better to avoid the term long-polling then, to avoid confusion down the line.

@AntoLC
Collaborator Author

AntoLC commented Apr 10, 2025

We will close it for now as it didn't work for the target users.

@AntoLC AntoLC closed this Apr 10, 2025