Description
Specification
We can separate the networking into 3 layers:
- P2P application layer - e.g. kademlia, automerge, nodes domain
- RPC layer - e.g. grpc (jsonrpc was once considered, but it's too late now)
- Data Transfer layer - UTP, Wireguard, QUIC
The Data Transfer Layer is particularly special since it is the lowest part of our stack and is always fundamentally built on top of UDP. These are its characteristics:
- Must be built on top of UDP (due to hole-punching requirements)
- Capable of NAT-traversal via hole punch packets
- Capable of being proxied for traversing symmetric NAT
- End to end encrypted - based on MTLS or otherwise
- Low latency and high throughput (and ideally low power for maintaining the connection liveness)
- Compatible with Linux, Mac, Windows, iOS, Android
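The traversal requirements above can be illustrated with a small decision function. This is only a sketch under stated assumptions: the `NatKind` and `Strategy` names and the `chooseStrategy` function are hypothetical and not part of the codebase, and the rule is deliberately conservative (any symmetric NAT facing a non-public peer falls back to a relay, even though some cone/symmetric pairings can be punched in practice).

```typescript
// Hypothetical classification of each peer's network situation.
type NatKind = 'open' | 'cone' | 'symmetric';

// Hypothetical connection strategies for the data transfer layer.
type Strategy = 'direct' | 'hole-punch' | 'relay';

// Decide how the local peer should try to reach the remote peer.
function chooseStrategy(local: NatKind, remote: NatKind): Strategy {
  // A publicly reachable remote can always be contacted directly,
  // because the local NAT only needs to allow an outbound flow.
  if (remote === 'open') return 'direct';
  // Conservative assumption: symmetric NAT allocates a new mapping per
  // destination, so hole punch packets are not relied on; use a proxy relay.
  if (local === 'symmetric' || remote === 'symmetric') return 'relay';
  // Otherwise the remote is behind a cone NAT, so both sides send
  // hole punch packets, coordinated by a third party.
  return 'hole-punch';
}
```

For example, `chooseStrategy('cone', 'cone')` selects hole punching, while `chooseStrategy('cone', 'symmetric')` falls back to the relay.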
Currently we use TLS on top of the UTP protocol as the Data Transfer Layer. This is implemented with forward and reverse proxy termination points bridging TCP to the UTP protocol.
The proxy termination bridges were necessary because we use GRPC as the RPC layer and its implementation is fixed on the HTTP2 stack, with an escape hatch via a connect proxy. This is more of an implementation detail than anything else, but it does give us a fairly generic system that can turn any TCP protocol into something NAT-traversable.
The usage of MTLS also enables seamless usage of the same X.509 certificate that we already use for the rest of PK.
The underlying library being used here is `utp-native` https://github.com/mafintosh/utp-native. It is a C++ module that incorporates the utp C++ library https://github.com/bittorrent/libutp and wraps it into a NodeJS module. It works fine on Linux, Windows, and Mac. However there is no clear path for usage on Android or iOS. See mafintosh/utp-native#30. This could be done by using https://github.com/janeasystems/nodejs-mobile (but this is not as popular), or by compiling libutp natively for iOS and Android, and then wrapping it as native code for NativeScript/React Native.
`utp-native` is also OLD. It has some issues like:
- Closing UTP server takes a long time mafintosh/utp-native#40
- Poor documentation, so we had to work out how it worked at a low level by going into the source code
- Doesn't support IPv6, see IPv6 compatibility mafintosh/utp-native#15
All of this means that continuing to try to use `utp-native` might just mean flogging a dead horse.
An alternative already existed, one we had previously used inside Matrix OS: Wireguard. The reason for not using it when we first started is that there were no NodeJS libraries available for it, and we needed something quick to prototype with. Many existing P2P applications have been built on top of the UTP protocol, especially in the NodeJS ecosystem, so that's basically where we started. Even then we went on a journey trying to use a raw JS UTP library that didn't work before eventually arriving at `utp-native` and still having to adapt it in our `network` domain.
Trying to use WG will be a lot of work however, and there are many things we have to consider if it is going to work.
Wireguard has its own issues. It will of course be a C/C++ codebase as well. Originally it was made for Linux only. Now it is available inside and outside the Linux kernel. However for an application like PK, wireguard would have to be a userspace library. The great thing is that all of this is now available: https://github.com/cloudflare/boringtun. It is claimed that boringtun works on all major desktops and android/ios, and it's all userspace. It's a rust library exposing a C interface that can be wrapped as a native module in JS (just like how we use `utp-native` and `leveldown`). It is however NEW and so may have a bunch of bugs: https://github.com/cloudflare/boringtun/issues
An additional issue is that Wireguard doesn't use X.509 certificates. It would completely replace the MTLS portion of the codebase. This is fine, as we can always derive subkeys from the root key for WG utilisation. However, we would need to understand how to deal with certificate verification, given that we use a cert chain when rotating root keys. There is no chain in Wireguard, so any key rotation would end up breaking existing connections, unless one were to connect first and then verify at a higher level.
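As a rough illustration of the subkey idea, the sketch below derives a 32-byte Curve25519-style private key from root key material using HKDF from Node's built-in `crypto` module. It assumes the root key is available as raw secret bytes; the `deriveWireguardPrivateKey` name, the info label, and the all-zero default salt are hypothetical choices for illustration, not anything specified by PK or Wireguard.

```typescript
import { hkdfSync } from 'node:crypto';

// Hypothetical context label; it only needs to be fixed and unique to this use.
const WG_INFO = 'polykey-wireguard-static-key-v1';

// Derive a 32-byte X25519 private key from root key material (assumed to be
// raw secret bytes). The same root key always yields the same subkey.
function deriveWireguardPrivateKey(
  rootKey: Uint8Array,
  salt: Uint8Array = new Uint8Array(32),
): Uint8Array {
  // HKDF-SHA256 expands the root secret into 32 bytes of output.
  const key = new Uint8Array(hkdfSync('sha256', rootKey, salt, WG_INFO, 32));
  // Standard Curve25519 clamping of the scalar.
  key[0] &= 248;
  key[31] &= 127;
  key[31] |= 64;
  return key;
}
```

Because the derivation is deterministic, rotating the root key changes the derived Wireguard key as well, which is exactly the rotation problem described above: peers would still need some higher-level check to accept the new key.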
As for hole punching, it's possible that it does this automatically, but we would need to investigate its interface for hole punching to see how we would implement our hole punch relay and proxy relay mechanisms #182.
One advantage of using Wireguard is that we are already using Wireguard inside MatrixOS, and we can share expertise and knowledge/tooling between MatrixOS and Polykey. Note however that WG in MatrixOS is the in-kernel one, not a userspace one. Our work on hole punching and proxy relay could then be shared with MatrixOS, which can benefit from it as well.
Another alternative is QUIC. This is now available natively in NodeJS as an experimental feature:
- https://www.nearform.com/blog/a-quic-update-for-node-js/
- https://nodejs.org/en/blog/release/v15.0.0/#quic-32379
- https://github.com/nodejs/quic/blob/master/doc/api/quic.md
Because QUIC is so low level, it seems like a drop-in replacement for the combination of TLS + UTP. One advantage is that this drops the `utp-native` dependency requirement. However this doesn't solve how one might use QUIC on Android/iOS, which is the main reason we want to make a switch. If we are going to do a whole heap of work to make use of UTP on Android/iOS, we might as well spend that work upgrading to a more well-supported system.
One huge advantage of QUIC is that we can maintain the usage of TLS that we already use to secure GRPC client TLS #229, and it doesn't involve a different protocol. It seems TLS isn't going anywhere, and wireguard is unlikely to ever be used in general web contexts, which rely on the certificate authority system. There is also a risk that wireguard packets may be blocked by corporate firewalls, unlike QUIC, which is going to look like HTTP3 packets.
This is likely to impact the browser integration where a browser extension is acting as a client. We already have problems with using GRPC in our RPC layer so that the browser extension can use the same client protocol as our CLI and GUI, so adding in wireguard is not going to help in the case of CLI/GUI and browser extension communication unless this gets resolved: cloudflare/boringtun#139. So it does seem that choosing wireguard would bifurcate our data transfer layer between agent-to-agent and agent-to-client, which is also not nice.
Integration and migration
With the quic system functional we can begin the migration to using quic.
There are two parts to this, the server side and the client side. The client side is made up of the nodes domain with the `NodeConnection` encapsulating the `QUICClient` and `RPCClient`. This should be a reasonable drop-in for the existing systems. The server side is made up of a single `QUICServer` and an `RPCServer` with the server manifest.
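The client-side encapsulation might look roughly like the sketch below. Every name here (`StreamLike`, `QUICClientLike`, `RPCClientSketch`, `NodeConnectionSketch`) is a hypothetical stand-in rather than the real class; the only point is the shape, where the connection owns one QUIC client and hands its stream factory to the RPC client.

```typescript
// All names below are hypothetical stand-ins, not the real QUIC or RPC classes.
interface StreamLike {
  id: number;
}

interface QUICClientLike {
  newStream(): StreamLike;
  destroy(): void;
}

class RPCClientSketch {
  private readonly streamFactory: () => StreamLike;

  constructor(streamFactory: () => StreamLike) {
    this.streamFactory = streamFactory;
  }

  // Each call opens a fresh stream. A real client would write the request
  // to the stream and read the response back; here we only report which
  // stream the call was assigned to.
  call(method: string): { method: string; streamId: number } {
    const stream = this.streamFactory();
    return { method, streamId: stream.id };
  }
}

class NodeConnectionSketch {
  public readonly rpcClient: RPCClientSketch;
  private readonly quicClient: QUICClientLike;

  constructor(quicClient: QUICClientLike) {
    this.quicClient = quicClient;
    // The RPC client never sees the QUIC client, only a way to get streams.
    this.rpcClient = new RPCClientSketch(() => quicClient.newStream());
  }

  // Tearing down the connection tears down the underlying transport.
  destroy(): void {
    this.quicClient.destroy();
  }
}
```

Keeping the RPC client dependent only on a stream factory is one way to keep the RPC code transport-agnostic, so callers of the connection would be largely unaffected by the swap.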
Additional context
- https://github.com/janeasystems/nodejs-mobile-react-native - this shows how to use nodejs-mobile with react native. I'm not sure how stable working on nodejs-mobile would be, or whether the same idea would work in NativeScript.
- https://github.com/neon-bindings/neon - helping wrap Rust libraries as Nodejs native modules
- https://blog.logrocket.com/rust-and-node-js-a-match-made-in-heaven/
- https://github.com/malcolmseyd/natpunch-go
- https://www.jordanwhited.com/posts/wireguard-endpoint-discovery-nat-traversal/
- Clarify whether the sockets are UDP or TCP? atek-cloud/spork#4 - Spork and Hyperswarm also use utp-native at the bottom of the stack, however their other libraries may be useful to take inspiration from
- https://news.ycombinator.com/item?id=28884938 & https://fly.io/blog/ssh-and-user-mode-ip-wireguard/ - for user-space wireguard
- https://www.rfc-editor.org/rfc/rfc9000.html - QUIC spec
- https://www.rfc-editor.org/rfc/rfc9221.html - QUIC unreliable datagram
- A series of PRs that introduced native addons and discussed how they worked:
In any case we are probably going to need to drop down to native to make sure that we can support all platforms.
Tasks
- 1. nodes domain needs changes
  - 1. `NodeConnection` needs to be gutted and replaced with `RPCClient` and `QUICClient` usage. Besides this, usage of the `NodeConnection` is mostly the same.
  - 2. Tests need to be updated
- 2. Verification logic needs to be transplanted for use with quic.
  - 1. This needs to be tested.
- 3. Ensure that the proper connection information is provided by the streams from the quic system.
  - 1. This needs to be tested.
- 4. `PolykeyAgent` needs to be updated
  - 1. GRPC agent server needs to be replaced with a `RPCServer` and `QUICServer` combo
  - 2. `Proxy` needs to be removed.
- 5. `Agent` domain needs to be migrated to using the agnostic RPC code.
  - 1. Tests need to be migrated
- 6. Old code needs to be removed
  - 1. network domain gutted
  - 2. GRPC domain gutted
  - 3. Remove protobuf? and other package dependencies that are not used anymore.
- 7. tests!
  - 1. `network` domain tests need to be removed, any tests still needed should be transplanted.
  - 2. `grpc` domain tests need to be removed, any tests still needed should be transplanted.
- [ ] 8. Update relevant handlers with pagination
- [ ] 9. Update agent handlers to be timed cancellable, implement cancellation.