
encodeBase64/decodeBase64 seems to be inefficient #5944

Open
sigmaSd opened this issue Sep 11, 2024 · 11 comments

sigmaSd (Contributor) commented Sep 11, 2024

These functions use atob and btoa internally, and they OOM easily with large strings.

Maybe they should instead use node:buffer: Buffer.from(i).toString("base64") and Buffer.from(i, "base64").
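
For reference, the node:buffer calls being suggested look roughly like this (a sketch; the input bytes are just a placeholder):

import { Buffer } from "node:buffer";

const bytes = new TextEncoder().encode("hello world");

// Encode: Uint8Array -> base64 string.
const b64 = Buffer.from(bytes).toString("base64"); // "aGVsbG8gd29ybGQ="

// Decode: base64 string -> bytes (Buffer is a Uint8Array subclass).
const decoded = new Uint8Array(Buffer.from(b64, "base64"));

console.log(b64, decoded);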

sigmaSd added the bug and needs triage labels Sep 11, 2024
iuioiua added the encoding label and removed the bug and needs triage labels Sep 12, 2024
iuioiua (Collaborator) commented Sep 12, 2024

We wouldn't use node:buffer in order to maintain browser compatibility. Either way, any PRs that improve the performance of these functions are welcome.

FergoTheGreat commented

Only atob is being used, not btoa. Also, it's not being used in the btoa(String.fromCharCode(...bytes)) manner that I typically associate with call stack size issues. Do you mean to say you are actually running out of heap memory? How large is "large"?

sigmaSd (Contributor, Author) commented Sep 15, 2024

import { encodeBase64 } from "jsr:@std/encoding/base64";

encodeBase64(Deno.readFileSync(Deno.execPath()));

This crashes with an out-of-memory error. If I use a Node Buffer instead, it finishes in a couple of milliseconds.

FergoTheGreat commented Sep 15, 2024

It seems that the memory exhaustion is likely coming from all of the string appending. I imagine that this causes a memory allocation with every new character that is appended. I have no idea whether this change makes it as fast as Node (probably not), but at the very least it eliminates unnecessary memory allocations.

>>> NOT TESTED <<<

// Assumes the std-internal validateBinaryLike helper that the existing
// @std/encoding/base64 implementation already imports (path assumed).
import { validateBinaryLike } from "./_validate_binary_like.ts";

const decoder = new TextDecoder();

// Char codes of the standard Base64 alphabet: A-Z, a-z, 0-9, '+', '/'.
const base64abc = Uint8Array.from([
    65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79,
    80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 97, 98, 99, 100,
    101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112,
    113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 48, 49,
    50, 51, 52, 53, 54, 55, 56, 57, 43, 47
]);

export function encodeBase64(data: ArrayBuffer | Uint8Array | string): string {
    // CREDIT: https://gist.github.com/enepomnyaschih/72c423f727d395eeaa09697058238727
    const uint8 = validateBinaryLike(data);
    const result = new Uint8Array(Math.floor((uint8.length + 2) / 3) * 4);
    let i, j = 0;
    const l = uint8.length;
    for (i = 2; i < l; i += 3) {
        result[j++] = base64abc[(uint8[i - 2] >> 2)];
        result[j++] = base64abc[
            (((uint8[i - 2] & 0x03) << 4) | ((uint8[i - 1] >> 4)))
        ];
        result[j++] = base64abc[
            (((uint8[i - 1] & 0x0f) << 2) | ((uint8[i] >> 6)))
        ];
        result[j++] = base64abc[(uint8[i] & 0x3f)];
    }
    if (i === l + 1) {
        // 1 octet yet to write
        result[j++] = base64abc[(uint8[i - 2] >> 2)];
        result[j++] = base64abc[((uint8[i - 2] & 0x03) << 4)];
        result[j++] = 61; // '='
        result[j++] = 61; // '='
    }
    if (i === l) {
        // 2 octets yet to write
        result[j++] = base64abc[(uint8[i - 2] >> 2)];
        result[j++] = base64abc[
            (((uint8[i - 2] & 0x03) << 4) | ((uint8[i - 1] >> 4)))
        ];
        result[j++] = base64abc[((uint8[i - 1] & 0x0f) << 2)];
        result[j++] = 61; // '='
    }
    return decoder.decode(result);
}
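
As a quick sanity check, the sketch above can be compared against the built-in btoa on a small input (assuming validateBinaryLike resolves as noted above):

const sample = "hello world";
console.log(encodeBase64(new TextEncoder().encode(sample))); // "aGVsbG8gd29ybGQ="
console.log(btoa(sample)); // same output expected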

sigmaSd (Contributor, Author) commented Sep 16, 2024

What about running the Node code when we're in a server context, and the current version (or an optimized one) when we detect we're in the browser?
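
A minimal sketch of that conditional idea (hypothetical wiring, not how @std/encoding is structured today; encodeBase64Js stands in for the pure-JS implementation):

import { encodeBase64 as encodeBase64Js } from "jsr:@std/encoding/base64";

// Prefer node:buffer where the runtime provides it (Deno, Node, Bun);
// fall back to the browser-compatible pure-JS path otherwise.
let encode: (bytes: Uint8Array) => string = encodeBase64Js;
try {
  const { Buffer } = await import("node:buffer");
  encode = (bytes) => Buffer.from(bytes).toString("base64");
} catch {
  // node:buffer is unavailable (e.g. a plain browser) -- keep the JS version.
}

console.log(encode(new TextEncoder().encode("hello"))); // "aGVsbG8="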

sigmaSd (Contributor, Author) commented Sep 16, 2024

Also, is it OK for std to use Wasm libraries? In that case even the browser branch could be made efficient.

FergoTheGreat commented

On large buffers, it seems like Node's Buffer.toString("base64") is about 6-7 times faster than the code I provided.
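
One way to set up such a comparison with Deno.bench, run via "deno bench" (a sketch; this measures the published @std/encoding implementation rather than the typed-array version above, and the input is kept modest so the slower path still completes):

import { encodeBase64 } from "jsr:@std/encoding/base64";
import { Buffer } from "node:buffer";

const data = new Uint8Array(4 * 1024 * 1024); // 4 MiB of zeros

Deno.bench("std encodeBase64", () => {
  encodeBase64(data);
});

Deno.bench("Buffer.toString('base64')", () => {
  Buffer.from(data).toString("base64");
});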

kt3k (Member) commented Sep 18, 2024

import { encodeBase64 } from "jsr:@std/encoding/base64";

encodeBase64(Deno.readFileSync(Deno.execPath()));

@FergoTheGreat Can you check that the above snippet passes with your change? If it passes, it would be far better than the current state.

sigmaSd (Contributor, Author) commented Sep 18, 2024

Personally, I think you should just use Wasm for this; it's the best use case for Wasm (pure math work). I imagine it will be faster than any optimized JS version. https://docs.rs/base64/latest/base64/ should work well.

Also, can you confirm that it's not possible (or not a good idea) to use Node APIs conditionally when we detect we're not in the browser?

FergoTheGreat commented

import { encodeBase64 } from "jsr:@std/encoding/base64";

encodeBase64(Deno.readFileSync(Deno.execPath()));

@FergoTheGreat Can you check that the above snippet passes with your change? If it passes, it would be far better than the current state.

With that change, the only limiting factor is that decoder.decode cannot produce a string with length larger than 2**29-24, meaning that it will fail for buffers > 402,653,166 bytes.
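
For the record, the arithmetic behind that cutoff (taking the 2**29 - 24 character cap on the decoded string as given):

// Base64 emits 4 output characters for every 3 input bytes, so for n input
// bytes: ceil(n / 3) * 4 <= 2 ** 29 - 24, i.e. n <= ((2 ** 29 - 24) / 4) * 3.
console.log(((2 ** 29 - 24) / 4) * 3); // 402653166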

kt3k (Member) commented Sep 19, 2024

Also, can you confirm that it's not possible (or not a good idea) to use Node APIs conditionally when we detect we're not in the browser?

I'm not in favor of changing the implementation completely based on the environment.

Looking at the Deno implementation, atob and Buffer.from(string, "base64") are backed by very similar ops: https://github.com/denoland/deno/blob/f460188e583f00144000aa0d8ade08218d47c3c1/ext/web/lib.rs#L131-L144
I'm not sure switching to Buffer.from would boost performance significantly.

Another point is that Base64 encoding and decoding are now planned as a language feature (it seems to already be at Stage 3): https://github.com/tc39/proposal-arraybuffer-base64
Waiting for this to be added to the CLI would also be an option.
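
For reference, the proposal adds toBase64/fromBase64 (and hex equivalents) on Uint8Array. A sketch of how they would be used once available, guarded because no stable runtime had shipped them at the time of writing (the casts are only there because TypeScript's lib does not know the methods yet):

const bytes = new TextEncoder().encode("Hello");

// deno-lint-ignore no-explicit-any
const u8 = bytes as any;
if (typeof u8.toBase64 === "function") {
  console.log(u8.toBase64()); // "SGVsbG8="
  // deno-lint-ignore no-explicit-any
  console.log((Uint8Array as any).fromBase64("SGVsbG8=")); // Uint8Array(5) [ 72, 101, 108, 108, 111 ]
}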
