feat(path): allow utf8 chars in path #178

joelwurtz · 2024-08-29T10:16:46Z

I'm not used to working with SIMD parsers, so I did what I think is right, but I'm not sure it is right in all use cases.

Mainly, the non-RFC 3986-compliant URI parser is the same as the current one, but it allows every byte from 128 to 255, which should make it accept UTF-8: if I recall correctly, every byte of a non-ASCII UTF-8 character is greater than or equal to 128.
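To illustrate the byte-range claim above, here is a minimal sketch using only the standard library. The helper names `utf8_bytes` and `all_bytes_high` are made up for this example and are not part of httparse.

```rust
/// Returns the UTF-8 encoding of `c` as a vector of bytes.
/// (Hypothetical helper, for illustration only.)
fn utf8_bytes(c: char) -> Vec<u8> {
    let mut buf = [0u8; 4];
    c.encode_utf8(&mut buf).as_bytes().to_vec()
}

/// Returns true if every byte of the UTF-8 encoding of `c` is >= 0x80.
/// This holds for every non-ASCII character and for no ASCII character.
fn all_bytes_high(c: char) -> bool {
    utf8_bytes(c).iter().all(|&b| b >= 0x80)
}
```

For example, `utf8_bytes('é')` is `[0xC3, 0xA9]` and `utf8_bytes('€')` is `[0xE2, 0x82, 0xAC]`. The converse does not hold: a run of bytes in the 128 to 255 range is not necessarily valid UTF-8 (a lone `0xFF` never is), so accepting the byte range alone does not guarantee a valid string.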

I did not make this a Cargo feature but rather a parser option. However, I wonder whether it should be true by default, or whether it should be a feature (enabled by default?) instead.

joelwurtz · 2024-09-03T08:53:35Z

After playing with this, I think it should be a feature instead of a config option, and it should be enabled by default.

Even if it's not strictly compliant with the RFC, some browsers seem to include UTF-8 in the path of an HTTP request, and without this change parsing those requests fails. Having a feature lets someone build a strict server, but IMO the default should handle most web browsers.

I also added a new Error for the case where the UTF-8 given in the path is not valid.
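As a rough sketch of the behaviour described above (relaxed byte check first, then a UTF-8 validity check with a dedicated error), assuming a simplified byte rule and a made-up error type. `PathError`, `is_relaxed_uri_byte` and `parse_path` are hypothetical names; httparse's real byte map and error variant may differ.

```rust
use std::str;

/// Hypothetical error type for this sketch only.
#[derive(Debug, PartialEq)]
enum PathError {
    /// A byte the relaxed rule rejects (control character, space, DEL).
    InvalidByte,
    /// Bytes >= 0x80 that do not form valid UTF-8.
    InvalidUtf8,
}

/// Simplified relaxed rule (an assumption, not httparse's real table):
/// visible ASCII plus any byte >= 0x80.
fn is_relaxed_uri_byte(b: u8) -> bool {
    (b > 0x20 && b < 0x7F) || b >= 0x80
}

/// Accepts the path only if every byte passes the relaxed rule
/// and the whole slice is valid UTF-8.
fn parse_path(bytes: &[u8]) -> Result<&str, PathError> {
    if !bytes.iter().copied().all(is_relaxed_uri_byte) {
        return Err(PathError::InvalidByte);
    }
    str::from_utf8(bytes).map_err(|_| PathError::InvalidUtf8)
}
```

With this sketch, `"/café".as_bytes()` is accepted, `b"/a b"` is rejected because of the space, and `[b'/', 0xFF]` is rejected as invalid UTF-8 even though `0xFF` passes the byte rule.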

joelwurtz · 2024-09-03T09:07:15Z

src/lib.rs

@@ -963,6 +989,14 @@ pub fn parse_uri<'a>(bytes: &mut Bytes<'a>) -> Result<&'a str> {
            return Err(Error::Token);
        }

+        #[cfg(feature = "utf8_in_path")]
+        // SAFETY: all bytes up till `i` must have been `is_token` and therefore also utf-8.
+        return match str::from_utf8(unsafe { bytes.slice_skip(1) }) {


This induces an overhead in the URI benches, so I'm not sure we want it. But without this check we cannot be sure that the string is valid UTF-8 (see the related test).

| group | no_utf8 ratio | no_utf8 time | no_utf8 throughput | utf8 ratio | utf8 time | utf8 throughput |
| --- | --- | --- | --- | --- | --- | --- |
| uri/uri_0001b | 1.00 | 1.1±0.01ns | 1786.2 MB/sec | 4.58 | 4.9±0.09ns | 390.1 MB/sec |
| uri/uri_0002b | 1.00 | 1.4±0.01ns | 2.1 GB/sec | 4.45 | 6.0±0.34ns | 474.7 MB/sec |
| uri/uri_0004b | 1.00 | 1.9±0.02ns | 2.5 GB/sec | 4.36 | 8.1±0.08ns | 590.5 MB/sec |
| uri/uri_0008b | 1.00 | 1.2±0.08ns | 6.8 GB/sec | 4.98 | 6.2±0.10ns | 1390.9 MB/sec |
| uri/uri_0016b | 1.00 | 1.8±0.01ns | 9.0 GB/sec | 2.80 | 4.9±0.08ns | 3.2 GB/sec |
| uri/uri_0032b | 1.00 | 2.8±0.20ns | 11.1 GB/sec | 2.05 | 5.7±0.12ns | 5.4 GB/sec |
| uri/uri_0064b | 1.00 | 5.5±0.75ns | 11.1 GB/sec | 1.30 | 7.1±0.04ns | 8.5 GB/sec |
| uri/uri_0256b | 1.11 | 19.1±0.09ns | 12.5 GB/sec | 1.00 | 17.2±2.03ns | 14.0 GB/sec |
| uri/uri_0512b | 1.36 | 40.4±0.26ns | 11.8 GB/sec | 1.00 | 29.8±3.51ns | 16.0 GB/sec |
| uri/uri_1024b | 1.35 | 80.6±0.53ns | 11.8 GB/sec | 1.00 | 59.9±0.34ns | 15.9 GB/sec |
| uri/uri_2048b | 1.69 | 163.7±1.62ns | 11.7 GB/sec | 1.00 | 97.1±0.45ns | 19.7 GB/sec |
| uri/uri_4096b | 1.95 | 310.9±22.99ns | 12.3 GB/sec | 1.00 | 159.4±0.92ns | 23.9 GB/sec |

I also tried simdutf8, but I'm not sure we want a dependency for that. Results are better on large strings, but not significantly different on small strings:

| group | no_utf8 ratio | no_utf8 time | no_utf8 throughput | simd_utf8 ratio | simd_utf8 time | simd_utf8 throughput | utf8 ratio | utf8 time | utf8 throughput |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| uri/uri_0001b | 1.00 | 1.1±0.01ns | 1786.2 MB/sec | 5.42 | 5.8±0.07ns | 329.6 MB/sec | 4.58 | 4.9±0.09ns | 390.1 MB/sec |
| uri/uri_0002b | 1.00 | 1.4±0.01ns | 2.1 GB/sec | 4.80 | 6.5±0.02ns | 440.1 MB/sec | 4.45 | 6.0±0.34ns | 474.7 MB/sec |
| uri/uri_0004b | 1.00 | 1.9±0.02ns | 2.5 GB/sec | 4.34 | 8.0±0.44ns | 593.1 MB/sec | 4.36 | 8.1±0.08ns | 590.5 MB/sec |
| uri/uri_0008b | 1.00 | 1.2±0.08ns | 6.8 GB/sec | 5.58 | 6.9±0.06ns | 1240.3 MB/sec | 4.98 | 6.2±0.10ns | 1390.9 MB/sec |
| uri/uri_0016b | 1.00 | 1.8±0.01ns | 9.0 GB/sec | 3.47 | 6.1±0.07ns | 2.6 GB/sec | 2.80 | 4.9±0.08ns | 3.2 GB/sec |
| uri/uri_0032b | 1.00 | 2.8±0.20ns | 11.1 GB/sec | 2.48 | 6.9±0.12ns | 4.5 GB/sec | 2.05 | 5.7±0.12ns | 5.4 GB/sec |
| uri/uri_0064b | 1.00 | 5.5±0.75ns | 11.1 GB/sec | 1.23 | 6.7±0.13ns | 9.0 GB/sec | 1.30 | 7.1±0.04ns | 8.5 GB/sec |
| uri/uri_0128b | 1.21 | 10.3±0.06ns | 11.7 GB/sec | 1.00 | 8.5±1.10ns | 14.1 GB/sec | 1.19 | 10.2±0.90ns | 11.8 GB/sec |
| uri/uri_0256b | 1.72 | 19.1±0.09ns | 12.5 GB/sec | 1.00 | 11.1±0.30ns | 21.5 GB/sec | 1.54 | 17.2±2.03ns | 14.0 GB/sec |
| uri/uri_0512b | 2.02 | 40.4±0.26ns | 11.8 GB/sec | 1.00 | 20.0±0.36ns | 23.9 GB/sec | 1.49 | 29.8±3.51ns | 16.0 GB/sec |
| uri/uri_1024b | 2.48 | 80.6±0.53ns | 11.8 GB/sec | 1.00 | 32.5±1.46ns | 29.4 GB/sec | 1.84 | 59.9±0.34ns | 15.9 GB/sec |
| uri/uri_2048b | 2.70 | 163.7±1.62ns | 11.7 GB/sec | 1.00 | 60.6±1.06ns | 31.5 GB/sec | 1.60 | 97.1±0.45ns | 19.7 GB/sec |
| uri/uri_4096b | 2.77 | 310.9±22.99ns | 12.3 GB/sec | 1.00 | 112.4±1.09ns | 33.9 GB/sec | 1.42 | 159.4±0.92ns | 23.9 GB/sec |

joelwurtz · 2024-09-03T09:08:56Z

src/lib.rs

@@ -2053,7 +2087,7 @@ mod tests {
        assert_eq!(parse_chunk_size(b"567f8a\rfoo"), Err(crate::InvalidChunkSize));
        assert_eq!(parse_chunk_size(b"567f8a\rfoo"), Err(crate::InvalidChunkSize));
        assert_eq!(parse_chunk_size(b"567xf8a\r\n"), Err(crate::InvalidChunkSize));
-        assert_eq!(parse_chunk_size(b"ffffffffffffffff\r\n"), Ok(Status::Complete((18, std::u64::MAX))));
+        assert_eq!(parse_chunk_size(b"ffffffffffffffff\r\n"), Ok(Status::Complete((18, u64::MAX))));


Unrelated, and I can remove it, but without this change the tests fail when run without default features (since we don't have `std`).

seanmonstar · 2024-09-12T14:25:57Z

I asked in the referenced issue to check with a few implementers if this is desirable :)

seanmonstar

Thanks! I think we can do this, but some changes commented inline.

Also, to test all SIMD variants, we can probably just have a unit test that loops all possible bytes of URI_MAP, making a long enough vec with that byte, and then checking the match value is the same as without SIMD. I believe we have a test like that for one of the other byte maps.
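The kind of test suggested above could look roughly like the following sketch. It compares a scalar reference against a block-wise matcher for every possible byte value. Everything here is an assumption for illustration: `is_uri_byte` is a simplified stand-in for `URI_MAP`, and `match_blocks` only imitates a vectorized implementation by consuming 8 bytes at a time.

```rust
/// Simplified stand-in for the URI byte map (assumption, not the real table).
fn is_uri_byte(b: u8) -> bool {
    (b > 0x20 && b < 0x7F) || b >= 0x80
}

/// Scalar reference: number of leading bytes accepted by the map.
fn match_scalar(bytes: &[u8]) -> usize {
    bytes.iter().take_while(|&&b| is_uri_byte(b)).count()
}

/// Stand-in for a vectorized matcher: consumes full 8-byte blocks,
/// then falls back to the scalar path for the tail.
fn match_blocks(bytes: &[u8]) -> usize {
    let mut n = 0;
    for block in bytes.chunks_exact(8) {
        if block.iter().all(|&b| is_uri_byte(b)) {
            n += 8;
        } else {
            // Finish inside the first block that contains a rejected byte.
            return n + match_scalar(block);
        }
    }
    n + match_scalar(&bytes[n..])
}

/// For every byte value, builds an input long enough to exercise the
/// block path and compares both matchers.
/// Returns the first byte value on which they disagree, if any.
fn first_mismatch() -> Option<u8> {
    (0u8..=255).find(|&b| {
        let input = vec![b; 100];
        match_scalar(&input) != match_blocks(&input)
    })
}
```

A real test would call the actual SIMD entry points instead of `match_blocks`, and would pick an input length larger than the widest vector register in use.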

Cargo.toml

src/lib.rs

joelwurtz · 2024-09-17T17:10:01Z

I have applied your comments and squashed. I will add a test later that tries to iterate over all UTF-8 chars, so we are sure not to miss anything.

EDIT: I don't think we can iterate over the whole URI_MAP, since some bytes can only be one byte of a multi-byte Unicode char, so the UTF-8 check would fail for some of them. What I want to do is:

- iterate over all ASCII chars (0 to 127) and test that they match the intended behavior
- iterate over all non-ASCII UTF-8 chars, starting from 128, whether they are encoded on 2, 3 or 4 bytes, and also check that invalid UTF-8 sequences produce an error
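The plan above could be sketched as follows, using a simplified acceptance function in place of the real parser. `accepts` and its byte rule are assumptions made for this example; the actual test in the PR targets httparse's own parsing functions.

```rust
use std::str;

/// Simplified stand-in for the relaxed parser (assumption):
/// visible ASCII or bytes >= 0x80, and the whole input must be valid UTF-8.
fn accepts(bytes: &[u8]) -> bool {
    bytes.iter().all(|&b| (b > 0x20 && b < 0x7F) || b >= 0x80)
        && str::from_utf8(bytes).is_ok()
}

/// Step 1: every ASCII byte is accepted exactly when it is a visible character.
fn ascii_matches_expectation() -> bool {
    (0u8..=127).all(|b| accepts(&[b]) == b.is_ascii_graphic())
}

/// Step 2: every non-ASCII Unicode scalar value (2, 3 or 4 bytes) is accepted.
fn all_non_ascii_chars_accepted() -> bool {
    (0x80u32..=0x10FFFF)
        .filter_map(char::from_u32) // skips the surrogate range
        .all(|c| {
            let mut buf = [0u8; 4];
            accepts(c.encode_utf8(&mut buf).as_bytes())
        })
}

/// Step 3: byte sequences that are not valid UTF-8 are rejected.
fn invalid_sequences_rejected() -> bool {
    let invalid: [&[u8]; 4] = [
        &[0x80],             // lone continuation byte
        &[0xC3],             // truncated two-byte sequence
        &[0xC0, 0x80],       // overlong encoding
        &[0xED, 0xA0, 0x80], // encoded UTF-16 surrogate
    ];
    invalid.iter().all(|seq| !accepts(seq))
}
```

Iterating over `char` values rather than raw bytes avoids the problem mentioned in the EDIT: a single byte such as `0xC3` is allowed by the byte map but is not a complete character on its own.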

joelwurtz · 2024-09-18T13:54:44Z

I added a test for this. Is this what you wanted, or did I misunderstand?

joelwurtz · 2024-10-14T07:09:36Z

Hey, do you need anything more for this to be reviewed or merged?

earce-pulse · 2024-10-18T02:01:16Z

+1 looking forward to progress on this!

lovasoa · 2024-11-05T11:32:39Z

@seanmonstar, do you think this can be merged? This would really help us over at SQLPage.

earce-pulse · 2024-11-05T17:39:22Z

@seanmonstar us as well

joelwurtz force-pushed the feat/allow_non_compliant_char branch from 507fc2a to 6cb1b83 Compare August 29, 2024 11:54

This was referenced Aug 29, 2024

feat(path-and-query): allow utf8 char, not rfc 3986 compliant, in path and query hyperium/http#715

Open

actix-web returns 400 bad request for http requests emitted by many user agents actix/actix-web#3102

Open

joelwurtz force-pushed the feat/allow_non_compliant_char branch from 47380b3 to 6ff4cd6 Compare September 3, 2024 09:05

joelwurtz changed the title ~~feat(path): allow to config parser to allow non compliant rfc3986 support~~ feat(path): allow utf8 chars in path Sep 3, 2024

joelwurtz force-pushed the feat/allow_non_compliant_char branch 2 times, most recently from 37533c4 to df0651e Compare September 10, 2024 08:33

seanmonstar reviewed Sep 17, 2024

feat(path): allow utf8 chars in path

3f068fd

joelwurtz force-pushed the feat/allow_non_compliant_char branch from a00e411 to 3f068fd Compare September 17, 2024 16:08

feat(utf8) add a test to ensure all utf8 char work

3bc58b2

feat(path): disable miri check for utf8 patch check

c203bae
