List objects with unicode - should sort keys using byte-by-byte order and not using utf8 sort order #8218

Open
guymguym opened this issue Jul 21, 2024 · 1 comment · May be fixed by #8324
Labels
S3-Compatibility S3 Compatibility and Namespace over AWS

Comments

@guymguym
Member

Environment info

  • NooBaa Version: 5.17
  • Platform: Any

Actual behavior

  1. We always read object keys into JS strings and then sort them as strings. JS compares strings by UTF-16 code units, so this is the order the issue title calls "utf8 sort order".
  2. However, AWS S3 sorts keys in their "binary" form, comparing the UTF-8 encoded keys byte-by-byte.
  3. We can see this empirically, and it is also hinted at in the AWS docs.
  4. See the linked test example (copied under Steps to reproduce), which I checked: AWS indeed returns the "binary order" while noobaa returns the "UTF8 order".
  5. The AWS docs hint at this behavior by saying "List results are always returned in UTF-8 binary order." See https://docs.aws.amazon.com/AmazonS3/latest/userguide/ListingKeysUsingAPIs.html.
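
To make the difference concrete, here is a minimal Node.js sketch (not NooBaa code). The three keys are illustrative and written with escapes so the code points are unambiguous; U+FF61 in particular is an assumed stand-in for any BMP character above the surrogate range.

```javascript
// Illustrative keys only. The comments give each character's encodings.
const keys = [
    'file_\u{103A3}_4.txt', // U+103A3, UTF-16: D800 DFA3, UTF-8: F0 90 8E A3
    'file_\uFF61_2.txt',    // U+FF61,  UTF-16: FF61,      UTF-8: EF BD A1
    'file_\uA98F_1.txt',    // U+A98F,  UTF-16: A98F,      UTF-8: EA A6 8F
];

// The default JS sort compares strings by UTF-16 code units,
// so the surrogate pair (D800 ...) lands before U+FF61.
const utf16_order = [...keys].sort();

// Sorting the UTF-8 encodings with Buffer.compare gives byte-by-byte order.
const byte_order = keys
    .map(key => Buffer.from(key, 'utf8'))
    .sort(Buffer.compare)
    .map(buf => buf.toString('utf8'));

// Print only the digit from each name to show the two orders.
console.log('utf16:', utf16_order.map(key => key.slice(-5, -4)).join(','));
console.log('bytes:', byte_order.map(key => key.slice(-5, -4)).join(','));
```

With these keys the default sort yields 1, 4, 2, while the byte-wise sort yields 1, 2, 4.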

Expected behavior

  1. The sort order of ListObjects with unicode keys should be compatible with AWS and should not rely on the "UTF8 sort order" described above.
  2. We can load each string into a buffer and use Buffer.compare instead. The concern is the amount of work and GC pressure this adds to the listing flow, so we should try to minimize that overhead.
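
One possible way to limit the allocation overhead, sketched here as an assumption rather than as the planned fix, is to compare the JS strings directly and only adjust the first differing code unit so that surrogates sort above U+E000..U+FFFF. For well-formed strings this gives the same result as comparing the UTF-8 bytes. The function name `compare_keys_code_point_order` is hypothetical.

```javascript
// Hypothetical helper (not NooBaa code): compares two JS strings in Unicode
// code-point order without allocating Buffers. For well-formed strings this
// matches the byte order of their UTF-8 encodings.
function compare_keys_code_point_order(a, b) {
    const len = Math.min(a.length, b.length);
    for (let i = 0; i < len; ++i) {
        let ca = a.charCodeAt(i);
        let cb = b.charCodeAt(i);
        if (ca === cb) continue;
        // Fix-up at the first difference: shift U+E000..U+FFFF down and
        // surrogates (0xD800..0xDFFF) up, so surrogates compare highest.
        if (ca >= 0xD800) ca += (ca >= 0xE000) ? -0x800 : 0x2000;
        if (cb >= 0xD800) cb += (cb >= 0xE000) ? -0x800 : 0x2000;
        return ca < cb ? -1 : 1;
    }
    // One string is a prefix of the other (or they are equal).
    return Math.sign(a.length - b.length);
}

// Reference comparator that allocates two Buffers per comparison.
function compare_keys_with_buffers(a, b) {
    return Buffer.compare(Buffer.from(a, 'utf8'), Buffer.from(b, 'utf8'));
}
```

Usage would be `keys.sort(compare_keys_code_point_order)`. A caveat: strings with lone surrogates are not covered, because `Buffer.from(str, 'utf8')` encodes a lone surrogate as U+FFFD, so the two comparators can disagree on such input.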

Steps to reproduce

  1. Here is a copy of the flow described in: https://forum.moonwalkinc.com/t/determining-s3-listing-order/116

Some third-party implementations of Amazon’s S3 protocol return object information (‘file listings’) in UTF-16 code-unit order rather than the Amazon-compatible Unicode code-point order.

Introduced in Moonwalk 2023.2, when configuring Moonwalk’s s3generic:// plugin (as well as certain other plugins that provide 3rd party S3 support such as s3cos://), a ‘UTF-16 listing order work-around’ option is provided in the Plugin Configuration panel to allow Moonwalk to correctly process results returned in this non-standard order and thereby allow correct and complete scanning of your S3 buckets.

How do you determine whether you need to enable this option?
The following experiment will test the sort order of your S3-compatible device.

Create a new folder on a Windows server with Moonwalk Agent installed
Add files with the EXACT names shown below - use cut & paste to get them right
file_ꦏ_1.txt
file__2.txt
file__3.txt
file_𐎣_4.txt
Don’t worry about the order that Windows shows the files in and don’t worry if some programs just show the characters between the underscores as a box or a question mark etc
Use an Ingest policy to upload this folder to a test bucket on your S3-compatible storage
Use a Gather Statistics policy to scan the location to which you just ingested the files
a. Tick ‘Export raw file metadata’
b. Untick the ‘Compress (gzip)’ option
c. Choose ‘CSV’ format
Check the exported CSV data (e.g. using notepad) to determine the order in which the files appear:
If the files appear in 1, 2, 3, 4 order: congratulations, your S3-compatible device uses the expected AWS ordering - you should NOT tick the workaround box
If the files appear in 1, 4, 2, 3 order: your device is using UTF-16 code-unit order - you WILL need to tick the ‘UTF-16 listing order work-around’ box
Note: this option does not change the order in which results are actually returned, it just ensures that Moonwalk processes them correctly.
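
The same check can be run directly on the keys a listing returns. The sketch below assumes the keys are already available as an array of JS strings in the order the server returned them; `detect_listing_order` is a hypothetical helper, not part of NooBaa or Moonwalk.

```javascript
// Hypothetical helper: classifies the order of keys as returned by a listing.
function detect_listing_order(keys) {
    // True if every adjacent pair is in non-decreasing order under cmp.
    const is_sorted = cmp =>
        keys.every((key, i) => i === 0 || cmp(keys[i - 1], key) <= 0);

    // UTF-8 byte-by-byte order (the order AWS documents).
    const by_bytes = (a, b) =>
        Buffer.compare(Buffer.from(a, 'utf8'), Buffer.from(b, 'utf8'));

    // UTF-16 code-unit order (what JS string comparison does).
    const by_utf16 = (a, b) => (a < b ? -1 : a > b ? 1 : 0);

    const bytes = is_sorted(by_bytes);
    const utf16 = is_sorted(by_utf16);
    if (bytes && utf16) return 'inconclusive'; // keys do not distinguish the two
    if (bytes) return 'utf8-binary';
    if (utf16) return 'utf16-code-unit';
    return 'unsorted';
}
```

A listing only distinguishes the two orders if it contains both a supplementary-plane character and a BMP character in U+E000..U+FFFF at the first differing position; otherwise the helper reports 'inconclusive'.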

More information - Screenshots / Logs / Other output

@guymguym guymguym added the S3-Compatibility S3 Compatibility and Namespace over AWS label Jul 21, 2024
@tangledbytes tangledbytes linked a pull request Sep 3, 2024 that will close this issue
2 tasks

This issue had no activity for too long - it will now be labeled stale. Update it to prevent it from getting closed.
