Create tool to monitor S3 prefixes #177

Open · Mr0grog opened this issue Feb 25, 2025 · 0 comments

Folks at PEDP would like to keep an eye on the status of a bunch of datasets, each of which is a collection of files sharing a common prefix in an S3 bucket. For example:

  • s3://nrel-pds-wtk/bchrrr/v1.0.0
  • s3://nrel-pds-wtk/canada/v1.0.0

You can’t treat this like an FTP directory and browse to the corresponding URL (e.g. https://nrel-pds-wtk.s3.amazonaws.com/bchrrr/v1.0.0/) to get a list of files (depending on bucket configuration, you can do this at the root of the bucket, but nowhere else). Instead, we need to develop something more specialized to keep an eye on these files.

The AWS CLI can already do a lot of this for us, so we may just need some scripting around that:

```
> aws s3 ls --summarize --recursive s3://nrel-pds-wtk/bchrrr/v1.0.0/
2024-11-19 12:43:09 1779388467382 bchrrr/v1.0.0/bchrrr_conus_2015.h5
2024-11-19 12:55:32 1784947575862 bchrrr/v1.0.0/bchrrr_conus_2016.h5
2024-11-19 12:55:32 1779388467382 bchrrr/v1.0.0/bchrrr_conus_2017.h5
2024-11-19 12:55:32 1779388467382 bchrrr/v1.0.0/bchrrr_conus_2018.h5
2024-11-19 12:55:32 1779388463286 bchrrr/v1.0.0/bchrrr_conus_2019.h5
2024-11-19 18:56:10 1784947571766 bchrrr/v1.0.0/bchrrr_conus_2020.h5
2024-11-19 19:52:51 1779388463286 bchrrr/v1.0.0/bchrrr_conus_2021.h5
2024-11-19 19:53:54 1779388463286 bchrrr/v1.0.0/bchrrr_conus_2022.h5
2024-11-19 19:54:02 1779388463286 bchrrr/v1.0.0/bchrrr_conus_2023.h5

Total Objects: 9
   Total Size: 16025614402918
```

Alternatively, you could use a client library to do this and wrap it in some custom code to produce fancier output (e.g. in Python: list_objects_v2 or S3.Paginator.ListObjectsV2).
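For instance, a minimal boto3 sketch of that approach might look like the following. The bucket and prefix are the NREL examples above; the function name, output shape, and unsigned-request config (reasonable for a public bucket) are my assumptions, not a settled design:

```python
# Minimal sketch: list every object under a prefix with boto3's
# list_objects_v2 paginator. Assumes a public bucket, so requests are
# unsigned; function name and output shape are hypothetical.
import boto3
from botocore import UNSIGNED
from botocore.config import Config

def list_prefix(bucket: str, prefix: str) -> list[dict]:
    """Return key, size, and last-modified time for each object under prefix."""
    s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
    paginator = s3.get_paginator("list_objects_v2")
    objects = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent when nothing matches the prefix -- that is
        # exactly the "prefix disappeared" case we need to notice.
        for obj in page.get("Contents", []):
            objects.append({
                "key": obj["Key"],
                "size": obj["Size"],
                "last_modified": obj["LastModified"].isoformat(),
            })
    return objects

listing = list_prefix("nrel-pds-wtk", "bchrrr/v1.0.0/")
print(f"Total objects: {len(listing)}")
print(f"Total size: {sum(o['size'] for o in listing)}")
```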

At its simplest, I think we just need something to:

  • Run on a set schedule (e.g. once per day).
  • Iterate over each S3 prefix we want to monitor.
  • List all the files, sizes, and modified times under the given prefix.
  • Handle situations where the prefix disappears entirely.
  • Save the results somewhere so they can be compared run to run (see the sketch after this list).
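Here's a sketch of that save-and-compare step, building on the listing function above. The snapshot layout (one JSON file per prefix per day) and all names here are assumptions:

```python
# Sketch: write one JSON snapshot per run and diff it against the most
# recent previous snapshot. File layout and names are hypothetical.
import json
from datetime import date
from pathlib import Path

def compare_and_save(snapshot_dir: Path, name: str, current: list[dict]) -> dict:
    """Diff the current listing against the latest saved snapshot, then save."""
    snapshot_dir.mkdir(parents=True, exist_ok=True)
    older = sorted(snapshot_dir.glob(f"{name}-*.json"))
    previous = json.loads(older[-1].read_text()) if older else []
    old = {o["key"]: o for o in previous}
    new = {o["key"]: o for o in current}
    diff = {
        "added": sorted(new.keys() - old.keys()),
        "deleted": sorted(old.keys() - new.keys()),
        "changed": sorted(
            key for key in old.keys() & new.keys()
            if old[key] != new[key]  # size or last-modified differs
        ),
    }
    path = snapshot_dir / f"{name}-{date.today().isoformat()}.json"
    path.write_text(json.dumps(current, indent=2))
    return diff

diff = compare_and_save(Path("snapshots"), "nrel-pds-wtk_bchrrr-v1.0.0", listing)
```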

The plainest version of this might just be a GitHub Action that runs on a schedule and saves the results to a GitHub repo, where someone can see diffs between different runs.
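A rough sketch of what that workflow could look like (the cron schedule, script name, and commit step are all placeholders, not a finished design):

```yaml
# .github/workflows/monitor-s3.yml -- hypothetical; names are placeholders.
name: Monitor S3 prefixes
on:
  schedule:
    - cron: "0 6 * * *"  # once per day
  workflow_dispatch:

permissions:
  contents: write  # let the job commit new snapshots

jobs:
  snapshot:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
      - run: pip install boto3
      - run: python monitor_prefixes.py  # hypothetical script wrapping the sketches above
      - run: |
          git config user.name "github-actions[bot]"
          git config user.email "github-actions[bot]@users.noreply.github.com"
          git add snapshots/
          git commit -m "Snapshot $(date -u +%Y-%m-%d)" || echo "No changes"
          git push
```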

Ideally, this might also have some way to send people an e-mail or Slack message if files are deleted.
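For the Slack half of that, posting the diff to an incoming webhook would probably suffice. A sketch, assuming the webhook URL is provided via an environment variable (e.g. a repo secret in the workflow above):

```python
# Sketch: post to a Slack incoming webhook when files disappear.
# SLACK_WEBHOOK_URL is an assumed secret, not something that exists yet.
import os
import requests

def notify_deletions(prefix: str, deleted: list[str]) -> None:
    if not deleted:
        return
    requests.post(
        os.environ["SLACK_WEBHOOK_URL"],
        json={"text": f"{len(deleted)} file(s) deleted under {prefix}:\n" + "\n".join(deleted)},
        timeout=10,
    )

notify_deletions("s3://nrel-pds-wtk/bchrrr/v1.0.0/", diff["deleted"])
```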

Fancier stuff that would probably be useful:

  • Produce a WARC of some sort so other archiving tools can consume it in a semi-standardized way. Upload this WARC to the Internet Archive.
  • Produce some readable text or HTML output that we could import into the Web Monitoring database and display at monitoring.envirodatagov.org (and therefore also include in weekly analyst sheets).