Create tool to monitor S3 prefixes #177

Open · Mr0grog opened this issue Feb 25, 2025 · 0 comments

Folks at PEDP would like to keep an eye on the status of a bunch of datasets, each of which is a collection of files sharing a common prefix in an S3 bucket. For example:

  • s3://nrel-pds-wtk/bchrrr/v1.0.0
  • s3://nrel-pds-wtk/canada/v1.0.0

You can’t treat this like an FTP directory and browse to the corresponding URL (e.g. https://nrel-pds-wtk.s3.amazonaws.com/bchrrr/v1.0.0/) to get a list of files (depending on bucket configuration, you can do this at the root of the bucket, but nowhere else). Instead, we need to develop something more specialized to keep an eye on these files.

The AWS CLI can already do a lot of this for us, so we may just need some scripting around that:

```
> aws s3 ls --summarize --recursive s3://nrel-pds-wtk/bchrrr/v1.0.0/
2024-11-19 12:43:09 1779388467382 bchrrr/v1.0.0/bchrrr_conus_2015.h5
2024-11-19 12:55:32 1784947575862 bchrrr/v1.0.0/bchrrr_conus_2016.h5
2024-11-19 12:55:32 1779388467382 bchrrr/v1.0.0/bchrrr_conus_2017.h5
2024-11-19 12:55:32 1779388467382 bchrrr/v1.0.0/bchrrr_conus_2018.h5
2024-11-19 12:55:32 1779388463286 bchrrr/v1.0.0/bchrrr_conus_2019.h5
2024-11-19 18:56:10 1784947571766 bchrrr/v1.0.0/bchrrr_conus_2020.h5
2024-11-19 19:52:51 1779388463286 bchrrr/v1.0.0/bchrrr_conus_2021.h5
2024-11-19 19:53:54 1779388463286 bchrrr/v1.0.0/bchrrr_conus_2022.h5
2024-11-19 19:54:02 1779388463286 bchrrr/v1.0.0/bchrrr_conus_2023.h5

Total Objects: 9
   Total Size: 16025614402918
```

Alternatively, you could use a client library to do this and wrap it in some custom code to produce fancier output (e.g. in Python: list_objects_v2 or S3.Paginator.ListObjectsV2).
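For instance, a minimal boto3 sketch of that approach might look like the following. The bucket and prefix are the NREL examples above; the function name, output shape, and unsigned-request config (reasonable for a public bucket) are my assumptions, not a settled design:

```python
# Minimal sketch: list every object under a prefix with boto3's
# list_objects_v2 paginator. Assumes a public bucket, so requests are
# unsigned; function name and output shape are hypothetical.
import boto3
from botocore import UNSIGNED
from botocore.config import Config

def list_prefix(bucket: str, prefix: str) -> list[dict]:
    """Return key, size, and last-modified time for each object under prefix."""
    s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
    paginator = s3.get_paginator("list_objects_v2")
    objects = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent when nothing matches the prefix -- that is
        # exactly the "prefix disappeared" case we need to notice.
        for obj in page.get("Contents", []):
            objects.append({
                "key": obj["Key"],
                "size": obj["Size"],
                "last_modified": obj["LastModified"].isoformat(),
            })
    return objects

listing = list_prefix("nrel-pds-wtk", "bchrrr/v1.0.0/")
print(f"Total objects: {len(listing)}")
print(f"Total size: {sum(o['size'] for o in listing)}")
```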

At its simplest, I think we just need something to:

  • Run on a set schedule (e.g. once per day).
  • Iterate over each S3 prefix we want to monitor.
  • List all the files, sizes, and modified times under the given prefix.
  • Handle situations where the prefix disappears entirely.
  • Save the results somewhere so they can be compared run to run (see the sketch after this list).
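Here's a sketch of that save-and-compare step, building on the listing function above. The snapshot layout (one JSON file per prefix per day) and all names here are assumptions:

```python
# Sketch: write one JSON snapshot per run and diff it against the most
# recent previous snapshot. File layout and names are hypothetical.
import json
from datetime import date
from pathlib import Path

def compare_and_save(snapshot_dir: Path, name: str, current: list[dict]) -> dict:
    """Diff the current listing against the latest saved snapshot, then save."""
    snapshot_dir.mkdir(parents=True, exist_ok=True)
    older = sorted(snapshot_dir.glob(f"{name}-*.json"))
    previous = json.loads(older[-1].read_text()) if older else []
    old = {o["key"]: o for o in previous}
    new = {o["key"]: o for o in current}
    diff = {
        "added": sorted(new.keys() - old.keys()),
        "deleted": sorted(old.keys() - new.keys()),
        "changed": sorted(
            key for key in old.keys() & new.keys()
            if old[key] != new[key]  # size or last-modified differs
        ),
    }
    path = snapshot_dir / f"{name}-{date.today().isoformat()}.json"
    path.write_text(json.dumps(current, indent=2))
    return diff

diff = compare_and_save(Path("snapshots"), "nrel-pds-wtk_bchrrr-v1.0.0", listing)
```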

The plainest version of this might just be a GitHub Action that runs on a schedule and saves the results to a GitHub repo, where someone can see diffs between different runs.
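A rough sketch of what that workflow could look like (the cron schedule, script name, and commit step are all placeholders, not a finished design):

```yaml
# .github/workflows/monitor-s3.yml -- hypothetical; names are placeholders.
name: Monitor S3 prefixes
on:
  schedule:
    - cron: "0 6 * * *"  # once per day
  workflow_dispatch:

permissions:
  contents: write  # let the job commit new snapshots

jobs:
  snapshot:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
      - run: pip install boto3
      - run: python monitor_prefixes.py  # hypothetical script wrapping the sketches above
      - run: |
          git config user.name "github-actions[bot]"
          git config user.email "github-actions[bot]@users.noreply.github.com"
          git add snapshots/
          git commit -m "Snapshot $(date -u +%Y-%m-%d)" || echo "No changes"
          git push
```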

Ideally, this might also have some way to send people an e-mail or Slack message if files are deleted.
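For the Slack half of that, posting the diff to an incoming webhook would probably suffice. A sketch, assuming the webhook URL is provided via an environment variable (e.g. a repo secret in the workflow above):

```python
# Sketch: post to a Slack incoming webhook when files disappear.
# SLACK_WEBHOOK_URL is an assumed secret, not something that exists yet.
import os
import requests

def notify_deletions(prefix: str, deleted: list[str]) -> None:
    if not deleted:
        return
    requests.post(
        os.environ["SLACK_WEBHOOK_URL"],
        json={"text": f"{len(deleted)} file(s) deleted under {prefix}:\n" + "\n".join(deleted)},
        timeout=10,
    )

notify_deletions("s3://nrel-pds-wtk/bchrrr/v1.0.0/", diff["deleted"])
```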

Fancier stuff that would probably be useful:

  • Produce a WARC of some sort so other archiving tools can consume it in a semi-standardized way. Upload this WARC to the Internet Archive.
  • Produce some readable text or HTML output that we could import into the Web Monitoring database and display at monitoring.envirodatagov.org (and therefore also include in weekly analyst sheets).