Folks at PEDP would like to keep an eye on the status of a bunch of datasets, each of which is a collection of files in an S3 bucket sharing a common prefix. For example:
s3://nrel-pds-wtk/bchrrr/v1.0.0
s3://nrel-pds-wtk/canada/v1.0.0
You can’t treat this like an FTP directory and browse to the corresponding URL (e.g. https://nrel-pds-wtk.s3.amazonaws.com/bchrrr/v1.0.0/) to get a list of files (depending on bucket configuration, you can do this at the root of the bucket, but nowhere else). Instead, we need to develop something more specialized to keep an eye on these files.
The AWS CLI can already do a lot of this for us, so we may just need some scripting around that:
```
> aws s3 ls --summarize --recursive s3://nrel-pds-wtk/bchrrr/v1.0.0/
2024-11-19 12:43:09 1779388467382 bchrrr/v1.0.0/bchrrr_conus_2015.h5
2024-11-19 12:55:32 1784947575862 bchrrr/v1.0.0/bchrrr_conus_2016.h5
2024-11-19 12:55:32 1779388467382 bchrrr/v1.0.0/bchrrr_conus_2017.h5
2024-11-19 12:55:32 1779388467382 bchrrr/v1.0.0/bchrrr_conus_2018.h5
2024-11-19 12:55:32 1779388463286 bchrrr/v1.0.0/bchrrr_conus_2019.h5
2024-11-19 18:56:10 1784947571766 bchrrr/v1.0.0/bchrrr_conus_2020.h5
2024-11-19 19:52:51 1779388463286 bchrrr/v1.0.0/bchrrr_conus_2021.h5
2024-11-19 19:53:54 1779388463286 bchrrr/v1.0.0/bchrrr_conus_2022.h5
2024-11-19 19:54:02 1779388463286 bchrrr/v1.0.0/bchrrr_conus_2023.h5

Total Objects: 9
   Total Size: 16025614402918
```
Alternatively, you could use a client library and wrap it in some custom code to produce fancier output (e.g. in Python: `list_objects_v2` or `S3.Paginator.ListObjectsV2`).
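A minimal sketch of the Python route, assuming boto3's ListObjectsV2 paginator; the function names (`snapshot_from_pages`, `list_prefix`) are made up for illustration, not an established API:

```python
# Sketch: flatten a ListObjectsV2 listing into a snapshot we can save
# and compare across runs. `snapshot_from_pages` is pure so it can be
# tested without AWS access; `list_prefix` needs boto3 and credentials.


def snapshot_from_pages(pages):
    """Flatten ListObjectsV2 response pages into {key: (size, last_modified)}.

    Pages with no "Contents" (e.g. a prefix that vanished) contribute nothing,
    so a missing prefix simply yields an empty snapshot.
    """
    snapshot = {}
    for page in pages:
        for obj in page.get("Contents", []):
            snapshot[obj["Key"]] = (obj["Size"], obj["LastModified"])
    return snapshot


def list_prefix(bucket, prefix):
    """List every object under `prefix` (requires boto3; pip install boto3)."""
    import boto3  # third-party dependency, imported lazily

    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    return snapshot_from_pages(paginator.paginate(Bucket=bucket, Prefix=prefix))


# e.g. snap = list_prefix("nrel-pds-wtk", "bchrrr/v1.0.0/")
```

The paginator matters here: `list_objects_v2` returns at most 1,000 keys per call, so large datasets need the pagination loop.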
At its simplest, I think we just need something to:
- Run on a set schedule (e.g. once a day).
- Iterate over each S3 prefix we want to monitor.
- List all files under the prefix, along with their sizes and last-modified times.
- Handle situations where the prefix disappears entirely.
- Save the results somewhere so they can be compared run to run.
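The compare-run-to-run step above might look like this, assuming each run's listing is saved as a `{key: (size, last_modified)}` mapping; the function and report field names are hypothetical:

```python
# Sketch: diff two runs' snapshots, flagging added/removed/changed files
# and the case where the whole prefix disappeared.


def diff_snapshots(previous, current):
    """Compare two {key: (size, last_modified)} snapshots."""
    return {
        "added": sorted(set(current) - set(previous)),
        "removed": sorted(set(previous) - set(current)),
        "changed": sorted(
            key
            for key in set(previous) & set(current)
            if previous[key] != current[key]
        ),
        # The prefix vanishing looks like "everything removed": the old
        # run had files and the new listing is empty.
        "prefix_vanished": bool(previous) and not current,
    }
```

A "changed" entry means the size or modified time differs between runs, which is usually the signal that a file was silently replaced.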
The plainest version of this might just be a GitHub action that runs and saves the results to a GH repo, where someone can see diffs between different runs.
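A hypothetical workflow along those lines (all file names and the cron schedule are placeholders; the `--no-sign-request` flag assumes the bucket allows anonymous listing):

```yaml
name: Monitor S3 prefixes
on:
  schedule:
    - cron: "0 6 * * *"   # once a day
  workflow_dispatch: {}

jobs:
  snapshot:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: List monitored prefixes
        run: |
          mkdir -p snapshots
          aws s3 ls --no-sign-request --summarize --recursive \
            s3://nrel-pds-wtk/bchrrr/v1.0.0/ > snapshots/bchrrr-v1.0.0.txt
      - name: Commit changes
        run: |
          git config user.name "s3-monitor-bot"
          git config user.email "s3-monitor-bot@users.noreply.github.com"
          git add snapshots/
          if ! git diff --cached --quiet; then
            git commit -m "Daily S3 snapshot"
            git push
          fi
```

Committing the raw listing means the repo's commit history is itself the diff view: `git log -p snapshots/` shows exactly what changed between runs.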
Ideally, this might also have something to send people an e-mail or Slack message if files are deleted.
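The Slack half of that could be a stdlib-only sketch using a Slack incoming webhook (the webhook URL would be stored as a secret; function names are illustrative):

```python
# Sketch: format a deletion alert and POST it to a Slack incoming webhook.
import json
import urllib.request


def deletion_message(prefix, removed_keys):
    """Build a human-readable alert about files deleted under a prefix."""
    lines = [f"Files deleted under {prefix}:"]
    lines += [f"  - {key}" for key in sorted(removed_keys)]
    return "\n".join(lines)


def post_to_slack(webhook_url, text):
    """POST the message to a Slack incoming webhook (requires network access)."""
    body = json.dumps({"text": text}).encode("utf-8")
    req = urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status  # Slack returns 200 on success
```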
Fancier stuff that would probably be useful:
Produce a WARC of some sort so other archiving tools can consume it in a semi-standardized way, and upload that WARC to the Internet Archive.
Produce some readable text or HTML output that we could import into the Web Monitoring database and display at monitoring.envirodatagov.org (and therefore also include in weekly analyst sheets).