
Working with Amazon S3

S3 CLI Reference

Using the AWS command-line tools, work through the following exercises:

Make a bucket:

aws s3 mb s3://<USERID>-data
# for example s3://nem2p-data

List all your buckets:

aws s3 ls

Copy a single file from the instructor's bucket into your own:

aws s3 cp s3://uvasds-data/taxi/yellow_tripdata_2025-11.parquet s3://nem2p-ds5220-data

Copy all files matching a pattern from the instructor's bucket into your own:

aws s3 cp s3://uvasds-data/taxi/ s3://nem2p-ds5220-data --recursive --exclude "*" --include "*.parquet" 

List all files in a bucket or subfolder of a bucket:

aws s3 ls s3://nem2p-ds5220-data/

Sync objects from a source to a destination:

aws s3 sync SOURCE DESTINATION

# Sync objects from a source and remove files on the destination
# that do not exist in the source

aws s3 sync . s3://amzn-s3-demo-bucket --delete
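The copy/delete decision behind `sync --delete` can be sketched in a few lines of Python. This is a simplified model, assuming comparison by key name only; the real CLI also compares file size and last-modified time before copying:

```python
# Simplified model of the decisions `aws s3 sync --delete` makes.
# Real sync also checks size and modification time; this version
# looks only at which keys exist on each side.
def sync_plan(source_keys, dest_keys):
    source, dest = set(source_keys), set(dest_keys)
    to_copy = sorted(source - dest)      # in source, missing from destination
    to_delete = sorted(dest - source)    # in destination only; removed by --delete
    return to_copy, to_delete

copy_list, delete_list = sync_plan(
    ["a.parquet", "b.parquet"],
    ["b.parquet", "old.csv"],
)
print(copy_list)    # ['a.parquet']
print(delete_list)  # ['old.csv']
```

Without `--delete`, only the first list is acted on; objects that exist solely on the destination are left in place.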

Presign a URL to a private file that expires in 60 seconds:

# use the S3 URI of a file that already exists in your bucket
aws s3 presign --expires-in 60 s3://nem2p-ds5220-data/yellow_tripdata_2025-11.parquet

This returns a signed URL to the object (the credential, date, and signature in yours will differ):

https://nem2p-ds5220-data.s3.us-east-1.amazonaws.com/yellow_tripdata_2025-11.parquet?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAWNJE4XNUL4LXBHTQ%2F20260210%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20260210T195611Z&X-Amz-Expires=60&X-Amz-SignedHeaders=host&X-Amz-Signature=1faed501ebe70dfc54a0fdf7f7b7db27cc2709cb766e5778c1a1442dc82faaff
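Everything that makes the URL work is carried in its query string: anyone holding the link can fetch the object until `X-Amz-Expires` seconds past `X-Amz-Date`. A quick stdlib way to inspect those fields, using an abbreviated version of the example URL above:

```python
from urllib.parse import urlparse, parse_qs

# Abbreviated presigned URL (credential and signature omitted for brevity).
url = ("https://nem2p-ds5220-data.s3.us-east-1.amazonaws.com/"
       "yellow_tripdata_2025-11.parquet"
       "?X-Amz-Algorithm=AWS4-HMAC-SHA256"
       "&X-Amz-Date=20260210T195611Z"
       "&X-Amz-Expires=60"
       "&X-Amz-SignedHeaders=host")

params = parse_qs(urlparse(url).query)
print(params["X-Amz-Expires"][0])    # 60  (lifetime in seconds)
print(params["X-Amz-Algorithm"][0])  # AWS4-HMAC-SHA256
```

This is why a presigned URL needs no AWS credentials on the receiving end: the signature itself proves the grantor's permission, but only for that one object and that time window.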

DuckDB

Install the DuckDB CLI
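DuckDB publishes a one-line installer script, and it is also available from common package managers. Either route below is a reasonable starting point; check duckdb.org for the instructions specific to your platform:

```shell
# Official installer script (macOS/Linux):
curl https://install.duckdb.org | sh

# Or via Homebrew on macOS:
brew install duckdb
```

Run `duckdb` to launch the CLI; the `D` prompt shown in the examples below is where you enter statements.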

Remote Queries

-- set up credentials (the httpfs and aws extensions autoload in
-- recent DuckDB; on older versions, INSTALL and LOAD them first)
D SET s3_use_ssl=true;
D CALL load_aws_credentials();

-- a simple select from an S3 object using duckdb
-- UPDATE this s3 URI to the bucket you own
D select * from 's3://uvasds-data/taxi/yellow_tripdata_2025-11.parquet';

Reusable Views

CREATE VIEW my_s3_data AS 
SELECT * FROM 's3://bucket/data/*.parquet';

-- Now query the view
SELECT * FROM my_s3_data WHERE condition;

Glob Patterns

-- select from all objects matching a pattern
-- UPDATE this s3 URI to the bucket you own
select count(*) from 's3://uvasds-data/taxi/*.parquet';

Other

-- Hive-partitioned data
SELECT * FROM 's3://bucket/data/year=*/month=*/*.parquet';
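Hive partitioning encodes column values directly in the object path, which is how DuckDB can expose `year` and `month` above as queryable columns without reading any file contents. A stdlib sketch of how those `key=value` segments decode (the path below is illustrative, not one of the course files):

```python
# Decode hive-style key=value path segments into partition columns.
def hive_partitions(key):
    parts = {}
    for segment in key.split("/"):
        if "=" in segment:
            k, v = segment.split("=", 1)
            parts[k] = v
    return parts

print(hive_partitions("data/year=2025/month=11/yellow.parquet"))
# {'year': '2025', 'month': '11'}
```

Because the partition values live in the path, a query that filters on them (e.g. `WHERE year = 2025`) can skip whole prefixes without downloading the excluded files.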