dbt and Redshift for Analytics Engineering

This project is meant for experimenting with dbt and the dbt-redshift adapter for analytics engineering, using the NYC TLC Trip Record dataset as the data source and the Kimball dimensional modeling technique.

IMPORTANT NOTE: To access awsdatacatalog from Redshift, the IAM auth method is required. The IAM user also explicitly needs USAGE grants on that database; therefore, in the Redshift Query Editor, run:

GRANT USAGE ON DATABASE awsdatacatalog to "IAM:my_iam_user";
GRANT ALL ON DATABASE <DATABASE_NAME> to "IAM:my_iam_user";
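
To sanity-check the grants without going back to the Query Editor, you can run a probe query through the Redshift Data API. A minimal sketch, assuming a Redshift Serverless workgroup named default and some existing Glue table (both are placeholders to replace):

STATEMENT_ID=$(aws redshift-data execute-statement \
  --workgroup-name default \
  --database dev \
  --sql "SELECT * FROM awsdatacatalog.raw_nyc_tlc_tripdata.<table_name> LIMIT 10" \
  --query 'Id' --output text)
# Once the statement finishes, fetch the rows:
aws redshift-data get-statement-result --id "$STATEMENT_ID"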

Tech Stack

Up and Running

Developer Setup

1. Install the dependencies declared in pyproject.toml:

uv sync

2. Activate the virtualenv created by uv:

source .venv/bin/activate

3. (Optional) Install pre-commit:

brew install pre-commit

From the root folder (where .pre-commit-config.yaml is located), run:

pre-commit install
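
Once the hooks are installed, you can optionally run them against the whole codebase to verify the setup:

pre-commit run --all-files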

4. Set up the dbt profiles.yml accordingly (use profiles.tmpl.yml as a template)

4.1. By default, the profiles_dir is the user's $HOME/.dbt/:

mkdir -p ~/.dbt/
cat profiles.tmpl.yml >> ~/.dbt/profiles.yml
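
If you prefer to write the profile by hand instead of copying the template, a plausible shape is sketched below. The profile name (dbt_redshift), the iam method, and the exact keys are assumptions based on the IAM note above and the variables exported in step 4.2; the quoted heredoc keeps the env_var() calls literal so dbt resolves them at runtime:

cat > ~/.dbt/profiles.yml <<'EOF'
dbt_redshift:
  target: dev
  outputs:
    dev:
      type: redshift
      method: iam
      host: "{{ env_var('DBT_REDSHIFT_HOST') }}"
      port: 5439
      dbname: "{{ env_var('DBT_REDSHIFT_DATABASE') }}"
      schema: "{{ env_var('DBT_REDSHIFT_TARGET_SCHEMA') }}"
      threads: 4
EOF

Note that the top-level profile name must match the profile: key in dbt_project.yml.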

4.2. Set the environment variables for dbt-redshift:

export DBT_REDSHIFT_HOST=redshift.[id].[region].redshift-serverless.amazonaws.com
export DBT_REDSHIFT_DATABASE=dev
export DBT_REDSHIFT_USE_DATA_CATALOG=1
export DBT_REDSHIFT_SOURCE_GLUE_CATALOG_DB=raw_nyc_tlc_tripdata
export DBT_REDSHIFT_TARGET_SCHEMA=nyc_tlc_record_data
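
How these variables are consumed is project-specific, but for illustration, a dbt source pointing at the Glue catalog database could be declared roughly as below (the file path, source name, and table name are hypothetical; DBT_REDSHIFT_USE_DATA_CATALOG would presumably gate this logic elsewhere in the project):

cat > models/staging/sources.yml <<'EOF'
version: 2

sources:
  - name: nyc_tlc_tripdata
    database: awsdatacatalog
    schema: "{{ env_var('DBT_REDSHIFT_SOURCE_GLUE_CATALOG_DB', 'raw_nyc_tlc_tripdata') }}"
    tables:
      - name: yellow_tripdata
EOF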

4.3. Also, either have your AWS credentials set in ~/.aws/credentials or export them as well:

export AWS_ACCESS_KEY_ID=<your-access-key-id>
export AWS_SECRET_ACCESS_KEY=<your-secret-access-key>
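
To confirm which identity those credentials resolve to, the AWS CLI provides a quick check:

aws sts get-caller-identity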

5. Install dbt dependencies and trigger the pipeline

5.1. Run dbt deps to install the dbt package dependencies:

dbt deps

5.2. Run dbt seed to create and load the tables from the .csv seed files into the target schema:

dbt seed
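
To (re)load a single seed rather than all of them, dbt's node selection applies to seeds as well (the seed name here is a hypothetical example):

dbt seed --select taxi_zone_lookup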

5.3. Run dbt build to run and test the dbt models:

dbt build

# Alternatively, you can run only a subset of the models with:

## +models/staging: runs the upstream models that 'models/staging'
## depends on first, and then the target models themselves
dbt [build|run] --select +models/staging

## models/staging+: runs the target models first, and then all
## downstream models that depend on them
dbt [build|run] --select models/staging+
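
To preview which nodes a selector will match before running anything, the same selection syntax works with dbt ls:

dbt ls --select +models/staging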

6. Generate the Docs and the Data Lineage graph with:

dbt docs generate
dbt docs serve

Access the generated docs at:

open http://localhost:8080

Containerization and Testing

1. Build the Docker Image with:

docker build -t dbt-redshift:latest . --no-cache

2. Start a container with it:

docker run -d --rm \
  -e AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_ID} \
  -e AWS_SECRET_ACCESS_KEY=${AWS_SECRET_ACCESS_KEY} \
  -e DBT_REDSHIFT_HOST=${DBT_REDSHIFT_HOST} \
  -e DBT_REDSHIFT_DATABASE=dev \
  -e DBT_REDSHIFT_USE_DATA_CATALOG=1 \
  -e DBT_REDSHIFT_SOURCE_GLUE_CATALOG_DB=raw_nyc_tlc_tripdata \
  -e DBT_REDSHIFT_TARGET_SCHEMA=nyc_tlc_record_data \
  --name dbt-redshift \
  dbt-redshift
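
The container starts detached; assuming the image's entrypoint kicks off the dbt pipeline (that depends on the Dockerfile's CMD/ENTRYPOINT), you can follow its output with:

docker logs -f dbt-redshift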

TODO:

  • PEP-517: Packaging and dependency management with uv
  • Bootstrap dbt with Redshift Adapter (dbt-redshift)
  • Add dbt macro to configure target schemas dynamically (see the sketch below)
  • Run dbt-core in Docker
  • Terraform AWS Glue Catalog and Crawler
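
For the dynamic target-schema item above, a common starting point is overriding dbt's built-in generate_schema_name macro. A minimal sketch based on the stock override from the dbt documentation (the file name is a convention, not necessarily what this repo uses):

cat > macros/generate_schema_name.sql <<'EOF'
{% macro generate_schema_name(custom_schema_name, node) -%}
    {#- Use the custom schema as-is instead of dbt's default
        "<target_schema>_<custom_schema>" concatenation -#}
    {%- if custom_schema_name is none -%}
        {{ target.schema }}
    {%- else -%}
        {{ custom_schema_name | trim }}
    {%- endif -%}
{%- endmacro %}
EOF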