mattermost-data-warehouse

This repository contains the code for loading data into Mattermost's data warehouse, performing transformations, and exporting data to other tools.

Repository structure

.
├── airflow                 <--- Airflow related code.
│  ├── dags                 <--- DAGs executed by airflow.
│  ├── plugins              <--- Custom airflow plugins used with DAGs.
│  └── tests                <--- Tests for dags and plugins.
├── docs                    <--- Extra documentation.
├── extract                 <--- Python scripts that extract data from various locations.
│  └── s3_extract           <--- Various utilities for importing data from S3 to Snowflake
├── load        
│  └── snowflake            <--- Snowflake role definitions
├── README.md               <--- This file
├── poetry.lock             <--- Pinned dependency versions
├── pyproject.toml          <--- PEP-518 build system requirements
├── requirements.txt        <--- Dependencies (deprecated)
├── tests                   <--- Unit tests for python code.
├── transform
│  ├── snowflake-dbt        <--- Snowflake DBT models.
│  ├── mattermost-analytics <--- New DBT project for Mattermost analytics.
│  └── sql                  <--- SQL scripts that get executed by DAGs
└── utils                   <--- Various Python scripts

What does this repo contain?

Extract

Tools for extracting data

  • from S3 to Snowflake and
  • from Snowflake to Postgres.

All extractions are executed using Airflow on a Kubernetes cluster.
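
For illustration, the sketch below shows the general shape of an S3-to-Snowflake load, assuming snowflake-connector-python and an existing external stage; the stage and table names are hypothetical, not actual objects in this warehouse.

```python
# Illustrative sketch only: loads staged S3 files into a Snowflake table.
# Assumes an open snowflake-connector-python connection (see "Snowflake
# connections" below for how credentials are configured) and a hypothetical
# external stage pointing at the S3 bucket.
def copy_from_s3(conn, stage_path: str, table: str) -> None:
    """Run a COPY INTO statement that loads the staged files into `table`."""
    conn.cursor().execute(
        f"COPY INTO {table} "
        f"FROM @{stage_path} "
        "FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)"
    )
```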

Transform

  • DBT project for running transformations on Snowflake. The DBT project runs on DBT Cloud.
  • SQL scripts for NPS feedback. The scripts are executed by an Airflow DAG.

Load

Snowflake role definitions. An Airflow DAG applies the updates.

Billing

Triggers invoice generation for subscriptions. Triggered via an Airflow DAG.

Utils

A set of Python scripts performing custom ETLs. The utilities run as part of Airflow DAGs.

DAGs

Airflow DAGs that orchestrate ETL pipelines.
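
As a rough sketch (not an actual DAG from this repository), a pipeline is typically declared as follows; the DAG id, schedule, and task body are illustrative only and assume Airflow 2.x.

```python
# A minimal sketch of an Airflow DAG, assuming Airflow 2.x; the DAG id, schedule
# and task body are illustrative, not an actual pipeline from this repository.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def _extract():
    # e.g. call one of the extract/ utilities to pull data from S3 into Snowflake
    print("extracting...")


with DAG(
    dag_id="example_extract",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="extract", python_callable=_extract)
```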

DBT setup & development

Please see the DBT setup instructions for setting up DBT and performing common operations.

The DBT development guidelines describe the DBT development environment and common development operations.

Python Development

Requirements

Install dependencies

Run the following command:

poetry install

Adding dependencies

Additional dependencies can be declared in pyproject.toml; Poetry's documentation provides examples. Prefer using the Poetry CLI (e.g. `poetry add <package>`), as it also updates the poetry.lock file and pins any new dependency versions.

Note that a requirements.txt file still exists; it is being deprecated.

Configuration

Snowflake connections

Snowflake connection details are configured via environment variables. Each role requires a different set of variables, described in the table below:

| Role             | Environment variable          | Description                              |
|------------------|-------------------------------|------------------------------------------|
| SYSADMIN         | SNOWFLAKE_USER                | Username                                 |
|                  | SNOWFLAKE_PASSWORD            | Password                                 |
|                  | SNOWFLAKE_ACCOUNT             | Snowflake account to connect to          |
|                  | SNOWFLAKE_LOAD_DATABASE       | Database to load data to                 |
|                  | SNOWFLAKE_LOAD_WAREHOUSE      | Warehouse to load data to                |
| ANALYTICS_LOADER | SNOWFLAKE_LOAD_USER           | Username                                 |
|                  | SNOWFLAKE_LOAD_PASSWORD       | Password                                 |
|                  | SNOWFLAKE_ACCOUNT             | Snowflake account to connect to          |
|                  | SNOWFLAKE_TRANSFORM_DATABASE  | Database to load data to                 |
|                  | SNOWFLAKE_LOAD_WAREHOUSE      | Warehouse to load data to                |
| LOADER           | SNOWFLAKE_LOAD_USER           | Username                                 |
|                  | SNOWFLAKE_LOAD_PASSWORD       | Password                                 |
|                  | SNOWFLAKE_ACCOUNT             | Snowflake account to connect to          |
|                  | SNOWFLAKE_LOAD_DATABASE       | Database to load data to                 |
|                  | SNOWFLAKE_LOAD_WAREHOUSE      | Warehouse to load data to                |
| TRANSFORMER      | SNOWFLAKE_USER                | Username                                 |
|                  | SNOWFLAKE_PASSWORD            | Password                                 |
|                  | SNOWFLAKE_ACCOUNT             | Snowflake account to connect to          |
|                  | SNOWFLAKE_TRANSFORM_DATABASE  | Database to use for transforming data    |
|                  | SNOWFLAKE_TRANSFORM_WAREHOUSE | Warehouse to store transformed data to   |
| PERMISSIONS      | PERMISSION_BOT_USER           | Username                                 |
|                  | PERMISSION_BOT_PASSWORD       | Password                                 |
|                  | PERMISSION_BOT_ACCOUNT        | Snowflake account to connect to          |
|                  | PERMISSION_BOT_DATABASE       | Database to use for transforming data    |
|                  | PERMISSION_BOT_WAREHOUSE      | Warehouse to store transformed data to   |
|                  | RELEASE_LOCATION              | Location to load release data from       |
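
As a minimal sketch (assuming snowflake-connector-python), the TRANSFORMER role's variables from the table above map onto a connection as shown below; the helper function itself is hypothetical and not part of this repository.

```python
# A minimal sketch, assuming snowflake-connector-python. It maps the TRANSFORMER
# role's environment variables from the table above onto a connection; the helper
# itself is hypothetical and not part of this repository.
import os

import snowflake.connector


def connect_as_transformer():
    return snowflake.connector.connect(
        user=os.environ["SNOWFLAKE_USER"],
        password=os.environ["SNOWFLAKE_PASSWORD"],
        account=os.environ["SNOWFLAKE_ACCOUNT"],
        database=os.environ["SNOWFLAKE_TRANSFORM_DATABASE"],
        warehouse=os.environ["SNOWFLAKE_TRANSFORM_WAREHOUSE"],
    )
```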

Airflow

Please see the separate documentation file.

Notes

This product includes GeoLite2 data created by MaxMind, available from https://www.maxmind.com.