Disclaimer: This is not an official Google product. There is absolutely NO WARRANTY provided for using this code. The code is Apache Licensed and CAN BE fully modified, white labeled, and disassembled by your team.
This repository contains lightweight modules and tools for use with advanced data solutions. Specifically, it implements the Extract and Load parts of an ETL, allowing BigQuery to be used for the Transform. Imagine all the data in an API appearing as a table in BigQuery for you to query. And then imagine being able to write back to the API using a query so you can change settings.
When you use this repository, restful APIs are turned into tables in BigQuery. You can then write SQL logic to manipulate and analyze those tables to either present in a dashboard, or write back into another API call. See the Wiki for examples.
All Google APIs are supported; our team specifically uses this for:
See the Wiki for how to call each. In addition, reporting data helpers exist for:
git clone https://github.com/google/bqflow
python3 -m pip install -r requirements.txt
A workflow is a JSON file that contains API endpoints and parameters. See the Wiki for examples and details on workflows. You may also receive workflow JSON files from Google when collaborating on a project. The following command will show you how to run a workflow:
python3 bqflow/run.py -h
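For example, a hypothetical run of a single workflow file might look like the following; the file name and project ID are placeholders, and the -p and -v flags are assumptions you should confirm against the -h output above:
python3 bqflow/run.py workflows/example_workflow.json -p [PROJECT ID] -v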
To execute multiple workflows in parallel, use the following command:
python3 bqflow/schedule_local.py -h
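For example, assuming your workflows are saved in a local folder (the same directory argument the startup script further below uses), a run might look like this; confirm the exact arguments with the -h output above:
python3 bqflow/schedule_local.py workflows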
To execute workflows on a schedule within a VM, follow these instructions:
- Create a VM. These are recommended settings (an example gcloud command follows this list):
- Series: E2
- Machine Type: e2-highmem-2
- Boot Disk Size: 10GB is enough, all data is stored in memory.
- Boot Disk Image: Debian GNU/Linux 11 (bullseye) or higher
- Service Account: One you create (see below) or None, depending on setup.
- Firewall: Leave unchecked, there is no need for HTTP/HTTPS.
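As a sketch, the following gcloud command creates a VM with the recommended settings above; the VM name, project, zone, and service account email are placeholders, and --service-account can be omitted to create the VM with none:
gcloud compute instances create [VM NAME] --project=[PROJECT NAME] --zone=[ZONE] --machine-type=e2-highmem-2 --image-family=debian-11 --image-project=debian-cloud --boot-disk-size=10GB --service-account=[SERVICE CREDENTIAL EMAIL]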
- Log into the VM. The steps below are optional and only needed if you get a warning message about logging in:
- Make sure you have at least one VPC network.
- Enable an SSH/IAP firewall rule for that network (the rule below is browser SSH compatible):
gcloud compute --project=[PROJECT NAME] firewall-rules create allow-ingress-from-iap --direction=INGRESS --priority=1000 --network=default --action=ALLOW --rules=tcp:22,tcp:3389 --source-ranges=35.235.240.0/20
- Install BQFlow.
- Install Git:
sudo apt-get install git
- Install Pip:
sudo apt-get install python3-pip
- Install BQFlow:
git clone https://github.com/google/bqflow
- Install Requirements:
python3 -m pip install -r bqflow/requirements.txt
- Print These Instructions In VM:
python3 bqflow/schedule_local.py -h
- Create Workflow Directory And Add Workflows:
mkdir workflows
- Run Workflows Manually:
python3 bqflow/schedule_local.py
- Set up the startup script.
- Log out of the VM.
- Edit the VM and navigate to Management > Automation > Automation, and add:
#!/bin/bash
sudo -u [YOUR USERNAME] bash -c 'python3 ~/bqflow/schedule_local.py ~/workflows'
shutdown -h +1
Find [YOUR USERNAME] on the VM by running echo $USER.
- Set up the schedule tab.
NOTE: To prevent the VM from shutting down when you log in, you will have to comment out the startup logic, save, and then log in.
To execute the workflows on a schedule from Google Drive:
- Create a dedicated Service credential.
- Be sure to grant the service the IAM Role roles/bigquery.dataOwner and roles/bigquery.jobUser.
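As a sketch, these gcloud commands grant both roles; the project name and service credential email are placeholders:
gcloud projects add-iam-policy-binding [PROJECT NAME] --member='serviceAccount:[SERVICE CREDENTIAL EMAIL]' --role='roles/bigquery.dataOwner'
gcloud projects add-iam-policy-binding [PROJECT NAME] --member='serviceAccount:[SERVICE CREDENTIAL EMAIL]' --role='roles/bigquery.jobUser'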
- Create a VM, follow STEP 2 under VM Runner Script, and choose the above service credential.
- STOP the VM, do not delete it, just stop it.
- Assign SCOPES to the service account:
- At minimum you will need:
- https://www.googleapis.com/auth/drive
- https://www.googleapis.com/auth/bigquery
- For advertising products you should consider:
- https://www.googleapis.com/auth/doubleclickbidmanager
- https://www.googleapis.com/auth/doubleclicksearch
- https://www.googleapis.com/auth/analytics
- https://www.googleapis.com/auth/youtube
- https://www.googleapis.com/auth/display-video
- https://www.googleapis.com/auth/ddmconversions
- https://www.googleapis.com/auth/dfareporting
- https://www.googleapis.com/auth/dfatrafficking
- https://www.googleapis.com/auth/analytics.readonly
- https://www.googleapis.com/auth/adwords
- https://www.googleapis.com/auth/adsdatahub
- https://www.googleapis.com/auth/content
- https://www.googleapis.com/auth/cloud-vision
- To apply all the scopes run the following gcloud command from Cloud Shell:
gcloud beta compute instances set-scopes [VM NAME] --scopes='https://www.googleapis.com/auth/drive,https://www.googleapis.com/auth/bigquery,https://www.googleapis.com/auth/doubleclickbidmanager,https://www.googleapis.com/auth/doubleclicksearch,https://www.googleapis.com/auth/analytics,https://www.googleapis.com/auth/youtube,https://www.googleapis.com/auth/display-video,https://www.googleapis.com/auth/ddmconversions,https://www.googleapis.com/auth/dfareporting,https://www.googleapis.com/auth/dfatrafficking,https://www.googleapis.com/auth/analytics.readonly,https://www.googleapis.com/auth/adwords,https://www.googleapis.com/auth/adsdatahub,https://www.googleapis.com/auth/content,https://www.googleapis.com/auth/cloud-vision' --zone=[ZONE] --service-account=[SERVICE CREDENTIAL EMAIL]
- Set up the startup script.
- Edit the VM and navigate to Management > Automation > Automation, and add:
#!/bin/bash
sudo -u [YOUR USERNAME] bash -c 'python3 ~/bqflow/schedule_drive.py [DRIVE FOLDER LINK] -s DEFAULT -p [CLOUD PROJECT ID]'
shutdown -h +1
Find [YOUR USERNAME] on the VM by running echo $USER.
- Set up the schedule tab.
- Start adding workflows to your Drive folder and share it with the service email address from step one.
- For security reasons workflows have to be in [DRIVE FOLDER LINK].
- Edit JSON files from your machine using Google Drive For Desktop.
BQFlow can be run with either Service or User credentials. Service credentials are ideal for most workflows; however, you have the option to use either. Please follow Google Cloud Security Best Practices when handling credentials.
- For Service you have 2 options:
- Keyless: provision credentials and assign them to the VM; a key is never downloaded, but all workflows must run as this service.
- JSON: download the service keys to the VM (or equivalent) and use them in combination with specific workflows.
- Be sure to grant the service the IAM Roles roles/bigquery.dataOwner and roles/bigquery.jobUser.
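As a sketch of both options, the gcloud commands below create a dedicated service account and, for the JSON option only, download a key; the account name, key file name, and project are placeholders. For the keyless option, skip the key download and attach the account to the VM instead (see the set-scopes command above):
gcloud iam service-accounts create bqflow-runner --display-name='BQFlow Runner'
gcloud iam service-accounts keys create bqflow_service.json --iam-account=bqflow-runner@[PROJECT NAME].iam.gserviceaccount.com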
- For User:
- Run the following command and follow the instructions:
python3 bqflow/auth.py -h
- For debugging add the --verbose or -v parameter to any of the commands.
- For production add a log configuration to each workflow file. Change the disposition to WRITE_TRUNCATE to replace the log table each time.
Logs are written after each workflow completes. The log table can be included in queries to ensure dashboards or API calls are up to date.
{ "log":{ "bigquery":{ "auth":"service", "dataset":"some_dataset", "table":"BQFlow_Log", "disposition":"WRITE_APPEND" }}, "tasks":[...] }
Why does this exist?
- Enables Google gTech to deliver solutions with 90% less code to maintain.
- Gives you the ability to clone your own version and own the code.
- Eliminates hundreds of custom connectors and maintenance.
- Moves all solution logic to SQL, which aligns with data scientists.
Why BigQuery?
- More accessible; SQL is easier to learn and use than Python.
- Supports nested JSON structures required by most APIs.
- Has hundreds of functions for manipulating data.
- Allows combining of tables (API endpoints).
- Can be connected to dashboards.
Does it have to run on a VM?
- No, it's just Python; you can run it anywhere, including local machines and cloud functions.
- We chose a VM because there are no time limits, so workflows can run for hours if necessary.
Why RESTful APIs?
- Well documented endpoints for each product.
- Consistent and universal across all products, no client library differences.
- Less to maintain; yes, BQFlow only has 1 connector for ALL Google APIs.
Is it only Google APIs?
- No, any API handler can be added, but our use case is Google.
- More details on how to extend in the Wiki.
Is it cloud agnostic?
- Yes, it's just Python code; you can run it from anywhere in any cloud.
- No, it writes to and from BigQuery, which is a Google Cloud product.
Is it a framework?
- No, it's a Python function for making API calls to and from BigQuery.
- Yes, there is a sample VM startup script you can use to run multiple jobs.