Read/write data from/to Google BigQuery with Dask.
This package uses the BigQuery Storage API. Please refer to the data extraction pricing table for associated costs while using Dask-BigQuery.
dask-bigquery can be installed with pip:

pip install dask-bigquery

or with conda:

conda install -c conda-forge dask-bigquery
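A quick way to confirm the installation is to import the package in a Python session. The __version__ attribute used below is an assumption based on common packaging conventions; a plain import is enough if it is not available:

import dask_bigquery

# a successful import confirms the package is available in this environment
print(dask_bigquery.__version__)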
For reading from BigQuery, you need the following roles to be enabled on the account:
BigQuery Read Session User
BigQuery Data Viewer, BigQuery Data Editor, or BigQuery Data Owner
Alternatively, BigQuery Admin would give you full access to sessions and data.
For writing to BigQuery, the following roles are sufficient:
BigQuery Data Editor
Storage Object Creator
The minimal set of roles that covers both reading and writing:
BigQuery Data Editor
BigQuery Read Session User
Storage Object Creator
By default, dask-bigquery will use the Application Default Credentials. When running code locally, you can set this to use your user credentials by running
$ gcloud auth application-default login
User credentials require interactive login. For settings where this isn't possible, you'll need to create a service account. You can point the Application Default Credentials at the service account key using the GOOGLE_APPLICATION_CREDENTIALS environment variable:
$ export GOOGLE_APPLICATION_CREDENTIALS=/home/<username>/google.json
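To check which credentials the Application Default Credentials mechanism will resolve to, you can use the google-auth library directly (a dependency of the Google Cloud client libraries, not part of dask-bigquery); a minimal sketch:

import google.auth

# resolves the Application Default Credentials the same way the BigQuery clients do:
# GOOGLE_APPLICATION_CREDENTIALS if set, otherwise your gcloud user credentials
# or the environment's default service account
credentials, project_id = google.auth.default()
print(project_id)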
For information on obtaining the credentials, see the Google API documentation.
dask-bigquery assumes that you are already authenticated.
import dask_bigquery

ddf = dask_bigquery.read_gbq(
    project_id="your_project_id",
    dataset_id="your_dataset",
    table_id="your_table",
)

ddf.head()
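read_gbq also accepts optional arguments to limit what is read. The row_filter and columns parameters shown below exist in recent dask-bigquery releases, but treat this as a sketch and check the read_gbq docstring of your installed version; the column names are placeholders:

ddf = dask_bigquery.read_gbq(
    project_id="your_project_id",
    dataset_id="your_dataset",
    table_id="your_table",
    row_filter="numeric_col > 0",     # filter applied server-side by the Storage API
    columns=["name", "numeric_col"],  # read only the columns you need
)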
Assuming that the client and workers are already provisioned with default credentials:
import dask
import dask_bigquery

ddf = dask.datasets.timeseries(freq="1min")

res = dask_bigquery.to_gbq(
    ddf,
    project_id="my_project_id",
    dataset_id="my_dataset_id",
    table_id="my_table_name",
)
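As a quick sanity check, you can read the freshly written table back with read_gbq, reusing the identifiers from the call above:

ddf_back = dask_bigquery.read_gbq(
    project_id="my_project_id",
    dataset_id="my_dataset_id",
    table_id="my_table_name",
)
ddf_back.head()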
Before loading data into BigQuery, to_gbq writes intermediary Parquet files to a Google Cloud Storage bucket. The default bucket name is <your_project_id>-dask-bigquery. You can provide a different bucket name by setting the parameter bucket="my-gs-bucket". After the job is done, the intermediary data is deleted.
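For example, to stage the intermediary files in a bucket you manage yourself (the bucket name below is a placeholder):

res = dask_bigquery.to_gbq(
    ddf,
    project_id="my_project_id",
    dataset_id="my_dataset_id",
    table_id="my_table_name",
    bucket="my-gs-bucket",  # existing Google Cloud Storage bucket for the intermediary Parquet files
)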
You can also pass service account credentials explicitly through the credentials parameter:

# service account credentials
creds_dict = {"type": ..., "project_id": ..., "private_key_id": ...}

res = dask_bigquery.to_gbq(
    ddf,
    project_id="my_project_id",
    dataset_id="my_dataset_id",
    table_id="my_table_name",
    credentials=creds_dict,
)
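Rather than writing the dictionary out by hand, you can load it from the service account key file; this sketch assumes the key lives at the path used in the GOOGLE_APPLICATION_CREDENTIALS example above:

import json

# load the service account key downloaded from the Google Cloud console
with open("/home/<username>/google.json") as f:
    creds_dict = json.load(f)

The resulting dictionary can then be passed as the credentials argument exactly as in the snippet above.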
To run the tests locally, you need to be authenticated and have a project created on that account. If you're using a service account, you need to select the "BigQuery Admin" role in the "Grant this service account access to project" section when creating it.
You can run the tests with

$ pytest dask_bigquery

if your default gcloud project is set, or manually specify the project ID with

$ DASK_BIGQUERY_PROJECT_ID=<project_id> pytest dask_bigquery
This project stems from the discussion in this Dask issue and this initial implementation developed by Brett Naul, Jacob Hayes, and Steven Soojin Kim.