Tuesday, June 30, 2026

Priorities

Work Log


Export BigQuery results into Google Drive — DA-1348

Situation

Yesterday, I already created an exploration document about exporting BigQuery query results into Google Drive. Today, I planned to implement the exploration document by creating an example of the integration within the Airflow.

Tasks

Create an Airflow DAG to export BigQuery query results into Google Drive.

Actions

Step 1. Create an Image and push into Artifact Registry.

rclone.conf
[gdrive]
type = drive
scope = drive
service_account_file = <service-account-json>

; https://drive.google.com/drive/folders/<root-folder-id>
root_folder_id = <root-folder-id>
Dockerfile
FROM google/cloud-sdk:alpine

COPY --from=rclone/rclone:latest /usr/local/bin/rclone /usr/local/bin/rclone
RUN apk add --no-cache bash

WORKDIR /app
COPY sync.sh .
COPY rclone.conf .
RUN chmod +x sync.sh
sync.sh
#!/bin/bash

GCS_BUCKET="gs://${BUCKET_ID}"
DRIVE_REMOTE="gdrive"
GCS_GLOB="${GCS_BUCKET}/${USER_ID}/${QUERY_ID}/${QUERY_ID}_*.csv"
QUERY=$(cat "${QUERY_PATH}")

gcloud auth activate-service-account --key-file="${GOOGLE_APPLICATION_CREDENTIALS}"
bq query --use_legacy_sql=false \
"EXPORT DATA OPTIONS (
   uri      = '${GCS_GLOB}',
   format   = 'CSV',
   header   = true,
   overwrite = true
 ) AS
 ${QUERY}"

# If the file size is greater than 100 MB, throw an error.
# This is to ensure the Airflow VM Instance keep running.
MAX_BYTES=$((100 * 1024 * 1024))
TOTAL_BYTES=$(gcloud storage du -s "${GCS_GLOB}" | awk '{print $1}')
if [ "${TOTAL_BYTES}" -gt "${MAX_BYTES}" ]; then
  echo "Total size ${TOTAL_BYTES} bytes exceeds the ${MAX_BYTES} bytes (100 MB) limit." >&2
  exit 1
fi

TMP="$(mktemp -d)"
trap 'rm -rf "$TMP"' EXIT
gcloud storage cp "${GCS_GLOB}" "${TMP}/"

rclone config show
rclone sync "${TMP}" "${DRIVE_REMOTE}:${USER_ID}/${QUERY_ID}"


Step 2. Create an Airflow DAG, and use the created Image.

...

t1 = DockerOperator(
    task_id='run_task_1',
    image='bigquery_to_gdrive:0.0.9',
    mounts=[
        Mount(source='/secrets', target='/secrets', type='bind'),
        Mount(source='/scripts', target='/scripts', type='bind'),
    ],
    api_version='auto',
    auto_remove=True,
    command="sh /app/sync.sh ",
    docker_url="unix://var/run/docker.sock",
    network_mode="bridge",
    dag=dag,
    environment = {
        'GOOGLE_APPLICATION_CREDENTIALS':'<service-account-json>',
        'USER_ID':'<user-id>',
        'QUERY_ID':'<query-id>',
        'QUERY_PATH':'<query-path>',
        'BUCKET_ID':'<bucket-id>',
        'RCLONE_CONFIG':'<rclone-config>',
    },
    xcom_all=True,
    tty=True,
)

t1

Result

Currently, We have a workflow for exporting the BigQuery query results into Google Drive. Furthermore, I already created a playbook for this workflow so that the data stakeholders can create the workflow by themselves.


Blockers

N/A

Carry-overs

Reflection

N/A