Friday, June 19, 2026

Priorities

Resolve the Insider ingestion pipeline issue — DA-1349

Work Log

Insider Ingestion Pipeline — The export file is too large — DA-1349

Situation

The Insider ingestion pipeline failed because the export file was larger than 1 GB, which exceeds the maximum BigQuery allows for a single export file. I need to fix this data as soon as possible because the data science team needs the Insider data to cut its dashboard queries over from Mixpanel to Insider.

Task

Resolve the issue and backfill yesterday's data so that it is available in BigQuery.

Action

I modified the script to put a wildcard in the destination URI so that BigQuery splits the export into multiple files, then re-ran the pipeline for yesterday's data. Details below:

Before

json_filename=$(echo "$filename" | sed -e "s/.csv/.json/g")
echo "Extracting to JSON: gs://${BUCKET}/temp/${json_filename}"
bq extract \
--location=asia-southeast2 \
--destination_format NEWLINE_DELIMITED_JSON \
"${TEMP_DATASET_ID}.${TEMP_TABLE_ID}" \
gs://"$BUCKET"/temp/"$json_filename"

After

json_filename=$(echo "$filename" | sed -e "s/.csv/-*.json/g")
echo "Extracting to JSON: gs://${BUCKET}/temp/${json_filename}"
bq extract \
--location=asia-southeast2 \
--destination_format NEWLINE_DELIMITED_JSON \
"${TEMP_DATASET_ID}.${TEMP_TABLE_ID}" \
gs://"$BUCKET"/temp/"$json_filename"
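
A note for future me: with a single `*` in the destination URI, BigQuery replaces the wildcard with a zero-padded 12-digit shard number (starting at 000000000000), so each output file stays under the size limit. The `.` in the sed pattern above matches any character, so a slightly stricter variant could anchor the match to the extension. The sketch below is only an illustration: `to_json_pattern` is a hypothetical helper name and the filename is a made-up example, neither is in the actual script.

```shell
#!/bin/sh
# Hypothetical helper (not part of the real script): turn "name.csv" into
# the wildcard pattern "name-*.json" used as the bq extract destination.
to_json_pattern() {
  # \. matches a literal dot and $ anchors the match to the end of the
  # name, so only the trailing ".csv" extension is rewritten.
  printf '%s\n' "$1" | sed -e 's/\.csv$/-*.json/'
}

# Made-up filename, for illustration only.
to_json_pattern "insider_2026-06-18.csv"
# prints: insider_2026-06-18-*.json
```

The result has to stay quoted when it is passed to `bq extract`, as it already is in the script, so that the shell does not try to expand the `*` itself.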

Result

As a result, yesterday's data is now available in BigQuery, and the data science team can continue the cutover initiative.

Blockers

N/A

Carry-overs

Create a live demo for LLM Inference

Reflection

I couldn't sleep last night for no apparent reason, and my beloved team recommended magnesium bisglycinate to help me sleep tonight. I tried to fall asleep at 8:30 PM but couldn't, and I swear I didn't have a single thing on my mind. This has happened on several days this month. I'd like to thank my manager for letting me work from home today.