Monday, June 22, 2026

Priorities

Work Log


GitLab to GitHub Migration — DA-1346, DA-1343

Situation

I am continuing the migration of our repositories from GitLab to GitHub. (details)

Today, I planned to migrate two repositories and their CI/CD pipelines from GitLab to GitHub.

Task

Migrate the following GitLab repositories to GitHub:
1. gitlab warehouseflow
2. gitlab etl-batch

Action

I migrated each repository with the following commands.

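# Run once per repository, replacing <repo-name>.
git clone --mirror git@gitlab.com:allofresh/data/<repo-name>.git
cd <repo-name>.git
# The mirror clone already has a remote named "origin" (GitLab), so add GitHub under its own name.
git remote add github git@github.com:allofresh/<repo-name>.git
# Push all branches, then all tags.
git push github --force --all
git push github --force --tags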

After that, I manually converted each repository's CI/CD pipeline from GitLab CI/CD to GitHub Actions.

Result

As a result, the following repositories and their CI/CD pipelines have been migrated to GitHub:
1. github warehouseflow
2. github etl-batch


Insider Data Pipeline — Data Duplication Issue

Situation

The Insider team informed us that their integration system produced duplicated data: it sent the CSV files to our cloud storage bucket twice. Only CSV files whose names contain the 2531 code are impacted, and the duplicated copies all have exactly the same size. If duplicates are present, I can deduplicate the data by the session_id, event_name, user_id, and timestamp fields.
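
The deduplication by those four fields could be sketched roughly as follows. This is only a minimal illustration, not the pipeline's actual code: the function name deduplicate_rows, the sample rows, and the extra value column are hypothetical, and it assumes that the first occurrence of each key is the one to keep.

```python
import csv
import io

# Key fields named in this log entry; rows are duplicates when all four match.
KEY_FIELDS = ("session_id", "event_name", "user_id", "timestamp")


def deduplicate_rows(rows, key_fields=KEY_FIELDS):
    """Return rows with only the first occurrence of each key, in input order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Hypothetical sample standing in for one Insider CSV file (made-up values).
sample_csv = """session_id,event_name,user_id,timestamp,value
s1,page_view,u1,2026-06-04T10:00:00Z,a
s1,page_view,u1,2026-06-04T10:00:00Z,a
s2,purchase,u2,2026-06-04T11:00:00Z,b
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))
unique_rows = deduplicate_rows(rows)
print(f"{len(rows)} rows read, {len(unique_rows)} rows kept")
```

Keeping the first occurrence preserves the original row order, which makes the output easy to compare against the source file.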

Task

Identify any duplicated files in the cloud storage bucket.

Action

I manually checked for duplicated files by eyeballing them, since I only needed to check the partitions from 4/6/26 to 22/6/26.
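
For a larger date range, the same check might be automated along these lines. This is a sketch under stated assumptions: it takes a plain list of (path, size) pairs rather than calling any storage API, the paths, partition naming, and sizes are made up, and it assumes a duplicate lands in the same partition as the original. Equal size only flags a candidate; the contents would still need to be compared.

```python
from collections import defaultdict


def find_same_size_groups(objects, code="2531"):
    """Group files whose name contains `code` by (partition, size).

    `objects` is an iterable of (path, size_in_bytes) pairs. Only groups
    with more than one file are returned, i.e. candidate duplicates.
    """
    groups = defaultdict(list)
    for path, size in objects:
        # Treat everything before the last "/" as the partition prefix.
        partition, _, name = path.rpartition("/")
        if code in name:
            groups[(partition, size)].append(path)
    return {key: sorted(paths) for key, paths in groups.items() if len(paths) > 1}


# Hypothetical bucket listing; paths and sizes are illustrative only.
listing = [
    ("insider/dt=2026-06-04/events_2531_a.csv", 1024),
    ("insider/dt=2026-06-04/events_2531_b.csv", 1024),
    ("insider/dt=2026-06-04/events_1000_a.csv", 1024),
    ("insider/dt=2026-06-05/events_2531_a.csv", 2048),
]

candidates = find_same_size_groups(listing)
for (partition, size), paths in candidates.items():
    print(partition, size, paths)
```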

Result

Fortunately, there are no duplicated files, so I don't need to run a deduplication process.


Blockers

N/A

Carry-overs

Reflection

N/A