Thursday, June 18, 2026

Priorities

Migrate the Git repository and its CI/CD pipeline from GitLab to GitHub — DA-1270
Create a live demo for LLM Inference

Work Log

LLM Inference — Embedding Service

Situation

I planned to create a live LLM inference demo to showcase my practical MLOps skills. This section is limited to standing up an embedding service that vectorizes input text. The service's output can be stored in a vector database and later used as a retrieval source for downstream tasks.
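
To make the retrieval idea concrete, here is a minimal sketch of how stored vectors could be ranked against a query vector using cosine similarity. The in-memory dictionary stands in for a vector database, and the 3-dimensional vectors are made-up toy data (all-MiniLM-L6-v2 produces 384-dimensional vectors); the function names are hypothetical and not part of this project.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, store, k=1):
    """Return the k (text, score) pairs most similar to query_vec."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in store.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy vectors standing in for real embeddings (hypothetical data).
store = {
    "cats purr": [0.9, 0.1, 0.0],
    "dogs bark": [0.1, 0.9, 0.0],
}
best = top_k([0.8, 0.2, 0.0], store, k=1)
print(best[0][0])
```

A real vector database would replace the linear scan with an index, but the ranking idea is the same.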

Task

Expose an embedding service that is publicly accessible.

Action

I created Kubernetes manifests for a Namespace, Service, Certificate, Deployment, two Traefik Middlewares, and an IngressRoute, then applied them with the kubectl apply command. The manifest details follow:

# 1. Create a namespace
apiVersion: v1
kind: Namespace
metadata:
  name: embed-server

---

# 2. Create a service
apiVersion: v1
kind: Service
metadata:
  name: server
  namespace: embed-server
spec:
  selector:
    app: server
  ports:
    - port: 80
      targetPort: 8080

---

# 3. Create a certificate
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: ilyasahsan-xyz
  namespace: embed-server
spec:
  secretName: ilyasahsan-xyz-tls
  dnsNames:
    - ilyasahsan.xyz
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer

---

# 4. Create a deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: server
  namespace: embed-server
spec:
  selector:
    matchLabels:
      app: server
  template:
    metadata:
      labels:
        app: server
    spec:
      containers:
        - name: server
          image: ghcr.io/ggml-org/llama.cpp:server
          ports:
            - containerPort: 8080
          env:
            - name: LLAMA_ARG_MODEL_URL
              value: https://huggingface.co/ilyasahsan/GGUF/resolve/main/all-MiniLM-L6-v2-f16.gguf
            - name: LLAMA_ARG_EMBEDDINGS
              value: "1"
            - name: LLAMA_ARG_CTX_SIZE
              value: "256"
            - name: LLAMA_ARG_UI
              value: "0"
          resources:
            requests:
              cpu: "200m"
              memory: "200Mi"
            limits:
              cpu: "250m"
              memory: "256Mi"

---

# 5. Create the ingress
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: embed-server-ratelimit
  namespace: embed-server
spec:
  rateLimit:
    average: 10
    burst: 2
    period: 5m
---
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: embed-server-stripprefix
  namespace: embed-server
spec:
  stripPrefix:
    prefixes:
      - /embed-server
---
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: embed-server
  namespace: embed-server
spec:
  entryPoints:
    - websecure
  routes:
    - match: Host(`ilyasahsan.xyz`) && PathPrefix(`/embed-server`)
      kind: Rule
      middlewares:
        - name: embed-server-stripprefix
        - name: embed-server-ratelimit
          namespace: embed-server
      services:
        - name: server
          port: 80
  tls:
    secretName: ilyasahsan-xyz-tls
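
As a quick sanity check on the rate-limit middleware above: as far as I understand Traefik's RateLimit middleware, the sustained rate is average divided by period, applied per request source, with burst capping how many requests may pass at once. The sketch below only does that arithmetic; the helper names are hypothetical and the duration parser handles just the simple "<number><unit>" form used in the manifest.

```python
def parse_period_seconds(period):
    """Convert a simple duration such as '5m', '30s', or '1h' to seconds."""
    units = {"s": 1, "m": 60, "h": 3600}
    return int(period[:-1]) * units[period[-1]]

def sustained_rate(average, period):
    """Requests per second implied by average / period."""
    return average / parse_period_seconds(period)

# Values taken from the embed-server-ratelimit manifest.
rate = sustained_rate(average=10, period="5m")
print(f"{rate:.4f} requests/second")
print(f"about one request every {1 / rate:.0f} seconds")
```

With these settings the limit is strict enough for a demo: 10 requests per 5 minutes, with at most 2 in a burst.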

Result

The embedding service is now publicly accessible and can be tried with the following commands:

Health check

curl https://ilyasahsan.xyz/embed-server/health
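
The public path includes /embed-server, while the container serves its endpoints at the root. The strip-prefix middleware bridges the two. The following is a simplified model of that rewrite, not Traefik's actual implementation; the function name is hypothetical.

```python
def strip_prefix(path, prefix="/embed-server"):
    """Remove a leading prefix from a path, as a strip-prefix rewrite would."""
    if path == prefix or path.startswith(prefix + "/"):
        stripped = path[len(prefix):]
        # An empty remainder is normalized to the root path.
        return stripped or "/"
    return path

print(strip_prefix("/embed-server/health"))
print(strip_prefix("/embed-server/v1/embeddings"))
```

Under this model, the health check above would reach the backend as /health.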

Embeddings — Single input

curl -X POST 'https://ilyasahsan.xyz/embed-server/v1/embeddings' \
-H "Content-Type: application/json" \
-d '{"input":"hello world","model":"all-MiniLM-L6-v2"}'

Embeddings — Batch input

curl -X POST 'https://ilyasahsan.xyz/embed-server/v1/embeddings' \
-H "Content-Type: application/json" \
-d '{"input":["first sentence","second sentence","third sentence"]}'
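
For scripted use, the same batch call can be sketched in Python with only the standard library. This assumes the endpoint returns an OpenAI-style body, that is, a "data" list whose items carry "index" and "embedding" fields; the helper names are hypothetical, and the sample response at the bottom is hand-written rather than real output.

```python
import json
import urllib.request

BASE_URL = "https://ilyasahsan.xyz/embed-server"  # from the curl examples

def build_request(texts):
    """Build a POST request for /v1/embeddings with a list of input strings."""
    body = json.dumps({"input": texts}).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + "/v1/embeddings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def extract_vectors(response_json):
    """Return the embedding vectors ordered by their 'index' field."""
    items = sorted(response_json["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]

def embed(texts):
    """Send the request and return one vector per input (needs network access)."""
    with urllib.request.urlopen(build_request(texts), timeout=30) as resp:
        return extract_vectors(json.load(resp))

# Offline demonstration with a hand-written response in the assumed shape.
sample = {"data": [{"index": 1, "embedding": [0.3, 0.4]},
                   {"index": 0, "embedding": [0.1, 0.2]}]}
print(extract_vectors(sample))
```

Calling embed(["first sentence", "second sentence"]) would hit the live endpoint, so it is subject to the rate limit configured in the manifest.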

Tokenize

curl -X POST 'https://ilyasahsan.xyz/embed-server/tokenize' \
-H "Content-Type: application/json" \
-d '{"content": "andrew kelley"}'

Detokenize

curl -X POST 'https://ilyasahsan.xyz/embed-server/detokenize' \
-H "Content-Type: application/json" \
-d '{"tokens": [4080,19543]}'

GitLab to GitHub — Config Manifest — DA-1270

TL;DR: This task is finished. As a result, our operations for enabling tables in BigQuery have been moved to GitHub.

The details of this log are on this page.

Blockers

N/A

Carry-overs

Create a live demo for LLM Inference

Reflection

N/A