Thursday, June 18, 2026
Priorities
- Migrate the Git repository and its CI/CD pipeline from GitLab to GitHub — DA-1270
- Create a live demo for LLM Inference
Work Log
LLM Inference — Embedding Service
Situation: I planned to create an LLM inference live demo to showcase my practical MLOps skills. The scope of this entry is limited to standing up an embedding service that vectorizes its input. The output of this service can be stored in a vector database and later used as a retrieval source in downstream tasks.
Task: Deploy an embedding service that is accessible to everyone.
Action:
I created Kubernetes manifests for the namespace, service, certificate, deployment, and ingress (two Traefik middlewares and an IngressRoute).
After that, I applied them using the kubectl apply command.
The manifests are listed below:
# 1. Create a namespace
apiVersion: v1
kind: Namespace
metadata:
  name: embed-server
---
# 2. Create a service
apiVersion: v1
kind: Service
metadata:
  name: server
  namespace: embed-server
spec:
  selector:
    app: server
  ports:
    - port: 80
      targetPort: 8080
---
# 3. Create a certificate
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: ilyasahsan-xyz
  namespace: embed-server
spec:
  secretName: ilyasahsan-xyz-tls
  dnsNames:
    - ilyasahsan.xyz
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
---
# 4. Create a deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: server
  namespace: embed-server
spec:
  selector:
    matchLabels:
      app: server
  template:
    metadata:
      labels:
        app: server
    spec:
      containers:
        - name: server
          image: ghcr.io/ggml-org/llama.cpp:server
          ports:
            - containerPort: 8080
          env:
            - name: LLAMA_ARG_MODEL_URL
              value: https://huggingface.co/ilyasahsan/GGUF/resolve/main/all-MiniLM-L6-v2-f16.gguf
            - name: LLAMA_ARG_EMBEDDINGS
              value: "1"
            - name: LLAMA_ARG_CTX_SIZE
              value: "256"
            - name: LLAMA_ARG_UI
              value: "0"
          resources:
            requests:
              cpu: "200m"
              memory: "200Mi"
            limits:
              cpu: "250m"
              memory: "256Mi"
---
# 5. Create the ingress (Traefik middlewares and IngressRoute)
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: embed-server-ratelimit
  namespace: embed-server
spec:
  rateLimit:
    average: 10
    burst: 2
    period: 5m
---
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: embed-server-stripprefix
  namespace: embed-server
spec:
  stripPrefix:
    prefixes:
      - /embed-server
---
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: embed-server
  namespace: embed-server
spec:
  entryPoints:
    - websecure
  routes:
    - match: Host(`ilyasahsan.xyz`) && PathPrefix(`/embed-server`)
      kind: Rule
      middlewares:
        - name: embed-server-stripprefix
        - name: embed-server-ratelimit
          namespace: embed-server
      services:
        - name: server
          port: 80
  tls:
    secretName: ilyasahsan-xyz-tls
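Note on the rate limit: in Traefik's rateLimit middleware the sustained rate is average divided by period, so average: 10 with period: 5m works out to one request every 30 seconds per source, with burst: 2 allowing two requests back to back. The sketch below models that with a simplified token bucket. The TokenBucket class is a hypothetical illustration of my own and only approximates how the real limiter behaves; the three constants are taken from the manifest.

```python
from dataclasses import dataclass

# Values copied from the embed-server-ratelimit middleware.
AVERAGE = 10          # requests allowed ...
PERIOD_SECONDS = 300  # ... per 5 minutes
BURST = 2             # requests allowed back to back

# Sustained rate is average / period, i.e. one token every 30 seconds.
SECONDS_PER_TOKEN = PERIOD_SECONDS / AVERAGE


@dataclass
class TokenBucket:
    """Simplified token bucket (illustrative only, not Traefik's code)."""
    seconds_per_token: float  # time needed to earn one token
    capacity: float           # maximum tokens held (the burst size)
    tokens: float             # tokens currently available
    last: float = 0.0         # timestamp of the previous request

    def allow(self, now: float) -> bool:
        # Refill for the elapsed time, capped at the bucket capacity.
        earned = (now - self.last) / self.seconds_per_token
        self.tokens = min(self.capacity, self.tokens + earned)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


bucket = TokenBucket(SECONDS_PER_TOKEN, capacity=BURST, tokens=BURST)
# Four requests at t=0 s, then one more at t=30 s.
decisions = [bucket.allow(t) for t in (0, 0, 0, 0, 30)]
print(decisions)  # [True, True, False, False, True]
```

In practice this means anyone trying the demo should expect rejected requests (typically HTTP 429) after the first two calls in quick succession.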
Result:
The embedding service is now publicly accessible and can be tried with the following commands:
Health check
curl https://ilyasahsan.xyz/embed-server/health
Embeddings — Single input
curl -X POST 'https://ilyasahsan.xyz/embed-server/v1/embeddings' \
-H "Content-Type: application/json" \
-d '{"input":"hello world","model":"all-MiniLM-L6-v2"}'
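The same call can be made from Python with only the standard library. This is a minimal sketch, assuming the endpoint returns an OpenAI-style body of the form {"data": [{"index": 0, "embedding": [...]}]}; the helper names build_request, parse_embeddings, and embed are hypothetical and not part of any library.

```python
import json
import urllib.request

BASE_URL = "https://ilyasahsan.xyz/embed-server"  # endpoint from this log


def build_request(texts, base_url=BASE_URL):
    """Build the POST request; `texts` is one string or a list of strings."""
    body = json.dumps({"input": texts}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/embeddings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_embeddings(payload):
    """Return the vectors in input order (assumes an OpenAI-style response)."""
    items = sorted(payload["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


def embed(texts, timeout=30):
    """Call the live service. Note that the ingress rate limit applies."""
    with urllib.request.urlopen(build_request(texts), timeout=timeout) as resp:
        return parse_embeddings(json.load(resp))


# Example (performs a real HTTP request, so it is left commented out):
# vectors = embed("hello world")
# print(len(vectors[0]))
```

Passing a list instead of a string to embed sends the batch form of the request.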
Embeddings — Batch input
curl -X POST 'https://ilyasahsan.xyz/embed-server/v1/embeddings' \
-H "Content-Type: application/json" \
-d '{"input":["first sentence","second sentence","third sentence"]}'
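To connect the batch output to the retrieval use case from the Situation, here is a rough sketch of ranking stored vectors against a query by cosine similarity. The two-dimensional vectors are toy values chosen for illustration, not real model output, and the in-memory list stands in for a vector database; cosine_similarity and rank are hypothetical helper names.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


def rank(query_vec, doc_vecs, docs):
    """Return (score, document) pairs, most similar first."""
    scored = [(cosine_similarity(query_vec, vec), doc)
              for vec, doc in zip(doc_vecs, docs)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


docs = ["first sentence", "second sentence", "third sentence"]
toy_vectors = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]  # illustrative only
query_vector = [1.0, 0.0]                            # illustrative only

for score, doc in rank(query_vector, toy_vectors, docs):
    print(f"{score:.2f}  {doc}")
```

With real embeddings, the query would be sent through the same endpoint as the documents so that both live in the same vector space.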
Tokenize
curl -X POST 'https://ilyasahsan.xyz/embed-server/tokenize' \
-H "Content-Type: application/json" \
-d '{"content": "andrew kelley"}'
Detokenize
curl -X POST 'https://ilyasahsan.xyz/embed-server/detokenize' \
-H "Content-Type: application/json" \
-d '{"tokens": [4080,19543]}'
GitLab to GitHub — Config Manifest — DA-1270
TL;DR: This task is finished. As a result, our operations for enabling tables in BigQuery have been moved to GitHub.
The details of this log are on this page.
Blockers
N/A
Carry-overs
- Create a live demo for LLM Inference
Reflection
N/A