Sunday, June 21, 2026
Priorities
- Deploy a vector database and a language model server for the LLM inference live demo
Work Log
LLM Inference — Vector Database — ChromaDB
Situation
I plan to build an LLM inference live demo, and I have already built an embedding service (details: here). The demo will follow the RAG (retrieval-augmented generation) pattern, which uses a vector database to store and retrieve relevant content.
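To make the store-and-retrieve idea concrete, here is a minimal sketch of what the vector database does for RAG: rank stored texts by cosine similarity to a query vector, then place the best matches into the prompt. The 3-dimensional vectors and the helper names (`retrieve`, `build_prompt`) are hypothetical and exist only for illustration; a real setup would get its vectors from the embedding service.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_vector, store, top_k=2):
    """Return the top_k (id, text) pairs most similar to the query vector."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vector, item["embedding"]),
        reverse=True,
    )
    return [(item["id"], item["text"]) for item in ranked[:top_k]]


def build_prompt(question, hits):
    """Put the retrieved texts in front of the user question."""
    context = "\n".join(f"- {text}" for _, text in hits)
    return f"Context:\n{context}\n\nQuestion: {question}"


# Hypothetical toy embeddings; real ones have hundreds of dimensions.
STORE = [
    {"id": "zig-1", "text": "Zig is a fast programming language.", "embedding": [0.9, 0.1, 0.0]},
    {"id": "rag-1", "text": "RAG helps the model use real data.", "embedding": [0.1, 0.9, 0.1]},
    {"id": "k8s-1", "text": "ChromaDB is small and easy to run.", "embedding": [0.0, 0.3, 0.9]},
]

hits = retrieve([0.2, 0.8, 0.1], STORE, top_k=2)
prompt = build_prompt("What does RAG do?", hits)
print(prompt)
```

In the demo, the vector database takes over the storage and ranking parts, and the language model receives the assembled prompt.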
Task
Deploy a lightweight vector database in my Kubernetes cluster to prove the live demo works.
Action
I deployed a vector database service based on ChromaDB. See the Kubernetes manifest files below for details:
namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: chroma
pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: chroma-data
  namespace: chroma
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Mi
service.yaml
apiVersion: v1
kind: Service
metadata:
  name: chroma
  namespace: chroma
spec:
  selector:
    app: chroma
  ports:
    - name: http
      port: 8000
      targetPort: 8000
deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: chroma
  namespace: chroma
spec:
  replicas: 1
  strategy:
    type: Recreate
  selector:
    matchLabels:
      app: chroma
  template:
    metadata:
      labels:
        app: chroma
    spec:
      enableServiceLinks: false
      containers:
        - name: chroma
          image: chromadb/chroma:1.5.9
          ports:
            - containerPort: 8000
          volumeMounts:
            - name: data
              mountPath: /data
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: chroma-data
After that, apply the manifest files to the Kubernetes cluster using the command below:
kubectl apply -f <path-to-manifest-directory>
Result
As a result, the vector database is now available in my Kubernetes cluster. Follow the steps below to test it.
Step 1. Use Port Forwarding to access the vector database:
kubectl port-forward <chroma-pod-name> 8000:8000 -n chroma
Step 2. Create a Python script to validate the vector database:
import chromadb

client = chromadb.HttpClient(host="localhost", port=8000)
collection = client.get_or_create_collection(name="my_collection")

collection.add(
    ids=["zig-1", "zig-2", "zig-3", "rag-1", "rag-2", "k8s-1"],
    documents=[
        "Zig is a fast programming language.",
        "Zig keeps the code clear and simple.",
        "Zig can build for many systems.",
        "RAG helps the model use real data.",
        "Vector databases find related text quickly.",
        "ChromaDB is small and easy to run.",
    ]
)

# Show the stored collection contents
results = collection.get(include=["documents"])
for doc_id, document in zip(results["ids"], results["documents"]):
    print(f"{doc_id}: {document}")
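The script above only lists what was stored. A similarity query is closer to what the RAG flow needs, so here is a hedged sketch of one. It assumes the port-forward from Step 1 is still active and that the `chromadb` client can compute embeddings for `query_texts` locally with its default embedding function (which may download a model on first use). The helper names `top_documents` and `query_collection` are mine, not part of ChromaDB.

```python
def top_documents(query_result):
    """Flatten the result of a single-text query into (id, document, distance) tuples.

    Chroma returns one inner list per query text, so index 0 selects
    the hits for the only question that was sent.
    """
    ids = query_result["ids"][0]
    documents = query_result["documents"][0]
    distances = query_result["distances"][0]
    return list(zip(ids, documents, distances))


def query_collection(question, n_results=2, host="localhost", port=8000):
    """Ask the ChromaDB server for the documents closest to the question."""
    # Imported lazily so top_documents() stays usable without the package.
    import chromadb

    client = chromadb.HttpClient(host=host, port=port)
    collection = client.get_or_create_collection(name="my_collection")
    return collection.query(
        query_texts=[question],
        n_results=n_results,
        include=["documents", "distances"],
    )


# Usage, with the port-forward running:
#   for doc_id, document, distance in top_documents(query_collection("What is Zig?")):
#       print(f"{doc_id} ({distance:.3f}): {document}")
```

A smaller distance means a closer match, so the first tuple is the best candidate to pass to the language model as context.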
LLM Inference — LLM Server — SmolLM2-135M-Instruct
Situation
The LLM inference live demo is almost done: the embedding service and the vector database are now running. I still need a language model that can respond naturally to user questions, and it must be lightweight enough to run with few resources in my Kubernetes cluster.
Task
Find a lightweight language model, then deploy it in my Kubernetes cluster.
Action
The steps are detailed below:
Step 1: Find the most lightweight language model.
I found that the SmolLM2-135M-Instruct language model can run with just 500m CPU and 256Mi memory.
Step 2: Convert the language model to the GGUF format.
Run the commands below:
hf download HuggingFaceTB/SmolLM2-135M-Instruct --local-dir SmolLM2-135M-Instruct
python convert_hf_to_gguf.py SmolLM2-135M-Instruct --outfile ./SmolLM2-135M-Instruct.gguf --outtype f16
After that, upload the output file to my Hugging Face repository.
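The conversion command above writes an f16 file, while the deployment manifest in Step 3 points at a Q4_K_M file. The rough arithmetic below suggests why a quantized file is the safer fit for a 256Mi memory limit. The parameter count is taken from the model name, and the 4.5 bits per weight for Q4_K_M is an assumed average, not a measured value. Real files keep some tensors at higher precision and the server also needs memory for the context, so treat these numbers as a lower bound.

```python
PARAMS = 135_000_000      # nominal parameter count from the model name (assumption)
MIB = 1024 * 1024         # bytes per MiB
MEMORY_LIMIT_MIB = 256    # container memory limit from the deployment manifest


def weights_mib(params, bits_per_weight):
    """Approximate size of the model weights in MiB."""
    return params * bits_per_weight / 8 / MIB


f16_mib = weights_mib(PARAMS, 16)
q4_mib = weights_mib(PARAMS, 4.5)  # assumed average for Q4_K_M, not measured

print(f"f16 weights:    {f16_mib:.1f} MiB, fits in limit: {f16_mib < MEMORY_LIMIT_MIB}")
print(f"Q4_K_M weights: {q4_mib:.1f} MiB, fits in limit: {q4_mib < MEMORY_LIMIT_MIB}")
```

Under these assumptions the f16 weights alone slightly exceed the limit, while the 4-bit estimate leaves room for the runtime.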
Step 3: Create the following Kubernetes manifest files and apply them to my Kubernetes cluster.
namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: llm-server
certificate.yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: ilyasahsan-xyz
  namespace: llm-server
spec:
  secretName: ilyasahsan-xyz-tls
  dnsNames:
    - ilyasahsan.xyz
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: server
  namespace: llm-server
spec:
  selector:
    matchLabels:
      app: server
  template:
    metadata:
      labels:
        app: server
    spec:
      containers:
        - name: server
          image: ghcr.io/ggml-org/llama.cpp:server
          ports:
            - containerPort: 8080
          env:
            - name: LLAMA_ARG_MODEL_URL
              value: https://huggingface.co/ilyasahsan/GGUF/resolve/main/SmolLM2-135M-Instruct-Q4_K_M.gguf
            - name: LLAMA_ARG_CTX_SIZE
              value: "512"
            - name: LLAMA_ARG_UI
              value: "0"
          resources:
            requests:
              cpu: "250m"
              memory: "200Mi"
            limits:
              cpu: "500m"
              memory: "256Mi"
ingress.yaml
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: llm-server-ratelimit
  namespace: llm-server
spec:
  rateLimit:
    average: 10
    burst: 2
    period: 5m
---
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: llm-server-stripprefix
  namespace: llm-server
spec:
  stripPrefix:
    prefixes:
      - /llm-server
---
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: llm-server
  namespace: llm-server
spec:
  entryPoints:
    - websecure
  routes:
    - match: Host(`ilyasahsan.xyz`) && PathPrefix(`/llm-server`)
      kind: Rule
      middlewares:
        - name: llm-server-stripprefix
        - name: llm-server-ratelimit
          namespace: llm-server
      services:
        - name: server
          port: 80
  tls:
    secretName: ilyasahsan-xyz-tls
service.yaml
apiVersion: v1
kind: Service
metadata:
  name: server
  namespace: llm-server
spec:
  selector:
    app: server
  ports:
    - port: 80
      targetPort: 8080
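The rate-limit middleware in ingress.yaml (average: 10, period: 5m, burst: 2) works out to a sustained rate of one request every 30 seconds per client, with up to two requests allowed back to back. The sketch below models that with a simple token bucket to show how the three settings interact. It illustrates the general technique only and is not Traefik's actual implementation; the class name and the simulated timestamps are hypothetical.

```python
class TokenBucket:
    """Simplified token bucket: refills at average/period, holds at most `burst` tokens."""

    def __init__(self, average, period_seconds, burst):
        self.rate = average / period_seconds  # tokens added per second
        self.capacity = burst                 # maximum tokens the bucket can hold
        self.tokens = float(burst)            # start with a full bucket
        self.last = 0.0                       # time of the previous request, in seconds

    def allow(self, now):
        """Return True if a request arriving at time `now` may pass."""
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# Values from the middleware: 10 requests per 5 minutes, burst of 2.
bucket = TokenBucket(average=10, period_seconds=300, burst=2)

# Simulated arrival times in seconds for a single client.
arrivals = [0, 1, 2, 31, 32]
decisions = [bucket.allow(t) for t in arrivals]
print(decisions)  # [True, True, False, True, False]
```

The first two requests use up the burst, the third is rejected, and roughly 30 seconds later one more request gets through.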
Result
As a result, the LLM server is now running. You can test it with the following command:
curl -X POST https://ilyasahsan.xyz/llm-server/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant. Answer the question."
      },
      {
        "role": "user",
        "content": "Hello! What is the nationality of Lionel Messi?"
      }
    ],
    "temperature": 0.2
  }'
The response from the LLM server should look similar to the following:
{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Lionel Messi is a football player from Argentina."
      }
    }
  ],
  ....
}
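For the live demo, the same request can be sent from Python. This is a minimal sketch that uses only the standard library and mirrors the curl call above. The function names (`build_payload`, `extract_answer`, `ask`) are hypothetical, and the code assumes the response always contains at least one choice, as in the sample response.

```python
import json
import urllib.request

URL = "https://ilyasahsan.xyz/llm-server/v1/chat/completions"


def build_payload(question, temperature=0.2):
    """Build the same JSON body that the curl example sends."""
    return {
        "messages": [
            {"role": "system", "content": "You are a helpful assistant. Answer the question."},
            {"role": "user", "content": question},
        ],
        "temperature": temperature,
    }


def extract_answer(response_body):
    """Pull the assistant text out of a chat completion response."""
    return response_body["choices"][0]["message"]["content"]


def ask(question, url=URL, timeout=30):
    """POST the question to the LLM server and return the answer text."""
    data = json.dumps(build_payload(question)).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return extract_answer(json.load(response))


# Usage (performs a real network call):
#   print(ask("Hello! What is the nationality of Lionel Messi?"))
```

Because of the rate limit on the route, repeated calls in quick succession may be rejected, so a demo client should be ready to handle an HTTP error from `urlopen`.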
Blockers
N/A
Carry-overs
- Create a live demo for LLM Inference
Reflection
N/A