Sunday, June 21, 2026

Priorities

Deploy a vector database and a language model server for the LLM inference live demo

Work Log

LLM Inference — Vector Database — ChromaDB

Situation

I plan to create an LLM Inference live demo, and I've already built an embedded service (details: here). The demo will follow the retrieval-augmented generation (RAG) pattern, which uses a vector database to store and retrieve relevant content.
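The RAG flow described above can be sketched roughly as follows. This is a minimal sketch, not the demo's actual implementation: the function name build_rag_messages, the prompt wording, and the sample documents are all hypothetical, and it assumes the language model accepts chat-style messages with role and content fields.

```python
def build_rag_messages(question, retrieved_docs):
    """Build chat messages that ground the answer in retrieved documents.

    Hypothetical helper: the prompt wording is an assumption, not part of the demo.
    """
    # Number each retrieved document so the model can tell them apart.
    context = "\n".join(f"{i}. {doc}" for i, doc in enumerate(retrieved_docs, start=1))
    return [
        {
            "role": "system",
            "content": "Answer the question using only the context below.\n"
                       f"Context:\n{context}",
        },
        {"role": "user", "content": question},
    ]


# Example: documents that a vector database lookup might return (made-up values).
messages = build_rag_messages(
    "What is ChromaDB?",
    ["ChromaDB is small and easy to run.", "Vector databases find related text quickly."],
)
for message in messages:
    print(message["role"], "->", message["content"])
```

The retrieval step supplies retrieved_docs, and the resulting messages list is what would be sent to the language model.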

Task

Deploy a lightweight vector database in my Kubernetes cluster to prove the live demo works.

Action

I deployed a vector database service that uses ChromaDB. See the Kubernetes manifest files below for details:

namespace.yaml

apiVersion: v1
kind: Namespace
metadata:
  name: chroma

pvc.yaml

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: chroma-data
  namespace: chroma
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Mi

service.yaml

apiVersion: v1
kind: Service
metadata:
  name: chroma
  namespace: chroma
spec:
  selector:
    app: chroma
  ports:
    - name: http
      port: 8000
      targetPort: 8000

deployment.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: chroma
  namespace: chroma
spec:
  replicas: 1
  strategy:
    type: Recreate
  selector:
    matchLabels:
      app: chroma
  template:
    metadata:
      labels:
        app: chroma
    spec:
      enableServiceLinks: false
      containers:
        - name: chroma
          image: chromadb/chroma:1.5.9
          ports:
            - containerPort: 8000
          volumeMounts:
            - name: data
              mountPath: /data
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: chroma-data

After that, apply the manifest files to the Kubernetes cluster using the command below:

kubectl apply -f <path-to-manifest-directory>

Result

As a result, the vector database is now available in my Kubernetes cluster. Follow the steps below to test the database.

Step 1. Use port forwarding to access the vector database:

kubectl port-forward <chroma-pod-name> 8000:8000 -n chroma

Step 2. Create a Python script to validate the vector database:

import chromadb

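# Connect through the port-forward opened in Step 1
client = chromadb.HttpClient(host="localhost", port=8000)
collection = client.get_or_create_collection(name="my_collection")

# Store sample documents; no embeddings are passed, so they are computed
# by the collection's embedding function
collection.add(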
    ids=["zig-1", "zig-2", "zig-3", "rag-1", "rag-2", "k8s-1"],
    documents=[
        "Zig is a fast programming language.",
        "Zig keeps the code clear and simple.",
        "Zig can build for many systems.",
        "RAG helps the model use real data.",
        "Vector databases find related text quickly.",
        "ChromaDB is small and easy to run.",
    ]
)

# Show the stored collection contents
results = collection.get(include=["documents"])
for doc_id, document in zip(results["ids"], results["documents"]):
    print(f"{doc_id}: {document}")
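Listing the stored documents confirms that writes work, but not that similarity search does. A minimal sketch of a follow-up check is shown here, assuming the same port-forward and collection name as the script above. The helper name top_matches and the sample question are hypothetical; collection.query is the ChromaDB call doing the actual work.

```python
def top_matches(collection, question, n_results=2):
    """Return (id, document, distance) tuples for the closest stored documents."""
    results = collection.query(
        query_texts=[question],
        n_results=n_results,
        include=["documents", "distances"],
    )
    # query() returns one result list per query text; only one text was sent.
    return list(
        zip(results["ids"][0], results["documents"][0], results["distances"][0])
    )


def main():
    # Imported here so the helper above can be exercised without a server.
    import chromadb

    client = chromadb.HttpClient(host="localhost", port=8000)
    collection = client.get_or_create_collection(name="my_collection")
    # Hypothetical question; smaller distances mean closer matches.
    for doc_id, document, distance in top_matches(collection, "Which language is simple?"):
        print(f"{doc_id} (distance {distance:.3f}): {document}")
```

Call main() while the port-forward from Step 1 is running. If retrieval works, the Zig-related documents would be expected near the top for this question, although the exact ranking depends on the embedding function in use.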

LLM Inference — LLM Server — SmolLM2-135M-Instruct

Situation

The LLM Inference live demo is almost done: the embedded service and the vector database are now running. I still need a language model that can answer user questions naturally, and it must be lightweight enough to run with minimal resources in my Kubernetes cluster.

Task

Find a lightweight language model, then deploy it in my Kubernetes cluster.

Action

The steps are detailed below:

Step 1: Find the most lightweight language model.

I found that the SmolLM2-135M-Instruct language model can run with just 500m CPU and 256Mi memory.

Step 2: Convert the language model to the GGUF format.

Run the commands below:

hf download HuggingFaceTB/SmolLM2-135M-Instruct --local-dir SmolLM2-135M-Instruct
python convert_hf_to_gguf.py SmolLM2-135M-Instruct --outfile ./SmolLM2-135M-Instruct.gguf --outtype f16

After that, upload the output file to my Hugging Face repository.
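A rough size estimate helps relate the conversion output to the 256Mi memory limit from Step 1. The sketch below is back-of-the-envelope arithmetic only. It assumes about 135 million parameters (read from the model name) and a hypothetical average of 4.5 bits per weight for a 4-bit quantized file. It ignores the KV cache and runtime overhead, and real quantized files can be larger because some tensors are kept at higher precision.

```python
def weight_size_mib(parameter_count, bits_per_weight):
    """Approximate size of the model weights in MiB (weights only)."""
    return parameter_count * bits_per_weight / 8 / 2**20


PARAMETERS = 135_000_000  # assumption: taken from the "135M" in the model name
MEMORY_LIMIT_MIB = 256    # memory figure quoted in Step 1

candidates = [
    ("f16", 16),
    ("4-bit quantization (hypothetical 4.5-bit average)", 4.5),
]
for label, bits in candidates:
    size = weight_size_mib(PARAMETERS, bits)
    verdict = "fits" if size < MEMORY_LIMIT_MIB else "does not fit"
    print(f"{label}: about {size:.0f} MiB of weights, {verdict} within {MEMORY_LIMIT_MIB} MiB")
```

Under these assumptions the f16 weights alone land slightly above 256 MiB, which may be why the deployment manifest references a Q4_K_M file rather than the f16 output.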

Step 3: Create the following Kubernetes manifest files and apply them to my Kubernetes cluster.

namespace.yaml

apiVersion: v1
kind: Namespace
metadata:
  name: llm-server

certificate.yaml

apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: ilyasahsan-xyz
  namespace: llm-server
spec:
  secretName: ilyasahsan-xyz-tls
  dnsNames:
    - ilyasahsan.xyz
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer

deployment.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: server
  namespace: llm-server
spec:
  selector:
    matchLabels:
      app: server
  template:
    metadata:
      labels:
        app: server
    spec:
      containers:
        - name: server
          image: ghcr.io/ggml-org/llama.cpp:server
          ports:
            - containerPort: 8080
          env:
            - name: LLAMA_ARG_MODEL_URL
              value: https://huggingface.co/ilyasahsan/GGUF/resolve/main/SmolLM2-135M-Instruct-Q4_K_M.gguf
            - name: LLAMA_ARG_CTX_SIZE
              value: "512"
            - name: LLAMA_ARG_UI
              value: "0"
          resources:
            requests:
              cpu: "250m"
              memory: "200Mi"
            limits:
              cpu: "500m"
              memory: "256Mi"

ingress.yaml

apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: llm-server-ratelimit
  namespace: llm-server
spec:
  rateLimit:
    average: 10
    burst: 2
    period: 5m
---
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: llm-server-stripprefix
  namespace: llm-server
spec:
  stripPrefix:
    prefixes:
      - /llm-server
---
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: llm-server
  namespace: llm-server
spec:
  entryPoints:
    - websecure
  routes:
    - match: Host(`ilyasahsan.xyz`) && PathPrefix(`/llm-server`)
      kind: Rule
      middlewares:
        - name: llm-server-stripprefix
        - name: llm-server-ratelimit
          namespace: llm-server
      services:
        - name: server
          port: 80
  tls:
    secretName: ilyasahsan-xyz-tls

service.yaml

apiVersion: v1
kind: Service
metadata:
  name: server
  namespace: llm-server
spec:
  selector:
    app: server
  ports:
    - port: 80
      targetPort: 8080
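The rate-limit middleware in ingress.yaml is easier to reason about with a small model. As I understand the Traefik documentation, the allowed rate is average divided by period, so average: 10 with period: 5m works out to 10 requests per 300 seconds, and burst: 2 caps how many requests can pass back to back. The class below is a simplified token bucket written for illustration. It is an assumption about the behaviour, not Traefik's actual implementation, and the names TokenBucket and effective_rate are hypothetical.

```python
def effective_rate(average, period_seconds):
    """Requests per second allowed by an average/period pair."""
    return average / period_seconds


class TokenBucket:
    """Simplified token bucket: refills at a fixed rate, holds at most `burst` tokens."""

    def __init__(self, rate_per_second, burst):
        self.rate_per_second = rate_per_second
        self.burst = burst
        self.tokens = float(burst)  # start with a full bucket
        self.last_time = 0.0

    def allow(self, now):
        """Return True if a request arriving at time `now` (in seconds) may pass."""
        elapsed = now - self.last_time
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate_per_second)
        self.last_time = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


rate = effective_rate(average=10, period_seconds=5 * 60)
print(f"about one request every {1 / rate:.0f} seconds")

limiter = TokenBucket(rate, burst=2)
for arrival in (0, 1, 2, 31, 32, 65):  # made-up arrival times in seconds
    print(arrival, "allowed" if limiter.allow(arrival) else "rejected")
```

In this model a client can send two requests immediately, after which it has to wait roughly 30 seconds for each further request.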

Result

As a result, the LLM server is now running. You can test it with the following command:

curl -X POST https://ilyasahsan.xyz/llm-server/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful assistant. Answer the question."
        },
        {
            "role": "user",
            "content": "Hello! What is the nationality of Lionel Messi?"
        }
    ],
    "temperature": 0.2
}'

The response from the LLM server should look similar to the following:

{
    "choices": [
        {
            "finish_reason": "stop",
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "Lionel Messi is a football player from Argentina."
            }
        }
    ],
    ....
}
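For the live demo, the same request can be made from Python. The sketch below uses only the standard library and mirrors the curl call above. The function names ask and extract_answer and the 60-second timeout are hypothetical choices, and the code assumes the response has the shape shown in the sample.

```python
import json
import urllib.request


def extract_answer(response_body):
    """Return the assistant's text from a chat completions response."""
    return response_body["choices"][0]["message"]["content"]


def ask(base_url, question, timeout=60):
    """Send one question to the chat completions endpoint and return the answer."""
    payload = {
        "messages": [
            {"role": "system", "content": "You are a helpful assistant. Answer the question."},
            {"role": "user", "content": question},
        ],
        "temperature": 0.2,
    }
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return extract_answer(json.load(response))
```

For example, print(ask("https://ilyasahsan.xyz/llm-server", "Hello! What is the nationality of Lionel Messi?")) sends the same question as the curl command. Given the rate limit on the route, repeated calls in quick succession may be rejected.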

Blockers

N/A

Carry-overs

Create a live demo for LLM Inference

Reflection

N/A