Serve t5-small on Kubernetes for ML Inference

This blog covers how I built the following playground from scratch:

https://ilyasahsan.xyz/t5-small

Introduction

This is a guidance on how I deployed an inference service using the t5-small machine learning model.

There is no token cost because the machine learning model is running within my Kubernetes cluster.

Prerequisites

  1. Kubernetes cluster
  2. Helm
  3. Kubernetes CLI

Architecture Overview

Everything runs on one Kubernetes cluster (K3s) on a single Hetzner VM.

t5-small architecture diagram

  1. cert-manager — issues TLS certificates for HTTPS.
  2. KServe namespace — the control plane; apply the inference-service.yaml into a running model server.
  3. t5-small namespace — where the model actually runs and serves requests.
  4. Traefik — the load balancer handle the network.

Step 1: Deploy Required Helm Chart Packages

install cert-manager

helm install cert-manager jetstack/cert-manager \
    --namespace cert-manager \
    --create-namespace \
    --set crds.enabled=true

install kserve-crd

helm install kserve-crd oci://ghcr.io/kserve/charts/kserve-crd \
    -n kserve \
    --version v0.18.0 \
    --create-namespace

install kserve-resources

helm install kserve oci://ghcr.io/kserve/charts/kserve-resources \
    -n kserve \
    --version v0.18.0 \
    --set kserve.controller.deploymentMode=Standard \
    --set kserve.controller.gateway.ingressGateway.enableGatewayApi=true \
    --set kserve.controller.gateway.ingressGateway.kserveGateway=kserve/kserve-ingress-gateway

Ensure these installed packages running with the following command:

kubectl wait --for=condition=Ready pods --all -n cert-manager --timeout=300s
kubectl wait --for=condition=Ready pods --all -n kserve --timeout=300s

Step 2: Create Kubernetes Manifest files

Create the following YAML files:

  1. namespace-t5-small.yaml
  2. certificate.yaml
  3. inference-service.yaml
  4. ingress.yaml
  5. serving-runtime.yaml

The content of each file are follows:

file: namespace-t5-small.yaml

apiVersion: v1
kind: Namespace
metadata:
  name: t5-small

file: certificate.yaml

apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: ilyasahsan-xyz
  namespace: t5-small
spec:
  secretName: ilyasahsan-xyz-tls
  dnsNames:
    - ilyasahsan.xyz
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer

file: inference-service.yaml

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: t5-small-inference
  namespace: t5-small
  annotations:
    serving.kserve.io/deploymentMode: RawDeployment
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      args:
        - --model_name=t5-small
        - --model_id=google-t5/t5-small
        - --dtype=float32
        - --backend=huggingface

file: ingress.yaml

apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: t5-small-stripprefix
  namespace: t5-small
spec:
  stripPrefix:
    prefixes:
      - /t5-small/inference
---
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: t5-small
  namespace: t5-small
spec:
  entryPoints:
    - websecure
  routes:
    - match: Host(`ilyasahsan.xyz`) && PathPrefix(`/t5-small/inference`)
      kind: Rule
      middlewares:
        - name: t5-small-stripprefix
      services:
        - name: t5-small-inference-predictor
          port: 80
  tls:
    secretName: ilyasahsan-xyz-tls

file: serving-runtime.yaml

apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: kserve-huggingfaceserver
  namespace: t5-small
spec:
  supportedModelFormats:
    - name: huggingface
      version: "1"
      autoSelect: true
      priority: 1
  protocolVersions:
    - v1
    - v2
  containers:
    - name: kserve-container
      image: kserve/huggingfaceserver:v0.15.2
      args:
        - --model_name={{.Name}}
        - --model_dir=/mnt/models
        - --http_port=8080
      resources:
        requests:
          cpu: "250m"
          memory: "512Mi"
        limits:
          cpu: "1"
          memory: "1Gi"

Step 3: Apply the Kubernetes Manifests.

Create a Kubernetes Namespace

kubectl apply -f default/namespace-t5-small.yaml

Apply the manifest

kubectl apply -f <path-to-manifest-directory>

Ensure the inference service running with the following command:

kubectl wait --for=condition=Ready pods --all -n t5-small --timeout=300s

Conclusion

Test the inference service to translate a simple sentence with the following command:

curl -s https://ilyasahsan.xyz/t5-small/inference/openai/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "t5-small", "prompt": "translate English to German: The house is wonderful.", "stream":false, "max_tokens": 30 }'

Or you can access the playground on the following link

https://ilyasahsan.xyz/t5-small

References

  1. Research: exploring-transfer-learning-with-t5-the-text-to-text-transfer-transformer
  2. Machine Learning Model: t5-small
  3. Inference Platform: KServe
  4. Serving Runtime: huggingfaceserver