Model Storage

You must store your model before you can deploy it. You can store a model in an S3 bucket, a Persistent Volume Claim (PVC), or Open Container Initiative (OCI) containers.

In cloud-native inference scenarios, model storage determines the startup speed, version management granularity, and scalability of inference services. KServe loads models through two main mechanisms:

  • Storage Initializer (init container): for S3 and PVC, downloads or mounts the model data before the main container starts.
  • Sidecar (modelcar): for OCI images, loads models within seconds by leveraging the container runtime's layered image caching.

Using S3 Object Storage for model storage

This is the most commonly used mode. Credential management is implemented through a Secret that carries KServe-specific annotations.

Authentication Configuration

It is recommended to create a separate ServiceAccount and Secret for each project.

S3 Key Configuration Parameters

| Configuration Item | Actual Value | Description |
| --- | --- | --- |
| Endpoint | your-s3-service-ip:your-s3-port | IP and port pointing to the private MinIO service |
| Region | (not specified) | Usually defaults to us-east-1; KServe uses the default value if none is detected |
| HTTPS Enabled | 0 | Encryption disabled for an internal test/demo environment |
| Authentication Method | Static Access Key / Secret Key | Managed through the Secret named minio-creds |
| Namespace Isolation | demo-space | Permissions limited to this namespace, following multi-tenant isolation principles |
apiVersion: v1
data:
  AWS_ACCESS_KEY_ID: YOUR_BASE64_ENCODED_ACCESS_KEY
  AWS_SECRET_ACCESS_KEY: YOUR_BASE64_ENCODED_SECRET_KEY
kind: Secret
metadata:
  annotations:
    serving.kserve.io/s3-endpoint: your_s3_service_ip:your_s3_port
    serving.kserve.io/s3-usehttps: "0"
  name: minio-creds
  namespace: demo-space
type: Opaque
---

apiVersion: v1
kind: ServiceAccount
metadata:
  name: sa-models
  namespace: demo-space
secrets:
- name: minio-creds
  1. Replace YOUR_BASE64_ENCODED_ACCESS_KEY with your actual Base64-encoded AWS access key ID.
  2. Replace YOUR_BASE64_ENCODED_SECRET_KEY with your actual Base64-encoded AWS secret access key.
  3. Replace your_s3_service_ip:your_s3_port with the actual IP address and port of your S3 service.
  4. Set serving.kserve.io/s3-usehttps to "1" if your S3 service uses HTTPS, or "0" if it uses HTTP.
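
The Base64 values in the Secret can be produced with `base64`. A minimal sketch, using placeholder credentials rather than real keys:

```shell
# Encode placeholder credentials for the Secret (values are examples, not real keys)
printf %s 'minioadmin' | base64    # AWS_ACCESS_KEY_ID
printf %s 'minioadmin' | base64    # AWS_SECRET_ACCESS_KEY
```

Use `printf %s` (rather than a bare `echo`) so that no trailing newline is encoded into the stored value.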

Deploy Inference Service

kind: InferenceService
apiVersion: serving.kserve.io/v1beta1
metadata:
  annotations:
    aml-model-repo: Qwen2.5-0.5B-Instruct
    aml-pipeline-tag: text-generation
    serving.kserve.io/deploymentMode: Standard
  labels:
    aml.cpaas.io/runtime-type: vllm
  name: s3-demo
  namespace: demo-space
spec:
  predictor:
    maxReplicas: 1
    minReplicas: 1
    model:
      modelFormat:
        name: transformers
      name: ''
      protocolVersion: v2
      resources:
        limits:
          cpu: '2'
          ephemeral-storage: 10Gi
          memory: 8Gi
        requests:
          cpu: '2'
          memory: 4Gi
      runtime: aml-vllm-0.11.2-cpu
      storageUri: s3://models/Qwen2.5-0.5B-Instruct
    securityContext:
      seccompProfile:
        type: RuntimeDefault
    serviceAccountName: sa-models
  1. Replace Qwen2.5-0.5B-Instruct with your actual model name.
  2. aml.cpaas.io/runtime-type: vllm specifies the inference runtime type. For more information about custom inference runtimes, see Extend Inference Runtimes.
  3. Replace aml-vllm-0.11.2-cpu with the runtime name that is already installed in your platform (corresponding to a ClusterServingRuntime CRD instance).
  4. storageUri: s3://models/Qwen2.5-0.5B-Instruct specifies the S3 bucket URI where the model is stored.
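
For reference, the part of the storageUri after `s3://` is interpreted as the bucket name followed by the object prefix to download. A shell sketch of that split, using the URI from the example above (illustrative only, not KServe's actual code):

```shell
# Split a KServe S3 storageUri into bucket and object prefix
uri="s3://models/Qwen2.5-0.5B-Instruct"
path="${uri#s3://}"      # -> models/Qwen2.5-0.5B-Instruct
bucket="${path%%/*}"     # first segment is the bucket
prefix="${path#*/}"      # remainder is the object prefix to download
echo "$bucket $prefix"
```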

Using OCI containers for model storage

As an alternative to storing a model in an S3 bucket or PVC, you can store models in Open Container Initiative (OCI) containers. Deploying models from OCI containers is also known as modelcars in KServe. This approach is ideal for offline environments and enterprise internal registries such as Quay or Harbor.

Using OCI containers for model storage can help you:

  • Reduce startup times by avoiding downloading the same model multiple times.
  • Reduce disk space usage by reducing the number of models downloaded locally.
  • Improve startup performance, because model images can be pre-fetched onto nodes ahead of deployment.

Model Image Packaging

Create a Containerfile to build the model image.

Option 1: Using Busybox Base Image (Alauda AI Recommendation)

# Use lightweight busybox as base image
FROM busybox

# Create directory for model and set permissions
RUN mkdir -p /models && chmod 775 /models

# Copy local model folder contents to /models directory in image
COPY models/ /models/

Option 2: Using UBI Micro Base Image (Red Hat Recommendation)

Note

  • Specify a base image that provides a shell (for example, ubi9-micro). You cannot specify an empty image that does not provide a shell, such as scratch, because KServe uses the shell to ensure the model files are accessible to the model server.
  • Change the ownership of the copied model files and grant read permissions to the root group. This ensures that the model server can access the files, since containers may run with a random user ID and the root group ID.
FROM registry.access.redhat.com/ubi9/ubi-micro:latest

# Copy model files and set ownership to root group
COPY --chown=0:0 models/ /models/

# Grant read and execute permissions to all users
RUN chmod -R a=rX /models

# Run as nobody user for security
USER 65534
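
The `chmod -R a=rX` step is worth understanding: the capital `X` grants execute permission only where traversal needs it (directories), while regular files end up read-only. A quick local demonstration, using a throwaway directory in place of real model files:

```shell
# Show what `a=rX` produces: directories become r-x for all users, regular files r--
tmp=$(mktemp -d)
mkdir -p "$tmp/models"
printf 'stub' > "$tmp/models/weights.bin"   # stand-in for a model file
chmod -R a=rX "$tmp/models"
ls -l "$tmp/models/weights.bin"             # -r--r--r-- ...
```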

Building and Pushing the Model Image

After creating the Containerfile, build and push the image to your registry:

  1. Create a temporary directory and copy your model files into a models/ subfolder:

    cd $(mktemp -d)
    mkdir -p models
    cp -r /path/to/your-model/* models/
  2. Build the OCI container image:

    # Using Podman
    podman build --format=oci -t <registry>/<repository>:<tag> .
    
    # Using Docker
    docker build -t <registry>/<repository>:<tag> .
  3. Push the image to your container registry:

    # Using Podman
    podman push <registry>/<repository>:<tag>
    
    # Using Docker
    docker push <registry>/<repository>:<tag>

    Note If your repository is private, ensure that you are authenticated to the registry before pushing the image.
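
The `<registry>/<repository>:<tag>` reference you push is exactly what goes into the inference service's storageUri, prefixed with `oci://`. A small sketch with placeholder names:

```shell
# Compose the image reference and the matching storageUri (names are placeholders)
registry="registry.example.com"
repository="demo/qwen-oci"
tag="v1.0.0"
image="${registry}/${repository}:${tag}"
echo "oci://${image}"
```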

Deploy Inference Service

Prerequisites

  • The namespace where the inference service is located must have its Pod Security Admission (PSA) enforce level set to privileged.

KServe supports the OCI protocol natively:

kind: InferenceService
apiVersion: serving.kserve.io/v1beta1
metadata:
  annotations:
    aml-model-repo: Qwen2.5-0.5B-Instruct
    aml-pipeline-tag: text-generation
    serving.kserve.io/deploymentMode: Standard
  labels:
    aml.cpaas.io/runtime-type: vllm
  name: oci-demo
  namespace: demo-space
spec:
  predictor:
    maxReplicas: 1
    minReplicas: 1
    model:
      modelFormat:
        name: transformers
      protocolVersion: v2
      resources:
        limits:
          cpu: '2'
          ephemeral-storage: 10Gi
          memory: 8Gi
        requests:
          cpu: '2'
          memory: 4Gi
      runtime: aml-vllm-0.11.2-cpu
      storageUri: oci://build-harbor.alauda.cn/test/qwen-oci:v1.0.0
    securityContext:
      seccompProfile:
        type: RuntimeDefault
  1. Replace Qwen2.5-0.5B-Instruct with your actual model name.
  2. aml.cpaas.io/runtime-type: vllm specifies the inference runtime type. For more information about custom inference runtimes, see Extend Inference Runtimes.
  3. Replace aml-vllm-0.11.2-cpu with the runtime name that is already installed in your platform (corresponding to a ClusterServingRuntime CRD instance).
  4. storageUri: oci://build-harbor.alauda.cn/test/qwen-oci:v1.0.0 specifies the OCI image URI with tag where the model is stored.

Using PVC for model storage

Uploading model files to a PVC

When deploying a model, you can serve it from a preexisting Persistent Volume Claim (PVC) where your model files are stored. You can upload your local model files to a PVC in the IDE that you access from a running workbench.

Prerequisites

  • You have access to the Alauda AI dashboard.

  • You have access to a project that has a running workbench.

  • You have created a persistent volume claim (PVC).

  • The workbench is attached to the persistent volume claim (PVC).

    For instructions on creating a workbench and attaching a PVC, see Create Workbench.

  • You have the model files saved on your local machine.

Procedure

Follow these steps to upload your model files to the PVC within your workbench:

  1. From the Alauda AI dashboard, click Workbench to enter the workbench list page.

  2. Find your running workbench instance and click the Connect button to enter the workbench.

  3. In your workbench IDE, navigate to the file browser:

    • In JupyterLab, this is the Files tab in the left sidebar.
    • In code-server, this is the Explorer view in the left sidebar.
  4. In the file browser, navigate to the home directory. This directory represents the root of your attached PVC.

    Note Any files or folders that you create or upload to this folder persist in the PVC.

  5. Optional: Create a new folder to organize your models:

    • In the file browser, right-click within the home directory and select New Folder.
    • Name the folder (for example, models).
    • Double-click the new models folder to enter it.
  6. Upload your model files to the current folder:

    • Using JupyterLab:
      • Click the Upload button in the file browser toolbar.
      • In the file selection dialog, navigate to and select the model files from your local computer. Click Open.
      • Wait for the upload to complete.
    • Using code-server:
      • Drag the model files directly from your local file explorer and drop them into the file browser pane in the target folder within code-server.
      • Wait for the upload process to complete.

Verification

Confirm that your files appear in the file browser at the path where you uploaded them.

Next Steps

When deploying a model from a PVC, set the storageUri in the format pvc://<pvc-name>/<optional-path>. For example:

  • pvc://model-pvc — loads from the root of the PVC.
  • pvc://model-pvc/models/Qwen2.5-0.5B-Instruct — loads from a specific subdirectory.
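
A shell sketch of how such a URI splits into the claim name and the optional subpath (illustrative only, not KServe's actual code):

```shell
# Split a pvc:// storageUri into claim name and optional subpath
uri="pvc://model-pvc/models/Qwen2.5-0.5B-Instruct"
rest="${uri#pvc://}"           # -> model-pvc/models/Qwen2.5-0.5B-Instruct
claim="${rest%%/*}"            # PVC name to mount
subpath="${rest#"$claim"}"     # whatever follows the claim name
subpath="${subpath#/}"         # empty when loading from the PVC root
echo "$claim ${subpath:-<root>}"
```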

Deploy Inference Service

kind: InferenceService
apiVersion: serving.kserve.io/v1beta1
metadata:
  annotations:
    aml-model-repo: Qwen2.5-0.5B-Instruct
    aml-pipeline-tag: text-generation
    serving.kserve.io/deploymentMode: Standard
  labels:
    aml.cpaas.io/runtime-type: vllm
  name: pvc-demo-1
  namespace: demo-space
spec:
  predictor:
    maxReplicas: 1
    minReplicas: 1
    model:
      modelFormat:
        name: transformers
      protocolVersion: v2
      resources:
        limits:
          cpu: '2'
          ephemeral-storage: 10Gi
          memory: 8Gi
        requests:
          cpu: '2'
          memory: 4Gi
      runtime: aml-vllm-0.11.2-cpu
      storageUri: pvc://model-pvc
    securityContext:
      seccompProfile:
        type: RuntimeDefault
  1. Replace Qwen2.5-0.5B-Instruct with your actual model name.
  2. aml.cpaas.io/runtime-type: vllm specifies the inference runtime type. For more information about custom inference runtimes, see Extend Inference Runtimes.
  3. Replace aml-vllm-0.11.2-cpu with the runtime name that is already installed in your platform (corresponding to a ClusterServingRuntime CRD instance).
  4. storageUri: pvc://model-pvc specifies the PVC name where the model is stored.