Cached models are currently in beta. Join our Discord if you’d like to provide feedback.
Why use cached models?
- Faster cold starts: A “cold start” refers to the delay between when a request is received by an endpoint with no running workers and when a worker is fully “warmed up” and ready to handle the request. Using cached models can reduce cold start times to just a few seconds, even for large models.
- Reduced costs: You aren’t billed for worker time while your model is being downloaded. This is especially impactful for large models that can take several minutes to load.
- Accelerated deployment: You can deploy cached models instantly without waiting for external downloads or transfers.
- Smaller container images: By decoupling models from your container image, you can create smaller, more focused images that contain only your application logic.
How it works
When you select a cached model for your endpoint, Runpod automatically tries to start your workers on hosts that already contain the selected model. If no cached host machines are available, the system delays starting your workers until the model is downloaded onto the machine where your workers will run, ensuring you still won’t be charged for the download time.
Where models are stored
Cached models are stored on the worker container’s local disk, separate from any attached network volumes. Runpod automatically manages this internal storage to optimize loading speed. The cache persists across requests on the same worker, so once a worker initializes, you’ll see consistent performance. Since the models live on local disk rather than network volumes, they won’t appear on your attached network volumes.
Accessing cached models
Cached models are stored at /runpod-volume/huggingface-cache/hub/. The directory structure follows Hugging Face cache conventions, where forward slashes (/) in the model name are replaced with double dashes (--).
The path structure follows the Hugging Face hub layout, where each model gets its own directory prefixed with models--. For example, meta-llama/Llama-3.2-1B-Instruct maps to the directory shown below.
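Based on that convention, the cached files for this model would live under:

```
/runpod-volume/huggingface-cache/hub/models--meta-llama--Llama-3.2-1B-Instruct
```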
Using cached models in applications
You can access cached models in your application in two ways:
- Direct configuration: Configure your application to load models directly from /runpod-volume/huggingface-cache/hub/. Many frameworks and tools let you specify a custom cache directory for Hugging Face models (see the sketch after this list).
- Symbolic links: Create symbolic links from your application’s expected model directory to the cache location. This is particularly useful for applications like ComfyUI that expect models in specific directories.
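For direct configuration, one common approach, assuming your framework reads the standard Hugging Face cache environment variables, is to point them at the Runpod cache location:

```bash
# Point Hugging Face tooling at Runpod's cached-model location.
# The hub cache is expected at $HF_HOME/hub, which matches the path above.
export HF_HOME=/runpod-volume/huggingface-cache

# Alternatively, set the hub cache path directly:
export HF_HUB_CACHE=/runpod-volume/huggingface-cache/hub
```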
For the symbolic link approach, you could create a link like this:
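The target path below is hypothetical; substitute the directory your application actually reads models from:

```bash
# Link a cached model into the directory the application expects.
# The source follows the cache layout described above; the target is an example.
ln -s /runpod-volume/huggingface-cache/hub/models--meta-llama--Llama-3.2-1B-Instruct \
  /app/models/Llama-3.2-1B-Instruct
```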
Enabling cached models
Follow these steps to select and add a cached model to your Serverless endpoint:
1. Create a new endpoint: Navigate to the Serverless section of the console and click New Endpoint.
2. Configure the model: In the Endpoint Configuration step, scroll down to Model (optional) and add the link for the model you want to use. For example, https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct.
3. Add an access token (if needed): If you’re using a gated model, you’ll need to enter a Hugging Face access token.
4. Deploy the endpoint: Complete your endpoint configuration and click Deploy Endpoint.