Updated March 2026

Dify GPU Hosting Guide 2026
Run Local LLMs with Dify

Host Dify on a GPU server and connect it to Ollama or LocalAI to run Llama 3, Mistral, and other open-source models locally — with zero per-token API costs and complete data privacy.

Why Run Dify on a GPU Server?

Connecting Dify to a locally-hosted LLM via Ollama or LocalAI removes dependence on cloud AI providers entirely. Here is what you gain:

💰 No API Costs: Pay only for the GPU server, not per token. High-volume usage becomes dramatically cheaper.

🔒 Data Privacy: Prompts and responses never leave your infrastructure, which is essential for regulated industries.

🧩 Custom Models: Run fine-tuned or domain-specific models that are not available through any public API.

🚀 No Rate Limits: Send as many requests as your GPU can handle, with no throttling and no quota errors.

GPU Cloud Providers Compared

Prices are approximate on-demand rates as of early 2026. Reserved and spot instances are typically cheaper.

| Provider | GPU | VRAM | Price/hr | Best For |
|---|---|---|---|---|
| Lambda Labs | A10 | 24 GB | $0.75/hr | Development |
| Vast.ai | RTX 4090 | 24 GB | ~$0.35/hr | Budget |
| RunPod | A100 | 80 GB | $1.99/hr | Production |
| CoreWeave | H100 | 80 GB | $2.50/hr | Enterprise |
| Hetzner GPU | A100 | 80 GB | €2.49/hr | EU compliance |
Step 1: Install CUDA and NVIDIA Container Toolkit

Before installing Dify or Ollama, you need the NVIDIA CUDA drivers and the Container Toolkit so Docker containers can access the GPU.

Install CUDA Toolkit 12.3

# Check if NVIDIA driver is already installed
nvidia-smi

# If not installed, add the NVIDIA repository
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update

# Install CUDA toolkit (includes drivers)
sudo apt install -y cuda-toolkit-12-3

# Reboot required after driver install
sudo reboot

Verify GPU and Configure Docker

# After reboot, verify GPU is detected
nvidia-smi

# Install NVIDIA Container Toolkit (for Docker GPU access)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
sudo apt update && sudo apt install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

After running nvidia-smi, you should see your GPU listed with its driver version and VRAM. If Docker can now use --gpus all, you are ready for the next step.
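The quickest end-to-end check is to run nvidia-smi inside a throwaway container. The image tag below is one of NVIDIA's published CUDA 12 base images; any recent nvidia/cuda base tag should work the same way:

```shell
# If the same GPU table prints from inside the container,
# Docker GPU passthrough is configured correctly
docker run --rm --gpus all nvidia/cuda:12.3.2-base-ubuntu22.04 nvidia-smi
```

If this fails with "could not select device driver", re-run the nvidia-ctk configure step and restart Docker.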

Step 2: Install Ollama and Pull LLM Models

Ollama is the easiest way to serve open-source LLMs on your GPU. It automatically detects CUDA and uses the GPU for inference.

Install Ollama and Pull Models

# Install Ollama (one-line installer)
curl -fsSL https://ollama.com/install.sh | sh

# Verify Ollama is running
ollama list

# Pull LLM models
ollama pull llama3.1:8b
ollama pull mistral:7b
ollama pull codellama:13b

# Test a model
ollama run llama3.1:8b "Hello, what can you do?"

Bind Ollama to All Network Interfaces

By default Ollama only listens on localhost. To make it reachable from Dify's Docker containers, you need to bind it to 0.0.0.0:

# Edit Ollama systemd service to bind to all interfaces
sudo systemctl edit ollama --force --full

# Find the [Service] section and add:
# Environment="OLLAMA_HOST=0.0.0.0:11434"

# Apply changes
sudo systemctl daemon-reload
sudo systemctl restart ollama
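For reference, after editing, the [Service] section of the unit should contain the extra Environment line:

```ini
# /etc/systemd/system/ollama.service (excerpt)
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
```

Note that binding to 0.0.0.0 exposes port 11434 on every interface, so firewall it off from the public internet if your server has a public IP.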

Configure docker-compose.override.yaml

Create or edit docker-compose.override.yaml in your Dify directory so containers can resolve host.docker.internal to the host machine on Linux:

services:
  api:
    extra_hosts:
      - "host.docker.internal:host-gateway"
  worker:
    extra_hosts:
      - "host.docker.internal:host-gateway"

Note: On macOS and Windows, host.docker.internal resolves automatically. On Linux, the extra_hosts entry above is required.
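You can confirm the containers actually reach Ollama from inside the Dify api service; Ollama's root endpoint answers with a plain "Ollama is running". This is a sketch assuming the standard Dify compose service name api and that curl is available in the image:

```shell
# From inside the Dify api container, hit Ollama on the host
docker compose exec api curl -s http://host.docker.internal:11434
```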

Step 3: Connect Dify to Ollama

With Ollama running and reachable, add it as a model provider inside Dify:

  1. Open your Dify instance and click your avatar in the top-right corner.
  2. Go to Settings then Model Provider.
  3. Scroll down to find Ollama and click Add Model.
  4. Set the Base URL to http://host.docker.internal:11434.
  5. Enter the Model Name exactly as listed by ollama list (e.g. llama3.1:8b).
  6. Click Save — Dify will test the connection. A green checkmark confirms success.
  7. The model is now available in all your Dify apps and workflows.

Tip: Repeat steps 3–6 for each model you pulled. You can add as many Ollama models as you like; each appears as a separate selectable model within Dify.

Step 4: LocalAI — An OpenAI-Compatible Alternative

If you prefer an OpenAI-compatible API surface, LocalAI is an excellent alternative to Ollama. It exposes endpoints like /v1/chat/completions so you can use Dify's existing OpenAI integration without any extra configuration.

Run LocalAI with Docker (GPU)

# Run LocalAI with Docker (GPU-enabled)
docker run -d --gpus all -p 8080:8080 -v /path/to/models:/models --name local-ai localai/localai:latest-aio-gpu-nvidia-cuda-12

Once running, configure Dify with Model Provider: OpenAI-API-compatible, set the base URL to http://host.docker.internal:8080/v1, and use any model name you have loaded in LocalAI. No API key is required for local deployments.
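Once the container is up, you can sanity-check the OpenAI-compatible surface directly with curl. The model name below is a placeholder; substitute one listed by the /v1/models endpoint:

```shell
# List the models LocalAI has loaded
curl -s http://localhost:8080/v1/models

# Send a chat completion using the standard OpenAI request format
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "your-model-name",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
```

If both calls return JSON, Dify's OpenAI-API-compatible provider will work against the same base URL.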

Model Recommendations by Use Case

Choose your model based on available VRAM and the quality-speed tradeoff your application needs.

| Model | VRAM Required | Speed | Best For |
|---|---|---|---|
| llama3.1:8b | ~6 GB | Fast | General purpose, chat |
| mistral:7b | ~5 GB | Very fast | Speed-critical apps |
| codellama:13b | ~10 GB | Medium | Code generation |
| llama3.1:70b | ~40 GB | Slow | High-quality outputs |
| mixtral:8x7b | ~26 GB | Medium | Balanced quality/speed |

VRAM Quick Reference

~6 GB: 7B models (e.g. Llama 3.1 8B, Mistral 7B)
~10 GB: 13B models (e.g. CodeLlama 13B)
~20 GB: 34B models (e.g. CodeLlama 34B)
~40 GB: 70B models (e.g. Llama 3.1 70B)

These are approximate requirements for the 4-bit quantized (Q4) builds that Ollama pulls by default. Full-precision fp16 inference needs roughly 2 GB of VRAM per billion parameters (about 16 GB for an 8B model), so quantization is what makes these models fit on smaller GPUs.
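As a rough rule of thumb, VRAM ≈ parameters (in billions) × bits-per-weight ÷ 8, plus around 20% overhead for the KV cache and runtime. A small sketch of that arithmetic (the 1.2 overhead factor is an assumption for estimation, not an Ollama constant):

```shell
# Estimate VRAM in GB from model size (billions of params) and quantization bits
estimate_vram() {
  awk -v p="$1" -v bits="$2" 'BEGIN { printf "%.1f", p * bits / 8 * 1.2 }'
}

estimate_vram 8 4    # 8B model at Q4: prints 4.8
echo
estimate_vram 8 16   # same model at fp16: prints 19.2
echo
```

Compare against the table above: the Q4 estimates land close to the listed figures once you add a little headroom for context length.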

Related Guides

Self-Host Dify Guide
Complete walkthrough for self-hosting Dify on your own server or VPS.
Dify Docker Setup
Step-by-step Docker Compose configuration for running Dify in production.
Best Dify Hosting Providers
Comparison of managed and cloud hosting options for Dify in 2026.