Dify GPU Hosting Guide 2026
Run Local LLMs with Dify
Host Dify on a GPU server and connect it to Ollama or LocalAI to run Llama 3, Mistral, and other open-source models locally — with zero per-token API costs and complete data privacy.
Why Run Dify on a GPU Server?
Connecting Dify to a locally-hosted LLM via Ollama or LocalAI removes dependence on cloud AI providers entirely. Here is what you gain:
No API Costs
Pay only for the GPU server — not per token. High-volume usage becomes dramatically cheaper.
Data Privacy
Prompts and responses never leave your infrastructure — essential for regulated industries.
Custom Models
Run fine-tuned or domain-specific models that are not available through any public API.
No Rate Limits
Burst as many requests as your GPU can handle — no throttling, no quota errors.
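To put the cost claim in rough numbers, the sketch below compares an hourly GPU rate against what the same token volume would cost through a metered API. Every figure here (GPU rate, API price, throughput) is an illustrative assumption, not a quote:

```python
def api_equivalent_cost_per_hour(tokens_per_second: float, price_per_1m_tokens: float) -> float:
    """Cost of pushing one hour of this throughput through a metered API."""
    tokens_per_hour = tokens_per_second * 3600
    return tokens_per_hour / 1_000_000 * price_per_1m_tokens

# Assumed numbers: a GPU saturated with batched requests at ~1,000 tok/s,
# versus a hypothetical API price of $3.00 per 1M tokens.
gpu_rate = 1.99  # $/hr, e.g. an on-demand A100 (assumed)
api_rate = api_equivalent_cost_per_hour(1000, 3.00)
print(f"GPU: ${gpu_rate:.2f}/hr vs API equivalent: ${api_rate:.2f}/hr")
```

Note that the comparison only favors the GPU at sustained utilization; an idle GPU still bills by the hour, while a metered API costs nothing when quiet.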
GPU Cloud Providers Compared
Prices are approximate on-demand rates as of early 2026. Reserved and spot instances are typically cheaper.
| Provider | GPU | VRAM | Price/hr | Best For |
|---|---|---|---|---|
| Lambda Labs | A10 | 24 GB | $0.75/hr | Development |
| Vast.ai | RTX 4090 | 24 GB | ~$0.35/hr | Budget |
| RunPod | A100 | 80 GB | $1.99/hr | Production |
| CoreWeave | H100 | 80 GB | $2.50/hr | Enterprise |
| Hetzner GPU | A100 | 80 GB | €2.49/hr | EU compliance |
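For budgeting, the on-demand hourly rates above translate to monthly figures by simple multiplication. The sketch below uses an average of 730 hours per month and assumes round-the-clock usage; the rates are the approximate ones from the table:

```python
HOURS_PER_MONTH = 730  # average hours in a month

def monthly_cost(hourly_rate: float, utilization: float = 1.0) -> float:
    """Approximate monthly on-demand cost at a given utilization fraction."""
    return hourly_rate * HOURS_PER_MONTH * utilization

# Approximate rates from the table above (early 2026)
for name, rate in [("Vast.ai RTX 4090", 0.35), ("RunPod A100", 1.99), ("CoreWeave H100", 2.50)]:
    print(f"{name}: ~${monthly_cost(rate):,.0f}/month at 24/7 usage")
```

Dropping the `utilization` argument below 1.0 models spot or part-time usage, which is where reserved and spot pricing can undercut these figures further.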
Install CUDA and NVIDIA Container Toolkit
Before installing Dify or Ollama, you need the NVIDIA CUDA drivers and the Container Toolkit so Docker containers can access the GPU.
Install CUDA Toolkit 12.3
# Check if NVIDIA driver is already installed
nvidia-smi
# If not installed, add the NVIDIA repository
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update
# Install CUDA toolkit (includes drivers)
sudo apt install -y cuda-toolkit-12-3
# Reboot required after driver install
sudo reboot
Verify GPU and Configure Docker
# After reboot, verify GPU is detected
nvidia-smi
# Install NVIDIA Container Toolkit (for Docker GPU access)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
# Add the toolkit's apt repository (required before apt can find the package)
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update && sudo apt install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
# Verify containers can access the GPU
sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
After running nvidia-smi, you should see your GPU listed with its driver version and VRAM. If the Docker test above prints the same GPU table, containers can use --gpus all and you are ready for the next step.
Install Ollama and Pull LLM Models
Ollama is the easiest way to serve open-source LLMs on your GPU. It automatically detects CUDA and uses the GPU for inference.
Install Ollama and Pull Models
# Install Ollama (one-line installer)
curl -fsSL https://ollama.com/install.sh | sh
# Verify Ollama is running
ollama list
# Pull LLM models
ollama pull llama3.1:8b
ollama pull mistral:7b
ollama pull codellama:13b
# Test a model
ollama run llama3.1:8b "Hello, what can you do?"
Bind Ollama to All Network Interfaces
By default Ollama only listens on localhost. To make it reachable from Dify's Docker containers, you need to bind it to 0.0.0.0:
# Edit Ollama systemd service to bind to all interfaces
sudo systemctl edit ollama --force --full
# Find the [Service] section and add:
# Environment="OLLAMA_HOST=0.0.0.0:11434"
# Apply changes
sudo systemctl daemon-reload
sudo systemctl restart ollama
Configure docker-compose.override.yaml
Create or edit docker-compose.override.yaml in your Dify directory so containers can resolve host.docker.internal to the host machine on Linux:
services:
  api:
    extra_hosts:
      - "host.docker.internal:host-gateway"
  worker:
    extra_hosts:
      - "host.docker.internal:host-gateway"
Note: On macOS and Windows, host.docker.internal resolves automatically. On Linux, the extra_hosts entry above is required.
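Before wiring Dify up, you can sanity-check that Ollama is reachable at the bound address. This sketch queries Ollama's /api/tags endpoint (the same inventory `ollama list` shows) and returns None if nothing is listening; adjust the base URL to your server's address:

```python
import json
from urllib.request import urlopen
from urllib.error import URLError

def list_ollama_models(base_url: str = "http://localhost:11434", timeout: float = 3.0):
    """Return model names from Ollama's /api/tags endpoint, or None if unreachable."""
    try:
        with urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            data = json.load(resp)
        return [m["name"] for m in data.get("models", [])]
    except (URLError, OSError, ValueError):
        return None

models = list_ollama_models()
print("Ollama models:", models if models is not None else "unreachable")
```

If this prints "unreachable" from another machine but works on the server itself, the OLLAMA_HOST binding or a firewall rule is the likely culprit.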
Connect Dify to Ollama
With Ollama running and reachable, add it as a model provider inside Dify:
- Open your Dify instance and click your avatar in the top-right corner.
- Go to Settings, then Model Provider.
- Scroll down to find Ollama and click Add Model.
- Set the Base URL to http://host.docker.internal:11434.
- Enter the Model Name exactly as listed by ollama list (e.g. llama3.1:8b).
- Click Save — Dify will test the connection. A green checkmark confirms success.
- The model is now available in all your Dify apps and workflows.
Tip: Repeat the Add Model steps for each model you pulled. You can add as many Ollama models as you like — each appears as a separate selectable model within Dify.
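Under the hood, Dify talks to Ollama's REST API. As a sketch of what a chat request looks like (field names follow Ollama's /api/chat endpoint; the model name must match `ollama list` exactly):

```python
import json

def build_chat_request(model: str, user_message: str, stream: bool = False) -> str:
    """Build the JSON body for a POST to Ollama's /api/chat endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "stream": stream,
    }
    return json.dumps(payload)

body = build_chat_request("llama3.1:8b", "Hello, what can you do?")
print(body)
# POST this to http://host.docker.internal:11434/api/chat
# with Content-Type: application/json
```

Knowing this shape is handy for debugging: if Dify's connection test fails, sending the same body with curl from inside the api container isolates whether the problem is networking or configuration.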
LocalAI — An OpenAI-Compatible Alternative
If you prefer an OpenAI-compatible API surface, LocalAI is an excellent alternative to Ollama. It exposes endpoints like /v1/chat/completions so you can use Dify's existing OpenAI integration without any extra configuration.
Run LocalAI with Docker (GPU)
# Run LocalAI with Docker (GPU-enabled)
docker run -d --gpus all -p 8080:8080 -v /path/to/models:/models --name local-ai localai/localai:latest-aio-gpu-nvidia-cuda-12
Once running, configure Dify with Model Provider: OpenAI-API-compatible, set the base URL to http://host.docker.internal:8080/v1, and use any model name you have loaded in LocalAI. No API key is required for local deployments.
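Because LocalAI speaks the OpenAI wire format, requests and responses use the standard chat-completions shapes. A minimal offline sketch of both (the sample response below is a hand-written illustration of the format, not real model output):

```python
import json

def openai_chat_body(model: str, prompt: str) -> str:
    """Build a standard OpenAI-format /v1/chat/completions request body."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    })

def extract_reply(response_json: str) -> str:
    """Pull the assistant's reply out of an OpenAI-format response."""
    return json.loads(response_json)["choices"][0]["message"]["content"]

# Abridged response shape for illustration (not real model output)
sample = json.dumps({"choices": [{"message": {"role": "assistant", "content": "Hi!"}}]})
print(extract_reply(sample))
```

Any client library that targets the OpenAI API can be pointed at http://host.docker.internal:8080/v1 and will produce and consume exactly these shapes, which is why Dify's OpenAI-API-compatible provider works without extra glue.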
Model Recommendations by Use Case
Choose your model based on available VRAM and the quality-speed tradeoff your application needs.
| Model | VRAM Required | Speed | Best For |
|---|---|---|---|
| llama3.1:8b | ~6 GB | Fast | General purpose, chat |
| mistral:7b | ~5 GB | Very fast | Speed-critical apps |
| codellama:13b | ~10 GB | Medium | Code generation |
| llama3.1:70b | ~40 GB | Slow | High-quality outputs |
| mixtral:8x7b | ~26 GB | Medium | Balanced quality/speed |
VRAM Quick Reference
These are approximate requirements for Ollama's default model builds, which use 4-bit quantization (Q4). Full-precision fp16 inference needs roughly 2 GB of VRAM per billion parameters (about 16 GB for an 8B model), so quantization is what lets larger models fit on smaller GPUs.
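As a rule of thumb, inference VRAM scales with parameter count times bytes per weight, plus headroom for the KV cache and activations. A rough estimator along those lines (the 20% overhead factor is an assumption, not a measured constant):

```python
def estimate_vram_gb(params_billions: float, bits_per_weight: int = 4, overhead: float = 1.2) -> float:
    """Rough VRAM estimate: weights (params × bits/8 bytes) plus ~20% overhead."""
    weight_gb = params_billions * bits_per_weight / 8
    return weight_gb * overhead

# Sanity checks against the table above (Q4 = 4 bits, fp16 = 16 bits)
print(f"8B  @ Q4:   {estimate_vram_gb(8, 4):.1f} GB")
print(f"8B  @ fp16: {estimate_vram_gb(8, 16):.1f} GB")
print(f"70B @ Q4:   {estimate_vram_gb(70, 4):.1f} GB")
```

The Q4 estimates line up with the table's figures for the 8B and 70B models; real usage also grows with context length, so leave extra headroom for long-context workloads.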