Dify GPU Hosting Guide 2026
Run Local LLMs with Dify
Host Dify on a GPU server and connect it to Ollama or LocalAI to run Llama 3, Mistral, and other open-source models locally — with zero per-token API costs and complete data privacy.
Why Run Dify on a GPU Server?
Connecting Dify to a locally-hosted LLM via Ollama or LocalAI removes dependence on cloud AI providers entirely. Here is what you gain:
No API Costs
Pay only for the GPU server — not per token. High-volume usage becomes dramatically cheaper.
Data Privacy
Prompts and responses never leave your infrastructure — essential for regulated industries.
Custom Models
Run fine-tuned or domain-specific models that are not available through any public API.
No Rate Limits
Burst as many requests as your GPU can handle — no throttling, no quota errors.
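To put the cost claim in rough numbers, the sketch below compares an hourly GPU rate against what the same token volume would cost through a metered API. Every figure here (GPU rate, API price, throughput) is an illustrative assumption, not a quote:

```python
def api_equivalent_cost_per_hour(tokens_per_second: float, price_per_1m_tokens: float) -> float:
    """Cost of pushing one hour of this throughput through a metered API."""
    tokens_per_hour = tokens_per_second * 3600
    return tokens_per_hour / 1_000_000 * price_per_1m_tokens

# Assumed numbers: a GPU saturated with batched requests at ~1,000 tok/s,
# versus a hypothetical API price of $3.00 per 1M tokens.
gpu_rate = 1.99  # $/hr, e.g. an on-demand A100 (assumed)
api_rate = api_equivalent_cost_per_hour(1000, 3.00)
print(f"GPU: ${gpu_rate:.2f}/hr vs API equivalent: ${api_rate:.2f}/hr")
```

Note that the comparison only favors the GPU at sustained utilization; an idle GPU still bills by the hour, while a metered API costs nothing when quiet.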
GPU Cloud Providers Compared
Prices are approximate on-demand rates as of early 2026. Reserved and spot instances are typically cheaper.
| Provider | GPU | VRAM | Price/hr | Best For |
|---|---|---|---|---|
| Lambda Labs | A10 | 24 GB | $0.75/hr | Development |
| Vast.ai | RTX 4090 | 24 GB | ~$0.35/hr | Budget |
| RunPod | A100 | 80 GB | $1.99/hr | Production |
| CoreWeave | H100 | 80 GB | $2.50/hr | Enterprise |
| Hetzner GPU | A100 | 80 GB | €2.49/hr | EU compliance |
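For budgeting, the on-demand hourly rates above translate to monthly figures by simple multiplication. The sketch below uses an average of 730 hours per month and assumes round-the-clock usage; the rates are the approximate ones from the table:

```python
HOURS_PER_MONTH = 730  # average hours in a month

def monthly_cost(hourly_rate: float, utilization: float = 1.0) -> float:
    """Approximate monthly on-demand cost at a given utilization fraction."""
    return hourly_rate * HOURS_PER_MONTH * utilization

# Approximate rates from the table above (early 2026)
for name, rate in [("Vast.ai RTX 4090", 0.35), ("RunPod A100", 1.99), ("CoreWeave H100", 2.50)]:
    print(f"{name}: ~${monthly_cost(rate):,.0f}/month at 24/7 usage")
```

Dropping the `utilization` argument below 1.0 models spot or part-time usage, which is where reserved and spot pricing can undercut these figures further.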
Install CUDA and NVIDIA Container Toolkit
Before installing Dify or Ollama, you need the NVIDIA CUDA drivers and the Container Toolkit so Docker containers can access the GPU.
Install CUDA Toolkit 12.3
# Check if NVIDIA driver is already installed
nvidia-smi
# If not installed, add the NVIDIA repository
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update
# Install CUDA toolkit (includes drivers)
sudo apt install -y cuda-toolkit-12-3
# Reboot required after driver install
sudo reboot
Verify GPU and Configure Docker
# After reboot, verify GPU is detected
nvidia-smi
# Install NVIDIA Container Toolkit (for Docker GPU access)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
# Add the toolkit's apt repository (required before apt can find the package)
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update && sudo apt install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
# Verify containers can access the GPU
sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
After running nvidia-smi, you should see your GPU listed with its driver version and VRAM. If the Docker test above prints the same GPU table, containers can use --gpus all and you are ready for the next step.
Install Ollama and Pull LLM Models
Ollama is the easiest way to serve open-source LLMs on your GPU. It automatically detects CUDA and uses the GPU for inference.
Install Ollama and Pull Models
# Install Ollama (one-line installer)
curl -fsSL https://ollama.com/install.sh | sh
# Verify Ollama is running
ollama list
# Pull LLM models
ollama pull llama3.1:8b
ollama pull mistral:7b
ollama pull codellama:13b
# Test a model
ollama run llama3.1:8b "Hello, what can you do?"
Bind Ollama to All Network Interfaces
By default Ollama only listens on localhost. To make it reachable from Dify's Docker containers, you need to bind it to 0.0.0.0:
# Edit Ollama systemd service to bind to all interfaces
sudo systemctl edit ollama --force --full
# Find the [Service] section and add:
# Environment="OLLAMA_HOST=0.0.0.0:11434"
# Apply changes
sudo systemctl daemon-reload
sudo systemctl restart ollama
Configure docker-compose.override.yaml
Create or edit docker-compose.override.yaml in your Dify directory so containers can resolve host.docker.internal to the host machine on Linux:
services:
  api:
    extra_hosts:
      - "host.docker.internal:host-gateway"
  worker:
    extra_hosts:
      - "host.docker.internal:host-gateway"
Note: On macOS and Windows, host.docker.internal resolves automatically. On Linux, the extra_hosts entry above is required.
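Before wiring Dify up, you can sanity-check that Ollama is reachable at the bound address. This sketch queries Ollama's /api/tags endpoint (the same inventory `ollama list` shows) and returns None if nothing is listening; adjust the base URL to your server's address:

```python
import json
from urllib.request import urlopen
from urllib.error import URLError

def list_ollama_models(base_url: str = "http://localhost:11434", timeout: float = 3.0):
    """Return model names from Ollama's /api/tags endpoint, or None if unreachable."""
    try:
        with urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            data = json.load(resp)
        return [m["name"] for m in data.get("models", [])]
    except (URLError, OSError, ValueError):
        return None

models = list_ollama_models()
print("Ollama models:", models if models is not None else "unreachable")
```

If this prints "unreachable" from another machine but works on the server itself, the OLLAMA_HOST binding or a firewall rule is the likely culprit.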
Connect Dify to Ollama
With Ollama running and reachable, add it as a model provider inside Dify:
- Open your Dify instance and click your avatar in the top-right corner.
- Go to Settings, then Model Provider.
- Scroll down to find Ollama and click Add Model.
- Set the Base URL to http://host.docker.internal:11434.
- Enter the Model Name exactly as listed by ollama list (e.g. llama3.1:8b).
- Click Save — Dify will test the connection. A green checkmark confirms success.
- The model is now available in all your Dify apps and workflows.
Tip: Repeat the Add Model steps for each model you pulled. You can add as many Ollama models as you like — each appears as a separate selectable model within Dify.
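Under the hood, Dify talks to Ollama's REST API. As a sketch of what a chat request looks like (field names follow Ollama's /api/chat endpoint; the model name must match `ollama list` exactly):

```python
import json

def build_chat_request(model: str, user_message: str, stream: bool = False) -> str:
    """Build the JSON body for a POST to Ollama's /api/chat endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "stream": stream,
    }
    return json.dumps(payload)

body = build_chat_request("llama3.1:8b", "Hello, what can you do?")
print(body)
# POST this to http://host.docker.internal:11434/api/chat
# with Content-Type: application/json
```

Knowing this shape is handy for debugging: if Dify's connection test fails, sending the same body with curl from inside the api container isolates whether the problem is networking or configuration.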
LocalAI — An OpenAI-Compatible Alternative
If you prefer an OpenAI-compatible API surface, LocalAI is an excellent alternative to Ollama. It exposes endpoints like /v1/chat/completions so you can use Dify's existing OpenAI integration without any extra configuration.
Run LocalAI with Docker (GPU)
# Run LocalAI with Docker (GPU-enabled)
docker run -d --gpus all -p 8080:8080 -v /path/to/models:/models --name local-ai localai/localai:latest-aio-gpu-nvidia-cuda-12
Once running, configure Dify with Model Provider: OpenAI-API-compatible, set the base URL to http://host.docker.internal:8080/v1, and use any model name you have loaded in LocalAI. No API key is required for local deployments.
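Because LocalAI speaks the OpenAI wire format, requests and responses use the standard chat-completions shapes. A minimal offline sketch of both (the sample response below is a hand-written illustration of the format, not real model output):

```python
import json

def openai_chat_body(model: str, prompt: str) -> str:
    """Build a standard OpenAI-format /v1/chat/completions request body."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    })

def extract_reply(response_json: str) -> str:
    """Pull the assistant's reply out of an OpenAI-format response."""
    return json.loads(response_json)["choices"][0]["message"]["content"]

# Abridged response shape for illustration (not real model output)
sample = json.dumps({"choices": [{"message": {"role": "assistant", "content": "Hi!"}}]})
print(extract_reply(sample))
```

Any client library that targets the OpenAI API can be pointed at http://host.docker.internal:8080/v1 and will produce and consume exactly these shapes, which is why Dify's OpenAI-API-compatible provider works without extra glue.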
Model Recommendations by Use Case
Choose your model based on available VRAM and the quality-speed tradeoff your application needs.
| Model | VRAM Required | Speed | Best For |
|---|---|---|---|
| llama3.1:8b | ~6 GB | Fast | General purpose, chat |
| mistral:7b | ~5 GB | Very fast | Speed-critical apps |
| codellama:13b | ~10 GB | Medium | Code generation |
| llama3.1:70b | ~40 GB | Slow | High-quality outputs |
| mixtral:8x7b | ~26 GB | Medium | Balanced quality/speed |
VRAM Quick Reference
These are approximate requirements for Ollama's default model builds, which use 4-bit quantization (Q4). Full-precision fp16 inference needs roughly 2 GB of VRAM per billion parameters (about 16 GB for an 8B model), so quantization is what lets larger models fit on smaller GPUs.
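As a rule of thumb, inference VRAM scales with parameter count times bytes per weight, plus headroom for the KV cache and activations. A rough estimator along those lines (the 20% overhead factor is an assumption, not a measured constant):

```python
def estimate_vram_gb(params_billions: float, bits_per_weight: int = 4, overhead: float = 1.2) -> float:
    """Rough VRAM estimate: weights (params × bits/8 bytes) plus ~20% overhead."""
    weight_gb = params_billions * bits_per_weight / 8
    return weight_gb * overhead

# Sanity checks against the table above (Q4 = 4 bits, fp16 = 16 bits)
print(f"8B  @ Q4:   {estimate_vram_gb(8, 4):.1f} GB")
print(f"8B  @ fp16: {estimate_vram_gb(8, 16):.1f} GB")
print(f"70B @ Q4:   {estimate_vram_gb(70, 4):.1f} GB")
```

The Q4 estimates line up with the table's figures for the 8B and 70B models; real usage also grows with context length, so leave extra headroom for long-context workloads.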