Joachim Friberg 0aabfc8a72 Add llama-server and open-webui apps for local LLM inference

- llama-server: llama.cpp REST API server, 8G memory, port 8080
- open-webui: Chat UI connecting to llama-server, 2G memory, port 3000
- Both include x-casaos metadata for ZimaOS app store
- README with model download instructions and API examples

2026-04-19 22:25:22 +02:00

README.md

Llama Server

Local LLM inference server using llama.cpp. Serves GGUF models via an OpenAI-compatible REST API.

Purpose

Port: 8080 (TCP)
Memory: 8G reservation (7B Q4 models fit in ~6-7GB RAM)
Category: AI / LLM inference

CPU-only inference with AVX2/AVX512 auto-detection. No GPU needed.
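To see which SIMD level the host CPU advertises before starting the container, a minimal sketch along these lines may help. It assumes a Linux host that exposes `/proc/cpuinfo`; the function name `cpu_simd_level` is hypothetical and not part of llama.cpp, whose own detection may differ.

```shell
# Report the highest SIMD level listed in a /proc/cpuinfo-style file.
# Prints "avx512", "avx2", or "none". The path argument is optional.
cpu_simd_level() {
  cpuinfo="${1:-/proc/cpuinfo}"
  # Take the first "flags" line; it is identical for every core on x86.
  flags="$(grep -m1 '^flags' "$cpuinfo" 2>/dev/null)"
  # Pad with spaces so whole-word matches also work at the end of the line.
  case " $flags " in
    *" avx512f "*) echo avx512 ;;
    *" avx2 "*)    echo avx2 ;;
    *)             echo none ;;
  esac
}
```

Calling `cpu_simd_level` with no argument reads the live `/proc/cpuinfo`. On arm64 hosts there is no `flags` line, so the sketch prints `none`.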

Model Setup

llama-server does not bundle models. You must download GGUF files manually.

SSH into your ZimaOS device and run:

# Create models directory
mkdir -p /DATA/AppData/llama-server/models

# Example: Download Llama 3.2 3B Q4_K_M (~1.8GB)
curl -L -o /DATA/AppData/llama-server/models/llama-3.2-3b-q4_k_m.gguf \
  "https://huggingface.co/QuantFactory/Llama-3.2-3B-Instruct-GGUF/resolve/main/Llama-3.2-3B-Instruct.Q4_K_M.gguf"
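A download that silently saved an error page instead of the model is easy to miss. GGUF files start with the four ASCII bytes `GGUF`, so a quick sanity check can be sketched as below. The helper name `is_gguf` is hypothetical, and the check only looks at the magic bytes, not at whether the file is complete.

```shell
# Succeed (exit 0) if the file starts with the 4-byte GGUF magic.
is_gguf() {
  [ -f "$1" ] || return 1
  # Read the first four bytes; drop NULs so the shell does not warn about them.
  magic="$(head -c 4 "$1" | tr -d '\0')"
  [ "$magic" = "GGUF" ]
}
```

For example, `is_gguf /DATA/AppData/llama-server/models/llama-3.2-3b-q4_k_m.gguf && echo "looks like GGUF"` prints the message only when the header matches.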

Recommended Models for 16GB RAM

Model	Size	Quant	RAM Needed	Speed (est.)
Llama 3.2 3B	1.8GB	Q4_K_M	~4GB	~15-20 tok/s
Phi-3.5 Mini 3.8B	~2.3GB	Q4_K_M	~4GB	~15-20 tok/s
Mistral 7B	4.1GB	Q4_K_M	~6-7GB	~8-12 tok/s
Qwen 2.5 7B	4.4GB	Q4_K_M	~6-7GB	~8-12 tok/s

For 7B models, close other apps to free RAM. The 8G reservation leaves roughly 1-2GB of headroom above the ~6-7GB a 7B Q4_K_M model needs.
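Whether a model from the table fits can be checked against the memory the host currently reports as available. The sketch below assumes a Linux host with a `MemAvailable` line in `/proc/meminfo`; the function names `mem_available_mib` and `has_free_mib` are hypothetical.

```shell
# Print MemAvailable in MiB from a /proc/meminfo-style file (value is in kB).
mem_available_mib() {
  meminfo="${1:-/proc/meminfo}"
  awk '/^MemAvailable:/ { print int($2 / 1024) }' "$meminfo"
}

# Succeed if at least $1 MiB is available. $2 optionally overrides the path.
has_free_mib() {
  needed="$1"
  avail="$(mem_available_mib "${2:-/proc/meminfo}")"
  [ -n "$avail" ] && [ "$avail" -ge "$needed" ]
}
```

As a usage example, `has_free_mib 7168 || echo "close other apps first"` warns when less than 7GB (7168 MiB) is free, which is the upper end of the 7B estimate above.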

Environment Variables

Variable	Default	Description
`MODEL`	`llama-3.2-3b-q4_k_m.gguf`	Model filename in `/models`
`CTX_SIZE`	`2048`	Context window size (tokens)
`N_THREADS`	`0`	CPU threads (0 = auto)
`HOST`	`0.0.0.0`	Listen address
`PORT`	`8080`	API port
`MAX_TOKENS`	`512`	Max tokens per response

Change `MODEL` to match the filename of your downloaded model, then restart the container for the change to take effect.
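This README does not show how the container turns these variables into a `llama-server` command line, so the following is only a sketch of one plausible mapping. The flags (`--model`, `--ctx-size`, `--threads`, `--host`, `--port`, `--n-predict`) are real llama.cpp server options, but the pairing of each variable with a flag, and the function name `build_llama_cmd`, are assumptions.

```shell
# Print the llama-server command implied by the environment variables,
# falling back to the defaults from the table above.
# ASSUMPTION: the variable-to-flag mapping is illustrative, not the image's entrypoint.
build_llama_cmd() {
  model="${MODEL:-llama-3.2-3b-q4_k_m.gguf}"
  ctx="${CTX_SIZE:-2048}"
  threads="${N_THREADS:-0}"
  host="${HOST:-0.0.0.0}"
  port="${PORT:-8080}"
  max_tokens="${MAX_TOKENS:-512}"

  cmd="llama-server --model /models/$model --ctx-size $ctx --host $host --port $port --n-predict $max_tokens"
  # N_THREADS=0 means "auto", so the flag is only added for an explicit count.
  if [ "$threads" -gt 0 ]; then
    cmd="$cmd --threads $threads"
  fi
  printf '%s\n' "$cmd"
}
```

Running `MODEL=mistral-7b-q4_k_m.gguf build_llama_cmd` prints the command with only the model path changed, which makes it easy to see what a given override would affect.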

API Testing

Once running, test the API:

# List the loaded model (also confirms the server is reachable)
curl http://localhost:8080/v1/models

# Chat completions (OpenAI-compatible)
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.2-3b-q4_k_m.gguf",
    "messages": [{"role": "user", "content": "Hello, who are you?"}],
    "max_tokens": 128
  }'
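Loading a multi-gigabyte model can take a while, so requests sent right after the container starts may fail. The llama.cpp server exposes a `/health` endpoint that returns an error status until the model is ready; a minimal polling sketch, with the hypothetical helper name `wait_for_server`, could look like this.

```shell
# Poll <base_url>/health once per second until it answers with a success status.
# $1 = base URL (default http://localhost:8080), $2 = max attempts (default 60).
# Returns 0 when the server is ready, 1 if it never became ready.
wait_for_server() {
  base_url="${1:-http://localhost:8080}"
  max_tries="${2:-60}"
  tries=0
  while [ "$tries" -lt "$max_tries" ]; do
    # -f makes curl fail on HTTP errors such as 503 while the model loads.
    if curl -fsS -o /dev/null --max-time 2 "$base_url/health" 2>/dev/null; then
      return 0
    fi
    sleep 1
    tries=$((tries + 1))
  done
  return 1
}
```

A typical use is `wait_for_server http://localhost:8080 120 && curl http://localhost:8080/v1/models`, which only queries the API once the health check passes.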

Volumes

Path	Description
`/models`	GGUF model files
`/logs`	Server log output

Architecture

amd64 (Intel/AMD x86_64)
arm64 (Apple Silicon, ARM servers)
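To confirm that a host matches one of the two supported architectures, the kernel's machine name can be mapped to the Docker platform name. This is a small sketch; `docker_arch` is a hypothetical helper name.

```shell
# Map a `uname -m` machine name to the Docker architecture name.
# With no argument, the current host is inspected.
docker_arch() {
  machine="${1:-$(uname -m)}"
  case "$machine" in
    x86_64|amd64)  echo amd64 ;;
    aarch64|arm64) echo arm64 ;;
    *)             echo unsupported; return 1 ;;
  esac
}
```

Calling `docker_arch` on the ZimaOS device prints `amd64` or `arm64` when the host is supported, and returns a non-zero status otherwise.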

Security

security_opt: no-new-privileges:true
cap_drop: ALL
CPU-only, no privileged access needed
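Whether these settings were actually applied to a running container can be checked from `docker inspect` output. The sketch below reads that output on stdin and looks for both options; the helper name `hardening_ok` and the container name `llama-server` in the usage comment are assumptions.

```shell
# Succeed if the text on stdin mentions both hardening options.
# Intended input: the SecurityOpt and CapDrop fields from `docker inspect`.
hardening_ok() {
  out="$(cat)"
  case "$out" in
    *no-new-privileges:true*) ;;
    *) return 1 ;;
  esac
  case "$out" in
    *ALL*) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical usage ("llama-server" is an assumed container name):
# docker inspect --format '{{.HostConfig.SecurityOpt}} {{.HostConfig.CapDrop}}' llama-server | hardening_ok
```

The check is a plain substring match, so it is meant as a quick sanity test rather than a full audit of the container's privileges.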