Joachim Friberg 0aabfc8a72 Add llama-server and open-webui apps for local LLM inference
- llama-server: llama.cpp REST API server, 8G memory, port 8080
- open-webui: Chat UI connecting to llama-server, 2G memory, port 3000
- Both include x-casaos metadata for ZimaOS app store
- README with model download instructions and API examples
2026-04-19 22:25:22 +02:00

Llama Server

Local LLM inference server built on llama.cpp. Serves GGUF models via an OpenAI-compatible REST API.

Purpose

  • Port: 8080 (TCP)
  • Memory: 8G reservation (7B Q4 models fit in ~6-7GB RAM)
  • Category: AI / LLM inference

CPU-only inference with AVX2/AVX512 auto-detection. No GPU needed.
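
To see which of these instruction sets a given x86_64 host advertises, the CPU flags can be read from /proc/cpuinfo. This is only a sketch for checking the host by hand, not how llama.cpp performs its own detection; the function name simd_flags is made up for illustration, and avx512f is the Linux flag name for the AVX-512 foundation subset.

```shell
# Print the AVX2 / AVX-512 flags found in a cpuinfo-style file.
# The path is a parameter so the function can be pointed at a saved copy.
simd_flags() {
  grep -o -w -E 'avx2|avx512f' "${1:-/proc/cpuinfo}" 2>/dev/null | sort -u | xargs
}

simd_flags
```

An empty line means neither flag was found, which is the expected result on arm64 hosts.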

Model Setup

llama-server does not bundle models. You must download GGUF files manually.

SSH into your ZimaOS device and run:

# Create models directory
mkdir -p /DATA/AppData/llama-server/models

# Example: Download Llama 3.2 3B Q4_K_M (~1.8GB)
curl -L -o /DATA/AppData/llama-server/models/llama-3.2-3b-q4_k_m.gguf \
  "https://huggingface.co/QuantFactory/Llama-3.2-3B-Instruct-GGUF/resolve/main/Llama-3.2-3B-Instruct.Q4_K_M.gguf"

| Model | Size | Quant | RAM needed | Speed (est.) |
|---|---|---|---|---|
| Llama 3.2 3B | 1.8GB | Q4_K_M | ~4GB | ~15-20 tok/s |
| Phi-3.5 Mini 3.8B | 2.2GB | Q4_K_M | ~4GB | ~15-20 tok/s |
| Mistral 7B | 4.1GB | Q4_K_M | ~6-7GB | ~8-12 tok/s |
| Qwen 2.5 7B | 4.4GB | Q4_K_M | ~6-7GB | ~8-12 tok/s |

For 7B models, close other apps to free RAM. The 8G reservation leaves roughly 1-2GB of headroom above the ~6-7GB these models need.
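
Before pointing the server at a new download, two quick checks can help: that the file really is GGUF (a failed download is often an HTML error page), and that it is likely to fit in free memory. The sketch below assumes a Linux host with /proc/meminfo. The function names are hypothetical, and the 2 GiB default headroom is an assumption loosely based on the gap between file size and RAM needed in the table above, not a measured figure.

```shell
# Succeeds only if the file begins with the 4-byte GGUF magic ("GGUF").
is_gguf() {
  [ "$(head -c 4 "$1" 2>/dev/null)" = "GGUF" ]
}

# Available memory in kB, read from a meminfo-style file.
mem_available_kb() {
  awk '/^MemAvailable:/ {print $2}' "${1:-/proc/meminfo}"
}

# model_fits MODEL_KB AVAIL_KB [HEADROOM_KB]
# HEADROOM_KB defaults to 2 GiB (assumed allowance for KV cache and buffers).
model_fits() {
  [ $(( $1 + ${3:-2097152} )) -le "$2" ]
}

model=/DATA/AppData/llama-server/models/llama-3.2-3b-q4_k_m.gguf
if ! is_gguf "$model"; then
  echo "missing or not a GGUF file: $model"
elif model_fits "$(du -k "$model" | cut -f1)" "$(mem_available_kb)"; then
  echo "model should fit in available RAM"
else
  echo "free more RAM or pick a smaller model"
fi
```

A larger CTX_SIZE increases memory use, so pass a bigger third argument to model_fits when raising the context window.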

Environment Variables

| Variable | Default | Description |
|---|---|---|
| MODEL | llama-3.2-3b-q4_k_m.gguf | Model filename in /models |
| CTX_SIZE | 2048 | Context window size (tokens) |
| N_THREADS | 0 | CPU threads (0 = auto) |
| HOST | 0.0.0.0 | Listen address |
| PORT | 8080 | API port |
| MAX_TOKENS | 512 | Max tokens per response |

Change MODEL to match the filename of your downloaded model. Restart the container after changing any variable.
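
As a rough illustration of what these variables control, the sketch below prints the llama-server command line they would plausibly translate to. The mapping is an assumption: the container's actual entrypoint is not shown in this README and may differ. The flags themselves (--model, --ctx-size, --threads, --host, --port, --n-predict) are standard llama-server options; llama_cmd is a hypothetical helper name.

```shell
# Print the llama-server invocation implied by the environment variables.
# Defaults mirror the table above.
llama_cmd() {
  set -- llama-server \
    --model "/models/${MODEL:-llama-3.2-3b-q4_k_m.gguf}" \
    --ctx-size "${CTX_SIZE:-2048}" \
    --host "${HOST:-0.0.0.0}" \
    --port "${PORT:-8080}" \
    --n-predict "${MAX_TOKENS:-512}"
  # Treat 0 as "auto": omit --threads so llama-server picks a value itself.
  if [ "${N_THREADS:-0}" -gt 0 ]; then
    set -- "$@" --threads "$N_THREADS"
  fi
  echo "$@"
}

MODEL=mistral-7b-q4_k_m.gguf CTX_SIZE=4096 llama_cmd
```

Printing the command instead of running it makes it easy to check a configuration change before restarting the container.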

API Testing

Once running, test the API:

# List the models the server exposes
curl http://localhost:8080/v1/models

# Chat completions (OpenAI-compatible)
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.2-3b-q4_k_m.gguf",
    "messages": [{"role": "user", "content": "Hello, who are you?"}],
    "max_tokens": 128
  }'
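
Loading a model can take a while after the container starts, and requests sent too early will fail. A small retry loop can wait for the server before running the calls above. This is a sketch: wait_ready is a hypothetical helper, and the example probe relies on the /health endpoint that llama-server provides, assuming the default port.

```shell
# wait_ready TRIES PROBE...
# Runs PROBE up to TRIES times, one second apart, until it succeeds.
wait_ready() {
  tries=$1
  shift
  while [ "$tries" -gt 0 ]; do
    "$@" >/dev/null 2>&1 && return 0
    tries=$((tries - 1))
    [ "$tries" -gt 0 ] && sleep 1
  done
  return 1
}

# Example: wait up to about a minute for the server to report healthy.
#   wait_ready 60 curl -sf http://localhost:8080/health && echo "server ready"
```

Passing the probe as arguments keeps the loop independent of curl, so any command that exits 0 on success can be used instead.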

Volumes

| Path | Description |
|---|---|
| /models | GGUF model files |
| /logs | Server log output |

Architecture

  • amd64 (Intel/AMD x86_64)
  • arm64 (Apple Silicon, ARM servers)

Security

  • security_opt: no-new-privileges:true
  • cap_drop: ALL
  • CPU-only, no privileged access needed