Files
zima-apps/.kilo/plans/1776627951342-lucky-tiger.md
T
Joachim Friberg 42a5d231b8 Add Gitea bot (mimir) configuration for tea-CLI operations
- Add AGENTS.md section 11 documenting mimir bot user for Gitea
- Store token via tea logins system with repository:write and user:read scopes
- Document common tea commands for branch/PR creation and issue management
- Enable agents to create branches, commits and PRs via tea-CLI
2026-04-20 13:24:57 +02:00

5.6 KiB

Plan: Local LLM Zima App (Intel NUC8)

Context

  • Hardware: Intel NUC8 i7, 16GB RAM, 500GB SSD
  • Goal: Zima app for local LLM inference with web UI
  • Constraints: Intel Iris GPU cannot be used for LLM offload; CPU-only inference
  • Decisions:
    • Include OpenWebUI (two-container solution)
    • 8G memory reservation (allows 7B Q4 models)
    • App name: llama-server

Technology Decision

vLLM — REJECTED

  • Requires NVIDIA CUDA GPU
  • Cannot run on Intel NUC

llama.cpp (llama-server) — SELECTED

  • CPU-only, AVX2/AVX512 optimized
  • Built-in REST API server
  • Minimal footprint, fast for quantized models
  • Best fit for NUC8 constraints

LocalAI — BACKUP OPTION

  • More features (TTS, image gen, multi-model)
  • Can backend to llama.cpp
  • Heavier; only choose if extra features needed
  • Modern chat UI for LLM
  • Docker-based, easy to deploy alongside
  • Can be separate Zima app or documented companion

Architecture: Two Zima Apps

┌─────────────────────────┐     ┌─────────────────────────┐
│  llama-server           │     │  open-webui             │
│  - REST API :8080        │────▶│  - Chat UI :3000        │
│  - Serves model          │     │  - Connects to LLM API  │
└─────────────────────────┘     └─────────────────────────┘

Both are separate Zima apps, deployed independently. OpenWebUI references http://llama-server:8080 via Docker internal networking.

App 1: llama-server

  • Container: ghcr.io/ggerganov/llama.cpp:server
  • Port: 8080
  • Memory: 8G reservation

App 2: open-webui

  • Container: ghcr.io/open-webui/open-webui:main
  • Port: 3000
  • Memory: 2G reservation
  • Environment: OLLAMA_BASE_URL=http://llama-server:8080

App: llama-server

Container: ghcr.io/ggerganov/llama.cpp:server

Environment Variables:

Variable Default Description
MODEL (required) Model filename in /models
CTX_SIZE 2048 Context window size
N_THREADS auto CPU threads (auto = all)
HOST 0.0.0.0 Listen address
PORT 8080 API port
MAX_TOKENS 512 Max tokens to generate

Volumes:

Container Description
/models Model files (GGUF format)
/DATA/AppData/$AppID/logs Server logs

Ports:

Container Protocol Description
8080 TCP llama.cpp REST API

Resources:

  • Memory reservation: 8G (allows 7B Q4 models)

Security:

  • security_opt: no-new-privileges:true
  • cap_drop: ALL
  • No privileged needed (CPU-only)

Model Download (Documented in README)

Users download models manually:

# Example: Download Llama 3.2 3B Q4_K_M
curl -L -o /DATA/AppData/llama-server/models/llama-3.2-3b-q4_k_m.gguf \
  "https://huggingface.co/QuantFactory/Llama-3.2-3B-Instruct-GGUF/resolve/main/Llama-3.2-3B-Instruct.Q4_K_M.gguf"

Recommended Models for 16GB RAM:

Model Size Quant RAM Needed Speed (est)
Llama 3.2 3B 1.8GB Q4_K_M ~4GB ~15-20 tok/s
Phi-3.5 Mini 3B 1.8GB Q4_K_M ~4GB ~15-20 tok/s
Mistral 7B 4.1GB Q4_K_M ~6-7GB ~8-12 tok/s
Qwen 2.5 7B 4.4GB Q4_K_M ~6-7GB ~8-12 tok/s

App: open-webui

Container: ghcr.io/open-webui/open-webui:main

Environment Variables:

Variable Default Description
OLLAMA_BASE_URL http://llama-server:8080 LLM API endpoint
WEBUI_PORT 3000 Web UI port

Ports:

Container Protocol Description
3000 TCP OpenWebUI

Resources:

  • Memory reservation: 2G

Notes:

  • Connects to http://llama-server:8080 via Docker internal networking
  • Requires llama-server app to be running first

File Structure

Apps/llama-server/
├── docker-compose.yaml
├── README.md
└── HOW_TO_VERIFY.md (optional)

Apps/open-webui/
├── docker-compose.yaml
├── README.md
└── HOW_TO_VERIFY.md (optional)

Implementation Steps

llama-server

  1. Create Apps/llama-server/ directory
  2. Write docker-compose.yaml with:
    • Image: ghcr.io/ggerganov/llama.cpp:server
    • 8G memory reservation
    • Port 8080
    • Model volume at /models
    • Env vars: MODEL, CTX_SIZE, N_THREADS, HOST, PORT
  3. Write README.md with:
    • Model download instructions
    • First-run setup
    • API testing examples
    • Performance tips for NUC8
  4. Validate with ./scripts/validate-appstore.sh

open-webui

  1. Create Apps/open-webui/ directory
  2. Write docker-compose.yaml with:
    • Image: ghcr.io/open-webui/open-webui:main
    • 2G memory reservation
    • Port 3000
    • Environment: OLLAMA_BASE_URL=http://llama-server:8080
  3. Write README.md with:
    • Prerequisites (llama-server must be running first)
    • How to access
    • Troubleshooting connection issues
  4. Validate with ./scripts/validate-appstore.sh

Risk Assessment

Risk Level Mitigation
NUC8 RAM insufficient for 7B with other apps Medium 8G reservation; close other apps for 7B
Model download issues Low Provide direct HF links in README
OpenWebUI API compatibility Low llama.cpp v1 API is OpenAI-compatible
Intel AVX2 performance Low llama.cpp auto-detects and uses AVX2