Files

T

Joachim Friberg 42a5d231b8 Add Gitea bot (mimir) configuration for tea-CLI operations

- Add AGENTS.md section 11 documenting mimir bot user for Gitea
- Store token via tea logins system with repository:write and user:read scopes
- Document common tea commands for branch/PR creation and issue management
- Enable agents to create branches, commits and PRs via tea-CLI

2026-04-20 13:24:57 +02:00

5.6 KiB

Raw Blame History

Plan: Local LLM Zima App (Intel NUC8)

Context

Hardware: Intel NUC8 i7, 16GB RAM, 500GB SSD
Goal: Zima app for local LLM inference with web UI
Constraints: Intel Iris GPU cannot be used for LLM offload; CPU-only inference
Decisions:
- Include OpenWebUI (two-container solution)
- 8G memory reservation (allows 7B Q4 models)
- App name: llama-server

Technology Decision

vLLM — REJECTED

Requires NVIDIA CUDA GPU
Cannot run on Intel NUC

llama.cpp (llama-server) — SELECTED

CPU-only, AVX2/AVX512 optimized
Built-in REST API server
Minimal footprint, fast for quantized models
Best fit for NUC8 constraints

LocalAI — BACKUP OPTION

More features (TTS, image gen, multi-model)
Can backend to llama.cpp
Heavier; only choose if extra features needed

OpenWebUI — RECOMMENDED COMPANION

Modern chat UI for LLM
Docker-based, easy to deploy alongside
Can be separate Zima app or documented companion

Architecture: Two Zima Apps

┌─────────────────────────┐     ┌─────────────────────────┐
│  llama-server           │     │  open-webui             │
│  - REST API :8080        │────▶│  - Chat UI :3000        │
│  - Serves model          │     │  - Connects to LLM API  │
└─────────────────────────┘     └─────────────────────────┘

Both are separate Zima apps, deployed independently. OpenWebUI references http://llama-server:8080 via Docker internal networking.

App 1: `llama-server`

Container: ghcr.io/ggerganov/llama.cpp:server
Port: 8080
Memory: 8G reservation

App 2: `open-webui`

Container: ghcr.io/open-webui/open-webui:main
Port: 3000
Memory: 2G reservation
Environment: OLLAMA_BASE_URL=http://llama-server:8080

App: `llama-server`

Container: `ghcr.io/ggerganov/llama.cpp:server`

Environment Variables:

Variable	Default	Description
`MODEL`	(required)	Model filename in `/models`
`CTX_SIZE`	2048	Context window size
`N_THREADS`	auto	CPU threads (auto = all)
`HOST`	0.0.0.0	Listen address
`PORT`	8080	API port
`MAX_TOKENS`	512	Max tokens to generate

Volumes:

Container	Description
`/models`	Model files (GGUF format)
`/DATA/AppData/$AppID/logs`	Server logs

Ports:

Container	Protocol	Description
8080	TCP	llama.cpp REST API

Resources:

Memory reservation: 8G (allows 7B Q4 models)

Security:

security_opt: no-new-privileges:true
cap_drop: ALL
No privileged needed (CPU-only)

Model Download (Documented in README)

Users download models manually:

# Example: Download Llama 3.2 3B Q4_K_M
curl -L -o /DATA/AppData/llama-server/models/llama-3.2-3b-q4_k_m.gguf \
  "https://huggingface.co/QuantFactory/Llama-3.2-3B-Instruct-GGUF/resolve/main/Llama-3.2-3B-Instruct.Q4_K_M.gguf"

Recommended Models for 16GB RAM:

Model	Size	Quant	RAM Needed	Speed (est)
Llama 3.2 3B	1.8GB	Q4_K_M	~4GB	~15-20 tok/s
Phi-3.5 Mini 3B	1.8GB	Q4_K_M	~4GB	~15-20 tok/s
Mistral 7B	4.1GB	Q4_K_M	~6-7GB	~8-12 tok/s
Qwen 2.5 7B	4.4GB	Q4_K_M	~6-7GB	~8-12 tok/s

App: `open-webui`

Container: `ghcr.io/open-webui/open-webui:main`

Environment Variables:

Variable	Default	Description
`OLLAMA_BASE_URL`	http://llama-server:8080	LLM API endpoint
`WEBUI_PORT`	3000	Web UI port

Ports:

Container	Protocol	Description
3000	TCP	OpenWebUI

Resources:

Memory reservation: 2G

Notes:

Connects to http://llama-server:8080 via Docker internal networking
Requires llama-server app to be running first

File Structure

Apps/llama-server/
├── docker-compose.yaml
├── README.md
└── HOW_TO_VERIFY.md (optional)

Apps/open-webui/
├── docker-compose.yaml
├── README.md
└── HOW_TO_VERIFY.md (optional)

Implementation Steps

llama-server

Create Apps/llama-server/ directory
Write docker-compose.yaml with:
- Image: ghcr.io/ggerganov/llama.cpp:server
- 8G memory reservation
- Port 8080
- Model volume at /models
- Env vars: MODEL, CTX_SIZE, N_THREADS, HOST, PORT
Write README.md with:
- Model download instructions
- First-run setup
- API testing examples
- Performance tips for NUC8
Validate with ./scripts/validate-appstore.sh

open-webui

Create Apps/open-webui/ directory
Write docker-compose.yaml with:
- Image: ghcr.io/open-webui/open-webui:main
- 2G memory reservation
- Port 3000
- Environment: OLLAMA_BASE_URL=http://llama-server:8080
Write README.md with:
- Prerequisites (llama-server must be running first)
- How to access
- Troubleshooting connection issues
Validate with ./scripts/validate-appstore.sh

Risk Assessment

Risk	Level	Mitigation
NUC8 RAM insufficient for 7B with other apps	Medium	8G reservation; close other apps for 7B
Model download issues	Low	Provide direct HF links in README
OpenWebUI API compatibility	Low	llama.cpp v1 API is OpenAI-compatible
Intel AVX2 performance	Low	llama.cpp auto-detects and uses AVX2

5.6 KiB Raw Blame History