42a5d231b8
- Add AGENTS.md section 11 documenting mimir bot user for Gitea - Store token via tea logins system with repository:write and user:read scopes - Document common tea commands for branch/PR creation and issue management - Enable agents to create branches, commits and PRs via tea-CLI
5.6 KiB
5.6 KiB
Plan: Local LLM Zima App (Intel NUC8)
Context
- Hardware: Intel NUC8 i7, 16GB RAM, 500GB SSD
- Goal: Zima app for local LLM inference with web UI
- Constraints: Intel Iris GPU cannot be used for LLM offload; CPU-only inference
- Decisions:
- Include OpenWebUI (two-container solution)
- 8G memory reservation (allows 7B Q4 models)
- App name:
llama-server
Technology Decision
vLLM — REJECTED
- Requires NVIDIA CUDA GPU
- Cannot run on Intel NUC
llama.cpp (llama-server) — SELECTED
- CPU-only, AVX2/AVX512 optimized
- Built-in REST API server
- Minimal footprint, fast for quantized models
- Best fit for NUC8 constraints
LocalAI — BACKUP OPTION
- More features (TTS, image gen, multi-model)
- Can backend to llama.cpp
- Heavier; only choose if extra features needed
OpenWebUI — RECOMMENDED COMPANION
- Modern chat UI for LLM
- Docker-based, easy to deploy alongside
- Can be separate Zima app or documented companion
Architecture: Two Zima Apps
┌─────────────────────────┐ ┌─────────────────────────┐
│ llama-server │ │ open-webui │
│ - REST API :8080 │────▶│ - Chat UI :3000 │
│ - Serves model │ │ - Connects to LLM API │
└─────────────────────────┘ └─────────────────────────┘
Both are separate Zima apps, deployed independently. OpenWebUI references http://llama-server:8080 via Docker internal networking.
App 1: llama-server
- Container:
ghcr.io/ggerganov/llama.cpp:server - Port: 8080
- Memory: 8G reservation
App 2: open-webui
- Container:
ghcr.io/open-webui/open-webui:main - Port: 3000
- Memory: 2G reservation
- Environment:
OLLAMA_BASE_URL=http://llama-server:8080
App: llama-server
Container: ghcr.io/ggerganov/llama.cpp:server
Environment Variables:
| Variable | Default | Description |
|---|---|---|
MODEL |
(required) | Model filename in /models |
CTX_SIZE |
2048 | Context window size |
N_THREADS |
auto | CPU threads (auto = all) |
HOST |
0.0.0.0 | Listen address |
PORT |
8080 | API port |
MAX_TOKENS |
512 | Max tokens to generate |
Volumes:
| Container | Description |
|---|---|
/models |
Model files (GGUF format) |
/DATA/AppData/$AppID/logs |
Server logs |
Ports:
| Container | Protocol | Description |
|---|---|---|
| 8080 | TCP | llama.cpp REST API |
Resources:
- Memory reservation: 8G (allows 7B Q4 models)
Security:
security_opt: no-new-privileges:truecap_drop: ALL- No privileged needed (CPU-only)
Model Download (Documented in README)
Users download models manually:
# Example: Download Llama 3.2 3B Q4_K_M
curl -L -o /DATA/AppData/llama-server/models/llama-3.2-3b-q4_k_m.gguf \
"https://huggingface.co/QuantFactory/Llama-3.2-3B-Instruct-GGUF/resolve/main/Llama-3.2-3B-Instruct.Q4_K_M.gguf"
Recommended Models for 16GB RAM:
| Model | Size | Quant | RAM Needed | Speed (est) |
|---|---|---|---|---|
| Llama 3.2 3B | 1.8GB | Q4_K_M | ~4GB | ~15-20 tok/s |
| Phi-3.5 Mini 3B | 1.8GB | Q4_K_M | ~4GB | ~15-20 tok/s |
| Mistral 7B | 4.1GB | Q4_K_M | ~6-7GB | ~8-12 tok/s |
| Qwen 2.5 7B | 4.4GB | Q4_K_M | ~6-7GB | ~8-12 tok/s |
App: open-webui
Container: ghcr.io/open-webui/open-webui:main
Environment Variables:
| Variable | Default | Description |
|---|---|---|
OLLAMA_BASE_URL |
http://llama-server:8080 | LLM API endpoint |
WEBUI_PORT |
3000 | Web UI port |
Ports:
| Container | Protocol | Description |
|---|---|---|
| 3000 | TCP | OpenWebUI |
Resources:
- Memory reservation: 2G
Notes:
- Connects to
http://llama-server:8080via Docker internal networking - Requires
llama-serverapp to be running first
File Structure
Apps/llama-server/
├── docker-compose.yaml
├── README.md
└── HOW_TO_VERIFY.md (optional)
Apps/open-webui/
├── docker-compose.yaml
├── README.md
└── HOW_TO_VERIFY.md (optional)
Implementation Steps
llama-server
- Create
Apps/llama-server/directory - Write
docker-compose.yamlwith:- Image:
ghcr.io/ggerganov/llama.cpp:server - 8G memory reservation
- Port 8080
- Model volume at
/models - Env vars: MODEL, CTX_SIZE, N_THREADS, HOST, PORT
- Image:
- Write
README.mdwith:- Model download instructions
- First-run setup
- API testing examples
- Performance tips for NUC8
- Validate with
./scripts/validate-appstore.sh
open-webui
- Create
Apps/open-webui/directory - Write
docker-compose.yamlwith:- Image:
ghcr.io/open-webui/open-webui:main - 2G memory reservation
- Port 3000
- Environment:
OLLAMA_BASE_URL=http://llama-server:8080
- Image:
- Write
README.mdwith:- Prerequisites (llama-server must be running first)
- How to access
- Troubleshooting connection issues
- Validate with
./scripts/validate-appstore.sh
Risk Assessment
| Risk | Level | Mitigation |
|---|---|---|
| NUC8 RAM insufficient for 7B with other apps | Medium | 8G reservation; close other apps for 7B |
| Model download issues | Low | Provide direct HF links in README |
| OpenWebUI API compatibility | Low | llama.cpp v1 API is OpenAI-compatible |
| Intel AVX2 performance | Low | llama.cpp auto-detects and uses AVX2 |