# Plan: Local LLM Zima App (Intel NUC8)

## Context

- **Hardware**: Intel NUC8 i7, 16GB RAM, 500GB SSD
- **Goal**: Zima app for local LLM inference with web UI
- **Constraints**: Intel Iris GPU cannot be used for LLM offload; CPU-only inference
- **Decisions**:
  - Include OpenWebUI (two-container solution)
  - 8G memory reservation (allows 7B Q4 models)
  - App name: `llama-server`

---

## Technology Decision

### vLLM: **REJECTED**

- Targets NVIDIA CUDA GPUs
- Not a practical fit for the CPU-only NUC8

### llama.cpp (llama-server): **SELECTED**

- Runs CPU-only with AVX2/AVX512-optimized kernels
- Built-in REST API server
- Minimal footprint, fast for quantized models
- Best fit for NUC8 constraints

### LocalAI: **BACKUP OPTION**

- More features (TTS, image gen, multi-model)
- Can use llama.cpp as a backend
- Heavier; only choose if the extra features are needed

### OpenWebUI: **RECOMMENDED COMPANION**

- Modern chat UI for LLMs
- Docker-based, easy to deploy alongside
- Can be a separate Zima app or a documented companion

---

## Architecture: Two Zima Apps

```
┌─────────────────────────┐     ┌─────────────────────────┐
│ llama-server            │     │ open-webui              │
│ - REST API :8080        │────▶│ - Chat UI :3000         │
│ - Serves model          │     │ - Connects to LLM API   │
└─────────────────────────┘     └─────────────────────────┘
```

Both are separate Zima apps, deployed independently. OpenWebUI reaches the API at `http://llama-server:8080` via Docker internal networking, which requires both containers to be attached to the same Docker network.

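Because the two apps are deployed independently, the hostname `llama-server` only resolves from `open-webui` when both containers share a user-defined Docker network. The following is a minimal sketch of one way to arrange that from the host shell. The network name `llm-net` is a hypothetical choice, and the container names are assumed to match the app names; neither is fixed by this plan.

```bash
#!/usr/bin/env bash
# Sketch only: "llm-net" is an assumed network name, not part of the plan.
LLM_NETWORK="${LLM_NETWORK:-llm-net}"

# Build the OpenAI-compatible base URL that open-webui would point at.
api_base_url() {
  local host="$1" port="$2"
  printf 'http://%s:%s/v1\n' "$host" "$port"
}

# Create the shared network only if it does not already exist.
ensure_network() {
  docker network inspect "$LLM_NETWORK" >/dev/null 2>&1 \
    || docker network create "$LLM_NETWORK" >/dev/null
}

# Attach a running container to the shared network.
# Errors (for example "already connected") are ignored on purpose.
attach() {
  docker network connect "$LLM_NETWORK" "$1" 2>/dev/null || true
}

# Usage on the NUC (needs a running Docker daemon and both containers up):
#   ensure_network
#   attach llama-server
#   attach open-webui
#   curl -fsS http://localhost:8080/health   # assumes port 8080 is published on the host
```

Declaring the same network as `external` in both compose files would achieve the same result declaratively; the functions above are just the quickest way to test name resolution before committing to that layout.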
### App 1: `llama-server`

- Container: `ghcr.io/ggerganov/llama.cpp:server`
- Port: 8080
- Memory: 8G reservation

### App 2: `open-webui`

- Container: `ghcr.io/open-webui/open-webui:main`
- Port: 3000
- Memory: 2G reservation
- Environment: `OPENAI_API_BASE_URL=http://llama-server:8080/v1`

---

## App: `llama-server`

### Container: `ghcr.io/ggerganov/llama.cpp:server`

**Environment Variables**:

| Variable | Default | Description |
|----------|---------|-------------|
| `MODEL` | (required) | Model filename in `/models` |
| `CTX_SIZE` | 2048 | Context window size (tokens) |
| `N_THREADS` | auto | CPU threads (auto = all) |
| `HOST` | 0.0.0.0 | Listen address |
| `PORT` | 8080 | API port |
| `MAX_TOKENS` | 512 | Max tokens to generate |

**Volumes**:

| Container | Description |
|-----------|-------------|
| `/models` | Model files (GGUF format) |
| `/DATA/AppData/$AppID/logs` | Server logs |

**Ports**:

| Container | Protocol | Description |
|-----------|----------|-------------|
| 8080 | TCP | llama.cpp REST API |

**Resources**:

- Memory reservation: **8G** (allows 7B Q4 models)

**Security**:

- `security_opt: no-new-privileges:true`
- `cap_drop: ALL`
- No privileged mode needed (CPU-only)

### Model Download (Documented in README)

Users download models manually:

```bash
# Example: Download Llama 3.2 3B Q4_K_M
curl -L -o /DATA/AppData/llama-server/models/llama-3.2-3b-q4_k_m.gguf \
  "https://huggingface.co/QuantFactory/Llama-3.2-3B-Instruct-GGUF/resolve/main/Llama-3.2-3B-Instruct.Q4_K_M.gguf"
```

**Recommended Models for 16GB RAM**:

| Model | File Size | Quant | RAM Needed | Speed (est.) |
|-------|-----------|-------|------------|--------------|
| Llama 3.2 3B | 1.8GB | Q4_K_M | ~4GB | ~15-20 tok/s |
| Phi-3.5 Mini 3.8B | ~2.4GB | Q4_K_M | ~4GB | ~15-20 tok/s |
| Mistral 7B | 4.1GB | Q4_K_M | ~6-7GB | ~8-12 tok/s |
| Qwen 2.5 7B | 4.4GB | Q4_K_M | ~6-7GB | ~8-12 tok/s |

---

## App: `open-webui`

### Container: `ghcr.io/open-webui/open-webui:main`

**Environment Variables**:

| Variable | Default | Description |
|----------|---------|-------------|
| `OPENAI_API_BASE_URL` | http://llama-server:8080/v1 | OpenAI-compatible LLM API endpoint |
| `WEBUI_PORT` | 3000 | Web UI port |

**Ports**:

| Container | Protocol | Description |
|-----------|----------|-------------|
| 3000 | TCP | OpenWebUI |

**Resources**:

- Memory reservation: **2G**

**Notes**:

- Connects to `http://llama-server:8080/v1` via Docker internal networking
- Requires the `llama-server` app to be running first

---

## File Structure

```
Apps/llama-server/
├── docker-compose.yaml
├── README.md
└── HOW_TO_VERIFY.md (optional)

Apps/open-webui/
├── docker-compose.yaml
├── README.md
└── HOW_TO_VERIFY.md (optional)
```

---

## Implementation Steps

### llama-server

1. Create `Apps/llama-server/` directory
2. Write `docker-compose.yaml` with:
   - Image: `ghcr.io/ggerganov/llama.cpp:server`
   - 8G memory reservation
   - Port 8080
   - Model volume at `/models`
   - Env vars: `MODEL`, `CTX_SIZE`, `N_THREADS`, `HOST`, `PORT`, `MAX_TOKENS`
3. Write `README.md` with:
   - Model download instructions
   - First-run setup
   - API testing examples
   - Performance tips for NUC8
4. Validate with `./scripts/validate-appstore.sh`

### open-webui

1. Create `Apps/open-webui/` directory
2. Write `docker-compose.yaml` with:
   - Image: `ghcr.io/open-webui/open-webui:main`
   - 2G memory reservation
   - Port 3000
   - Environment: `OPENAI_API_BASE_URL=http://llama-server:8080/v1`
3. Write `README.md` with:
   - Prerequisites (llama-server must be running first)
   - How to access
   - Troubleshooting connection issues
4. Validate with `./scripts/validate-appstore.sh`

---

## Risk Assessment

| Risk | Level | Mitigation |
|------|-------|------------|
| NUC8 RAM insufficient for 7B with other apps | Medium | 8G reservation; stop other apps when running 7B models |
| Model download issues | Low | Provide direct HF links in README |
| OpenWebUI API compatibility | Low | llama.cpp serves an OpenAI-compatible API under `/v1` |
| Intel AVX2 performance | Low | llama.cpp auto-detects and uses AVX2 |
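
The RAM risk above can be sanity-checked with simple arithmetic before downloading a model. The sketch below estimates model file size plus KV cache plus runtime overhead and compares the total with the 8G reservation. It assumes an f16 KV cache and uses Mistral 7B's published shape (32 layers, 8 KV heads, head dimension 128); the 1024 MiB overhead is a placeholder assumption, not a measured value, and the function names are illustrative only.

```bash
#!/usr/bin/env bash
# Rough RAM budget check for one model. All figures are estimates.

# KV cache size in MiB, assuming an f16 cache:
# 2 (K and V) * layers * kv_heads * head_dim * 2 bytes * context tokens.
kv_cache_mib() {
  local layers="$1" kv_heads="$2" head_dim="$3" ctx="$4"
  echo $(( 2 * layers * kv_heads * head_dim * 2 * ctx / 1048576 ))
}

# Total estimate in MiB: model file + KV cache + assumed runtime overhead.
total_mib() {
  local model_mib="$1" kv_mib="$2" overhead_mib="$3"
  echo $(( model_mib + kv_mib + overhead_mib ))
}

# Exit status 0 when the estimate fits inside the reservation.
fits_reservation() {
  local total="$1" reservation_mib="$2"
  [ "$total" -le "$reservation_mib" ]
}

# Mistral 7B shape with the default CTX_SIZE of 2048.
kv=$(kv_cache_mib 32 8 128 2048)
# ~4200 MiB approximates the 4.1GB Q4_K_M file; 1024 MiB overhead is assumed.
total=$(total_mib 4200 "$kv" 1024)

if fits_reservation "$total" 8192; then
  echo "estimated ${total} MiB: fits the 8G reservation"
else
  echo "estimated ${total} MiB: exceeds the 8G reservation"
fi
```

The estimate grows linearly with `CTX_SIZE`, so raising the context window is the first thing to recheck if a 7B model starts swapping.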