# Plan: Local LLM Zima App (Intel NUC8)

## Context
- **Hardware**: Intel NUC8 i7, 16GB RAM, 500GB SSD
- **Goal**: Zima app for local LLM inference with web UI
- **Constraints**: Intel Iris GPU cannot be used for LLM offload; CPU-only inference
- **Decisions**: 
  - Include OpenWebUI (two-container solution)
  - 8G memory reservation (allows 7B Q4 models)
  - App name: `llama-server`

---

## Technology Decision

### vLLM — **REJECTED**
- Requires NVIDIA CUDA GPU
- Cannot run on Intel NUC

### llama.cpp (llama-server) — **SELECTED**
- CPU-only, AVX2/AVX512 optimized
- Built-in REST API server
- Minimal footprint, fast for quantized models
- Best fit for NUC8 constraints

### LocalAI — **BACKUP OPTION**
- More features (TTS, image gen, multi-model)
- Can backend to llama.cpp
- Heavier; only choose if extra features needed

### OpenWebUI — **RECOMMENDED COMPANION**
- Modern chat UI for LLM
- Docker-based, easy to deploy alongside
- Can be separate Zima app or documented companion

---

## Architecture: Two Zima Apps

```
┌─────────────────────────┐     ┌─────────────────────────┐
│  llama-server           │     │  open-webui             │
│  - REST API :8080        │────▶│  - Chat UI :3000        │
│  - Serves model          │     │  - Connects to LLM API  │
└─────────────────────────┘     └─────────────────────────┘
```

Both are separate Zima apps, deployed independently. OpenWebUI references `http://llama-server:8080` via Docker internal networking.

### App 1: `llama-server`
- Container: `ghcr.io/ggerganov/llama.cpp:server`
- Port: 8080
- Memory: 8G reservation

### App 2: `open-webui`
- Container: `ghcr.io/open-webui/open-webui:main`
- Port: 3000
- Memory: 2G reservation
- Environment: `OLLAMA_BASE_URL=http://llama-server:8080`

---

## App: `llama-server`

### Container: `ghcr.io/ggerganov/llama.cpp:server`

**Environment Variables**:
| Variable | Default | Description |
|----------|---------|-------------|
| `MODEL` | (required) | Model filename in `/models` |
| `CTX_SIZE` | 2048 | Context window size |
| `N_THREADS` | auto | CPU threads (auto = all) |
| `HOST` | 0.0.0.0 | Listen address |
| `PORT` | 8080 | API port |
| `MAX_TOKENS` | 512 | Max tokens to generate |

**Volumes**:
| Container | Description |
|-----------|-------------|
| `/models` | Model files (GGUF format) |
| `/DATA/AppData/$AppID/logs` | Server logs |

**Ports**:
| Container | Protocol | Description |
|-----------|----------|-------------|
| 8080 | TCP | llama.cpp REST API |

**Resources**:
- Memory reservation: **8G** (allows 7B Q4 models)

**Security**:
- `security_opt: no-new-privileges:true`
- `cap_drop: ALL`
- No privileged needed (CPU-only)

### Model Download (Documented in README)
Users download models manually:
```bash
# Example: Download Llama 3.2 3B Q4_K_M
curl -L -o /DATA/AppData/llama-server/models/llama-3.2-3b-q4_k_m.gguf \
  "https://huggingface.co/QuantFactory/Llama-3.2-3B-Instruct-GGUF/resolve/main/Llama-3.2-3B-Instruct.Q4_K_M.gguf"
```

**Recommended Models for 16GB RAM**:
| Model | Size | Quant | RAM Needed | Speed (est) |
|-------|------|-------|------------|-------------|
| Llama 3.2 3B | 1.8GB | Q4_K_M | ~4GB | ~15-20 tok/s |
| Phi-3.5 Mini 3B | 1.8GB | Q4_K_M | ~4GB | ~15-20 tok/s |
| Mistral 7B | 4.1GB | Q4_K_M | ~6-7GB | ~8-12 tok/s |
| Qwen 2.5 7B | 4.4GB | Q4_K_M | ~6-7GB | ~8-12 tok/s |

---

## App: `open-webui`

### Container: `ghcr.io/open-webui/open-webui:main`

**Environment Variables**:
| Variable | Default | Description |
|----------|---------|-------------|
| `OLLAMA_BASE_URL` | http://llama-server:8080 | LLM API endpoint |
| `WEBUI_PORT` | 3000 | Web UI port |

**Ports**:
| Container | Protocol | Description |
|-----------|----------|-------------|
| 3000 | TCP | OpenWebUI |

**Resources**:
- Memory reservation: **2G**

**Notes**:
- Connects to `http://llama-server:8080` via Docker internal networking
- Requires `llama-server` app to be running first

---

## File Structure
```
Apps/llama-server/
├── docker-compose.yaml
├── README.md
└── HOW_TO_VERIFY.md (optional)

Apps/open-webui/
├── docker-compose.yaml
├── README.md
└── HOW_TO_VERIFY.md (optional)
```

---

## Implementation Steps

### llama-server
1. Create `Apps/llama-server/` directory
2. Write `docker-compose.yaml` with:
   - Image: `ghcr.io/ggerganov/llama.cpp:server`
   - 8G memory reservation
   - Port 8080
   - Model volume at `/models`
   - Env vars: MODEL, CTX_SIZE, N_THREADS, HOST, PORT
3. Write `README.md` with:
   - Model download instructions
   - First-run setup
   - API testing examples
   - Performance tips for NUC8
4. Validate with `./scripts/validate-appstore.sh`

### open-webui
1. Create `Apps/open-webui/` directory
2. Write `docker-compose.yaml` with:
   - Image: `ghcr.io/open-webui/open-webui:main`
   - 2G memory reservation
   - Port 3000
   - Environment: `OLLAMA_BASE_URL=http://llama-server:8080`
3. Write `README.md` with:
   - Prerequisites (llama-server must be running first)
   - How to access
   - Troubleshooting connection issues
4. Validate with `./scripts/validate-appstore.sh`

---

## Risk Assessment

| Risk | Level | Mitigation |
|------|-------|------------|
| NUC8 RAM insufficient for 7B with other apps | Medium | 8G reservation; close other apps for 7B |
| Model download issues | Low | Provide direct HF links in README |
| OpenWebUI API compatibility | Low | llama.cpp v1 API is OpenAI-compatible |
| Intel AVX2 performance | Low | llama.cpp auto-detects and uses AVX2 |