Add Gitea bot (mimir) configuration for tea-CLI operations

- Add AGENTS.md section 11 documenting mimir bot user for Gitea
- Store token via tea logins system with repository:write and user:read scopes
- Document common tea commands for branch/PR creation and issue management
- Enable agents to create branches, commits and PRs via tea-CLI
2026-04-20 13:24:57 +02:00
parent 231aba08b0
commit 42a5d231b8
5 changed files with 599 additions and 1 deletions
@@ -0,0 +1,192 @@
+# Plan: Local LLM Zima App (Intel NUC8)
+
+## Context
+- **Hardware**: Intel NUC8 i7, 16GB RAM, 500GB SSD
+- **Goal**: Zima app for local LLM inference with web UI
+- **Constraints**: Intel Iris GPU cannot be used for LLM offload; CPU-only inference
+- **Decisions**: 
+  - Include OpenWebUI (two-container solution)
+  - 8G memory reservation (allows 7B Q4 models)
+  - App name: `llama-server`
+
+---
+
+## Technology Decision
+
+### vLLM — **REJECTED**
+- Targets GPU accelerators (primarily NVIDIA CUDA)
+- Not a practical fit for CPU-only inference on the Intel NUC
+
+### llama.cpp (llama-server) — **SELECTED**
+- CPU-only, AVX2/AVX512 optimized
+- Built-in REST API server
+- Minimal footprint, fast for quantized models
+- Best fit for NUC8 constraints
+
+### LocalAI — **BACKUP OPTION**
+- More features (TTS, image gen, multi-model)
+- Can backend to llama.cpp
+- Heavier; only choose if extra features needed
+
+### OpenWebUI — **RECOMMENDED COMPANION**
+- Modern chat UI for LLM
+- Docker-based, easy to deploy alongside
+- Deployed as a separate Zima app (see Architecture below)
+
+---
+
+## Architecture: Two Zima Apps
+
+```
+┌─────────────────────────┐     ┌─────────────────────────┐
+│  llama-server           │     │  open-webui             │
+│  - REST API :8080       │────▶│  - Chat UI :3000        │
+│  - Serves model         │     │  - Connects to LLM API  │
+└─────────────────────────┘     └─────────────────────────┘
+```
+
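+Both are separate Zima apps, deployed independently. OpenWebUI reaches the API at `http://llama-server:8080` over Docker's internal networking, which only resolves the `llama-server` hostname when both containers are attached to the same user-defined Docker network.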
+
+### App 1: `llama-server`
+- Container: `ghcr.io/ggerganov/llama.cpp:server`
+- Port: 8080
+- Memory: 8G reservation
+
+### App 2: `open-webui`
+- Container: `ghcr.io/open-webui/open-webui:main`
+- Port: 3000
+- Memory: 2G reservation
+- Environment: `OPENAI_API_BASE_URL=http://llama-server:8080/v1` (llama-server speaks the OpenAI-compatible API rather than the Ollama API)
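+
+A minimal sketch of how the two independently deployed apps could share a network so that the `llama-server` hostname resolves. The network name `llm-net` is hypothetical, and whether ZimaOS already places apps on a common network is not confirmed here; treat it as an assumption to verify.
+
+```yaml
+# Fragment for Apps/llama-server/docker-compose.yaml (illustrative only)
+services:
+  llama-server:
+    container_name: llama-server   # name other containers use to reach the API
+    networks:
+      - llm-net
+
+networks:
+  llm-net:                         # hypothetical name, not from the plan
+    external: true                 # must already exist: docker network create llm-net
+```
+
+The `open-webui` compose file would carry the same `networks` stanza so that both containers join `llm-net`.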
+
+---
+
+## App: `llama-server`
+
+### Container: `ghcr.io/ggerganov/llama.cpp:server`
+
+**Environment Variables**:
+| Variable | Default | Description |
+|----------|---------|-------------|
+| `MODEL` | (required) | Model filename in `/models` |
+| `CTX_SIZE` | 2048 | Context window size |
+| `N_THREADS` | auto | CPU threads (auto = all) |
+| `HOST` | 0.0.0.0 | Listen address |
+| `PORT` | 8080 | API port |
+| `MAX_TOKENS` | 512 | Max tokens to generate |
+
+**Volumes**:
+| Path | Side | Description |
+|------|------|-------------|
+| `/models` | Container | Model files (GGUF format) |
+| `/DATA/AppData/$AppID/logs` | Host | Server logs |
+
+**Ports**:
+| Container | Protocol | Description |
+|-----------|----------|-------------|
+| 8080 | TCP | llama.cpp REST API |
+
+**Resources**:
+- Memory reservation: **8G** (allows 7B Q4 models)
+
+**Security**:
+- `security_opt: no-new-privileges:true`
+- `cap_drop: ALL`
+- No privileged mode needed (CPU-only)
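+
+The settings above could be combined roughly as follows. This is a sketch under assumptions, not a validated app store manifest: it assumes the plan's variables (`MODEL`, `CTX_SIZE`, `MAX_TOKENS`) are mapped onto `llama-server` command-line flags through Compose interpolation, and it takes the host model path from the download example in this plan.
+
+```yaml
+# Sketch for Apps/llama-server/docker-compose.yaml (illustrative only)
+name: llama-server
+services:
+  llama-server:
+    image: ghcr.io/ggerganov/llama.cpp:server
+    container_name: llama-server
+    restart: unless-stopped
+    # Assumption: plan variables are passed as llama-server flags.
+    # N_THREADS=auto is modelled by omitting --threads entirely.
+    command:
+      - "--model"
+      - "/models/${MODEL:?set MODEL to a GGUF filename}"
+      - "--ctx-size"
+      - "${CTX_SIZE:-2048}"
+      - "--n-predict"
+      - "${MAX_TOKENS:-512}"
+      - "--host"
+      - "0.0.0.0"
+      - "--port"
+      - "8080"
+    ports:
+      - "8080:8080"
+    volumes:
+      - /DATA/AppData/llama-server/models:/models
+    deploy:
+      resources:
+        reservations:
+          memory: 8G
+    security_opt:
+      - no-new-privileges:true
+    cap_drop:
+      - ALL
+```
+
+The `${MODEL:?...}` form makes Compose fail early with a readable message when no model filename is set, which matches the "(required)" default in the table.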
+
+### Model Download (Documented in README)
+Users download models manually:
+```bash
+# Example: Download Llama 3.2 3B Q4_K_M
+curl -L -o /DATA/AppData/llama-server/models/llama-3.2-3b-q4_k_m.gguf \
+  "https://huggingface.co/QuantFactory/Llama-3.2-3B-Instruct-GGUF/resolve/main/Llama-3.2-3B-Instruct.Q4_K_M.gguf"
+```
+
+**Recommended Models for 16GB RAM**:
+| Model | Size | Quant | RAM Needed | Speed (est) |
+|-------|------|-------|------------|-------------|
+| Llama 3.2 3B | 1.8GB | Q4_K_M | ~4GB | ~15-20 tok/s |
+| Phi-3.5 Mini 3.8B | ~2.2GB | Q4_K_M | ~4GB | ~15-20 tok/s |
+| Mistral 7B | 4.1GB | Q4_K_M | ~6-7GB | ~8-12 tok/s |
+| Qwen 2.5 7B | 4.4GB | Q4_K_M | ~6-7GB | ~8-12 tok/s |
+
+---
+
+## App: `open-webui`
+
+### Container: `ghcr.io/open-webui/open-webui:main`
+
+**Environment Variables**:
+| Variable | Default | Description |
+|----------|---------|-------------|
+| `OPENAI_API_BASE_URL` | http://llama-server:8080/v1 | LLM API endpoint (OpenAI-compatible) |
+| `WEBUI_PORT` | 3000 | Web UI port on the host (the image listens on 8080 inside the container by default) |
+
+**Ports**:
+| Container | Protocol | Description |
+|-----------|----------|-------------|
+| 3000 | TCP | OpenWebUI |
+
+**Resources**:
+- Memory reservation: **2G**
+
+**Notes**:
+- Connects to `http://llama-server:8080` via Docker internal networking
+- Requires the `llama-server` app to be running first
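+
+A possible compose sketch for this app. It assumes the OpenAI-compatible endpoint is used (`OPENAI_API_BASE_URL` with the `/v1` suffix), since the Risk Assessment notes that the llama.cpp API is OpenAI-compatible. The host data path is hypothetical, and the `3000:8080` mapping assumes the image keeps its default internal port.
+
+```yaml
+# Sketch for Apps/open-webui/docker-compose.yaml (illustrative only)
+name: open-webui
+services:
+  open-webui:
+    image: ghcr.io/open-webui/open-webui:main
+    container_name: open-webui
+    restart: unless-stopped
+    environment:
+      OPENAI_API_BASE_URL: http://llama-server:8080/v1
+    ports:
+      - "3000:8080"   # host 3000 -> container 8080 (assumed image default)
+    volumes:
+      # Hypothetical host path; /app/backend/data holds chats and settings
+      - /DATA/AppData/open-webui/data:/app/backend/data
+    deploy:
+      resources:
+        reservations:
+          memory: 2G
+```
+
+Because the two apps are separate compose projects, `depends_on` cannot express the start order; the README prerequisite remains the practical safeguard.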
+
+---
+
+## File Structure
+```
+Apps/llama-server/
+├── docker-compose.yaml
+├── README.md
+└── HOW_TO_VERIFY.md (optional)
+
+Apps/open-webui/
+├── docker-compose.yaml
+├── README.md
+└── HOW_TO_VERIFY.md (optional)
+```
+
+---
+
+## Implementation Steps
+
+### llama-server
+1. Create `Apps/llama-server/` directory
+2. Write `docker-compose.yaml` with:
+   - Image: `ghcr.io/ggerganov/llama.cpp:server`
+   - 8G memory reservation
+   - Port 8080
+   - Model volume at `/models`
+   - Env vars: MODEL, CTX_SIZE, N_THREADS, HOST, PORT, MAX_TOKENS
+3. Write `README.md` with:
+   - Model download instructions
+   - First-run setup
+   - API testing examples
+   - Performance tips for NUC8
+4. Validate with `./scripts/validate-appstore.sh`
+
+### open-webui
+1. Create `Apps/open-webui/` directory
+2. Write `docker-compose.yaml` with:
+   - Image: `ghcr.io/open-webui/open-webui:main`
+   - 2G memory reservation
+   - Port 3000
+   - Environment: `OPENAI_API_BASE_URL=http://llama-server:8080/v1`
+3. Write `README.md` with:
+   - Prerequisites (llama-server must be running first)
+   - How to access
+   - Troubleshooting connection issues
+4. Validate with `./scripts/validate-appstore.sh`
+
+---
+
+## Risk Assessment
+
+| Risk | Level | Mitigation |
+|------|-------|------------|
+| NUC8 RAM insufficient for 7B with other apps | Medium | 8G reservation; close other apps for 7B |
+| Model download issues | Low | Provide direct HF links in README |
+| OpenWebUI API compatibility | Low | llama-server exposes OpenAI-compatible `/v1` endpoints |
+| Intel AVX2 performance | Low | llama.cpp auto-detects and uses AVX2 |