Add Gitea bot (mimir) configuration for tea-CLI operations
- Add AGENTS.md section 11 documenting mimir bot user for Gitea - Store token via tea logins system with repository:write and user:read scopes - Document common tea commands for branch/PR creation and issue management - Enable agents to create branches, commits and PRs via tea-CLI
This commit is contained in:
@@ -0,0 +1,192 @@
|
||||
# Plan: Local LLM Zima App (Intel NUC8)
|
||||
|
||||
## Context
|
||||
- **Hardware**: Intel NUC8 i7, 16GB RAM, 500GB SSD
|
||||
- **Goal**: Zima app for local LLM inference with web UI
|
||||
- **Constraints**: Intel Iris GPU cannot be used for LLM offload; CPU-only inference
|
||||
- **Decisions**:
|
||||
- Include OpenWebUI (two-container solution)
|
||||
- 8G memory reservation (allows 7B Q4 models)
|
||||
- App name: `llama-server`
|
||||
|
||||
---
|
||||
|
||||
## Technology Decision
|
||||
|
||||
### vLLM — **REJECTED**
|
||||
- Requires NVIDIA CUDA GPU
|
||||
- Cannot run on Intel NUC
|
||||
|
||||
### llama.cpp (llama-server) — **SELECTED**
|
||||
- CPU-only, AVX2/AVX512 optimized
|
||||
- Built-in REST API server
|
||||
- Minimal footprint, fast for quantized models
|
||||
- Best fit for NUC8 constraints
|
||||
|
||||
### LocalAI — **BACKUP OPTION**
|
||||
- More features (TTS, image gen, multi-model)
|
||||
- Can backend to llama.cpp
|
||||
- Heavier; only choose if extra features needed
|
||||
|
||||
### OpenWebUI — **RECOMMENDED COMPANION**
|
||||
- Modern chat UI for LLM
|
||||
- Docker-based, easy to deploy alongside
|
||||
- Can be separate Zima app or documented companion
|
||||
|
||||
---
|
||||
|
||||
## Architecture: Two Zima Apps
|
||||
|
||||
```
|
||||
┌─────────────────────────┐ ┌─────────────────────────┐
|
||||
│ llama-server │ │ open-webui │
|
||||
│ - REST API :8080 │────▶│ - Chat UI :3000 │
|
||||
│ - Serves model │ │ - Connects to LLM API │
|
||||
└─────────────────────────┘ └─────────────────────────┘
|
||||
```
|
||||
|
||||
Both are separate Zima apps, deployed independently. OpenWebUI references `http://llama-server:8080` via Docker internal networking.
|
||||
|
||||
### App 1: `llama-server`
|
||||
- Container: `ghcr.io/ggerganov/llama.cpp:server`
|
||||
- Port: 8080
|
||||
- Memory: 8G reservation
|
||||
|
||||
### App 2: `open-webui`
|
||||
- Container: `ghcr.io/open-webui/open-webui:main`
|
||||
- Port: 3000
|
||||
- Memory: 2G reservation
|
||||
- Environment: `OLLAMA_BASE_URL=http://llama-server:8080`
|
||||
|
||||
---
|
||||
|
||||
## App: `llama-server`
|
||||
|
||||
### Container: `ghcr.io/ggerganov/llama.cpp:server`
|
||||
|
||||
**Environment Variables**:
|
||||
| Variable | Default | Description |
|
||||
|----------|---------|-------------|
|
||||
| `MODEL` | (required) | Model filename in `/models` |
|
||||
| `CTX_SIZE` | 2048 | Context window size |
|
||||
| `N_THREADS` | auto | CPU threads (auto = all) |
|
||||
| `HOST` | 0.0.0.0 | Listen address |
|
||||
| `PORT` | 8080 | API port |
|
||||
| `MAX_TOKENS` | 512 | Max tokens to generate |
|
||||
|
||||
**Volumes**:
|
||||
| Container | Description |
|
||||
|-----------|-------------|
|
||||
| `/models` | Model files (GGUF format) |
|
||||
| `/DATA/AppData/$AppID/logs` | Server logs |
|
||||
|
||||
**Ports**:
|
||||
| Container | Protocol | Description |
|
||||
|-----------|----------|-------------|
|
||||
| 8080 | TCP | llama.cpp REST API |
|
||||
|
||||
**Resources**:
|
||||
- Memory reservation: **8G** (allows 7B Q4 models)
|
||||
|
||||
**Security**:
|
||||
- `security_opt: no-new-privileges:true`
|
||||
- `cap_drop: ALL`
|
||||
- No privileged needed (CPU-only)
|
||||
|
||||
### Model Download (Documented in README)
|
||||
Users download models manually:
|
||||
```bash
|
||||
# Example: Download Llama 3.2 3B Q4_K_M
|
||||
curl -L -o /DATA/AppData/llama-server/models/llama-3.2-3b-q4_k_m.gguf \
|
||||
"https://huggingface.co/QuantFactory/Llama-3.2-3B-Instruct-GGUF/resolve/main/Llama-3.2-3B-Instruct.Q4_K_M.gguf"
|
||||
```
|
||||
|
||||
**Recommended Models for 16GB RAM**:
|
||||
| Model | Size | Quant | RAM Needed | Speed (est) |
|
||||
|-------|------|-------|------------|-------------|
|
||||
| Llama 3.2 3B | 1.8GB | Q4_K_M | ~4GB | ~15-20 tok/s |
|
||||
| Phi-3.5 Mini 3B | 1.8GB | Q4_K_M | ~4GB | ~15-20 tok/s |
|
||||
| Mistral 7B | 4.1GB | Q4_K_M | ~6-7GB | ~8-12 tok/s |
|
||||
| Qwen 2.5 7B | 4.4GB | Q4_K_M | ~6-7GB | ~8-12 tok/s |
|
||||
|
||||
---
|
||||
|
||||
## App: `open-webui`
|
||||
|
||||
### Container: `ghcr.io/open-webui/open-webui:main`
|
||||
|
||||
**Environment Variables**:
|
||||
| Variable | Default | Description |
|
||||
|----------|---------|-------------|
|
||||
| `OLLAMA_BASE_URL` | http://llama-server:8080 | LLM API endpoint |
|
||||
| `WEBUI_PORT` | 3000 | Web UI port |
|
||||
|
||||
**Ports**:
|
||||
| Container | Protocol | Description |
|
||||
|-----------|----------|-------------|
|
||||
| 3000 | TCP | OpenWebUI |
|
||||
|
||||
**Resources**:
|
||||
- Memory reservation: **2G**
|
||||
|
||||
**Notes**:
|
||||
- Connects to `http://llama-server:8080` via Docker internal networking
|
||||
- Requires `llama-server` app to be running first
|
||||
|
||||
---
|
||||
|
||||
## File Structure
|
||||
```
|
||||
Apps/llama-server/
|
||||
├── docker-compose.yaml
|
||||
├── README.md
|
||||
└── HOW_TO_VERIFY.md (optional)
|
||||
|
||||
Apps/open-webui/
|
||||
├── docker-compose.yaml
|
||||
├── README.md
|
||||
└── HOW_TO_VERIFY.md (optional)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Implementation Steps
|
||||
|
||||
### llama-server
|
||||
1. Create `Apps/llama-server/` directory
|
||||
2. Write `docker-compose.yaml` with:
|
||||
- Image: `ghcr.io/ggerganov/llama.cpp:server`
|
||||
- 8G memory reservation
|
||||
- Port 8080
|
||||
- Model volume at `/models`
|
||||
- Env vars: MODEL, CTX_SIZE, N_THREADS, HOST, PORT
|
||||
3. Write `README.md` with:
|
||||
- Model download instructions
|
||||
- First-run setup
|
||||
- API testing examples
|
||||
- Performance tips for NUC8
|
||||
4. Validate with `./scripts/validate-appstore.sh`
|
||||
|
||||
### open-webui
|
||||
1. Create `Apps/open-webui/` directory
|
||||
2. Write `docker-compose.yaml` with:
|
||||
- Image: `ghcr.io/open-webui/open-webui:main`
|
||||
- 2G memory reservation
|
||||
- Port 3000
|
||||
- Environment: `OLLAMA_BASE_URL=http://llama-server:8080`
|
||||
3. Write `README.md` with:
|
||||
- Prerequisites (llama-server must be running first)
|
||||
- How to access
|
||||
- Troubleshooting connection issues
|
||||
4. Validate with `./scripts/validate-appstore.sh`
|
||||
|
||||
---
|
||||
|
||||
## Risk Assessment
|
||||
|
||||
| Risk | Level | Mitigation |
|
||||
|------|-------|------------|
|
||||
| NUC8 RAM insufficient for 7B with other apps | Medium | 8G reservation; close other apps for 7B |
|
||||
| Model download issues | Low | Provide direct HF links in README |
|
||||
| OpenWebUI API compatibility | Low | llama.cpp v1 API is OpenAI-compatible |
|
||||
| Intel AVX2 performance | Low | llama.cpp auto-detects and uses AVX2 |
|
||||
Reference in New Issue
Block a user