diff --git a/Apps/llama-server/README.md b/Apps/llama-server/README.md
new file mode 100644
index 0000000..18cb37d
--- /dev/null
+++ b/Apps/llama-server/README.md
@@ -0,0 +1,86 @@
+# Llama Server
+
+Local LLM inference server using llama.cpp. Serves GGUF models via OpenAI-compatible REST API.
+
+## Purpose
+
+- **Port**: 8080 (TCP)
+- **Memory**: 8G reservation (7B Q4 models fit in ~6-7GB RAM)
+- **Category**: AI / LLM inference
+
+CPU-only inference with AVX2/AVX512 auto-detection. No GPU needed.
+
+## Model Setup
+
+llama-server does not bundle models. You must download GGUF files manually.
+
+SSH into your ZimaOS device and run:
+
+```bash
+# Create models directory
+mkdir -p /DATA/AppData/llama-server/models
+
+# Example: Download Llama 3.2 3B Q4_K_M (~1.8GB)
+curl -L -o /DATA/AppData/llama-server/models/llama-3.2-3b-q4_k_m.gguf \
+  "https://huggingface.co/QuantFactory/Llama-3.2-3B-Instruct-GGUF/resolve/main/Llama-3.2-3B-Instruct.Q4_K_M.gguf"
+```
+
+## Recommended Models for 16GB RAM
+
+| Model | Size | Quant | RAM Needed | Speed (est.) |
+|-------|------|-------|------------|--------------|
+| Llama 3.2 3B | 1.8GB | Q4_K_M | ~4GB | ~15-20 tok/s |
+| Phi-3.5 Mini 3B | 1.8GB | Q4_K_M | ~4GB | ~15-20 tok/s |
+| Mistral 7B | 4.1GB | Q4_K_M | ~6-7GB | ~8-12 tok/s |
+| Qwen 2.5 7B | 4.4GB | Q4_K_M | ~6-7GB | ~8-12 tok/s |
+
+For 7B models, close other apps to free RAM. 8G reservation leaves headroom.
+
+## Environment Variables
+
+| Variable | Default | Description |
+|----------|---------|-------------|
+| `MODEL` | `llama-3.2-3b-q4_k_m.gguf` | Model filename in `/models` |
+| `CTX_SIZE` | `2048` | Context window size (tokens) |
+| `N_THREADS` | `0` | CPU threads (0 = auto) |
+| `HOST` | `0.0.0.0` | Listen address |
+| `PORT` | `8080` | API port |
+| `MAX_TOKENS` | `512` | Max tokens per response |
+
+Change `MODEL` to match your downloaded file. Restart container after changing.
+
+## API Testing
+
+Once running, test the API:
+
+```bash
+# Check server info
+curl http://localhost:8080/v1/models
+
+# Chat completions (OpenAI-compatible)
+curl http://localhost:8080/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "llama-3.2-3b-q4_k_m.gguf",
+    "messages": [{"role": "user", "content": "Hello, who are you?"}],
+    "max_tokens": 128
+  }'
+```
+
+## Volumes
+
+| Path | Description |
+|------|-------------|
+| `/models` | GGUF model files |
+| `/logs` | Server log output |
+
+## Architecture
+
+- `amd64` (Intel/AMD x86_64)
+- `arm64` (Apple Silicon, ARM servers)
+
+## Security
+
+- `security_opt: no-new-privileges:true`
+- `cap_drop: ALL`
+- CPU-only, no privileged access needed
diff --git a/Apps/llama-server/docker-compose.yaml b/Apps/llama-server/docker-compose.yaml
new file mode 100644
index 0000000..6cc2126
--- /dev/null
+++ b/Apps/llama-server/docker-compose.yaml
@@ -0,0 +1,82 @@
+name: llama-server
+
+services:
+  llama-server:
+    image: ghcr.io/ggerganov/llama.cpp:server
+    container_name: llama-server
+    restart: unless-stopped
+    environment:
+      TZ: Europe/Stockholm
+      MODEL: llama-3.2-3b-q4_k_m.gguf
+      CTX_SIZE: "2048"
+      N_THREADS: "0"
+      HOST: 0.0.0.0
+      PORT: "8080"
+      MAX_TOKENS: "512"
+    ports:
+      - target: 8080
+        published: "8080"
+        protocol: tcp
+    volumes:
+      - type: bind
+        source: /DATA/AppData/$AppID/models
+        target: /models
+      - type: bind
+        source: /DATA/AppData/$AppID/logs
+        target: /logs
+    deploy:
+      resources:
+        reservations:
+          memory: 8G
+    security_opt:
+      - no-new-privileges:true
+    cap_drop:
+      - ALL
+    x-casaos:
+      envs:
+        - container: MODEL
+          description:
+            en_us: Model filename inside /models (e.g. llama-3.2-3b-q4_k_m.gguf). Download GGUF files manually into /models.
+        - container: CTX_SIZE
+          description:
+            en_us: Context window size in tokens
+        - container: N_THREADS
+          description:
+            en_us: CPU threads (0 = auto-detect all cores)
+        - container: MAX_TOKENS
+          description:
+            en_us: Maximum tokens to generate per response
+        - container: TZ
+          description:
+            en_us: Timezone, for example Europe/Stockholm
+      ports:
+        - container: "8080"
+          description:
+            en_us: llama.cpp REST API port
+      volumes:
+        - container: /models
+          description:
+            en_us: Model GGUF files directory
+        - container: /logs
+          description:
+            en_us: Server log output
+
+x-casaos:
+  architectures:
+    - amd64
+    - arm64
+  main: llama-server
+  category: ai
+  author: Joachim Friberg
+  developer: Joachim Friberg
+  icon: https://cdn.simpleicons.org/llama
+  tagline:
+    en_us: CPU-only LLM inference server with REST API
+  description:
+    en_us: >
+      Local LLM inference server using llama.cpp. Serves GGUF models via OpenAI-compatible REST API.
+      CPU-only with AVX2/AVX512 optimization. Requires manual model download.
+  title:
+    en_us: Llama Server
+  index: /
+  port_map: "8080"
diff --git a/Apps/open-webui/README.md b/Apps/open-webui/README.md
new file mode 100644
index 0000000..a0fdb86
--- /dev/null
+++ b/Apps/open-webui/README.md
@@ -0,0 +1,73 @@
+# OpenWebUI
+
+Modern chat web interface for local LLMs. Connects to llama-server via Docker internal networking.
+
+## Purpose
+
+- **Port**: 3000 (TCP)
+- **Memory**: 2G reservation
+- **Category**: AI / LLM UI
+
+Requires the **llama-server** app to be running first. Connects to `http://llama-server:8080` internally.
+
+## Prerequisites
+
+1. Deploy and start **llama-server** app first
+2. Download a GGUF model into llama-server's `/models` directory
+3. Ensure llama-server container is healthy
+
+## Access
+
+Open in browser:
+
+```
+http://:3000
+```
+
+First run may take a moment to initialize.
+
+## Environment Variables
+
+| Variable | Default | Description |
+|----------|---------|-------------|
+| `OLLAMA_BASE_URL` | `http://llama-server:8080` | Internal URL to llama-server API |
+| `WEBUI_PORT` | `3000` | Container listen port |
+| `TZ` | `Europe/Stockholm` | Timezone |
+
+## If Connection Fails
+
+1. Verify llama-server is running: `docker ps | grep llama-server`
+2. Check llama-server logs: `docker logs llama-server`
+3. Ensure llama-server MODEL env matches your downloaded file
+4. From ZimaOS shell, test connectivity:
+   ```bash
+   curl http://llama-server:8080/v1/models
+   ```
+
+## Volumes
+
+| Path | Description |
+|------|-------------|
+| `/app/backend/data` | OpenWebUI persistent data (chat history, settings) |
+
+## Architecture
+
+- `amd64` (Intel/AMD x86_64)
+- `arm64` (Apple Silicon, ARM servers)
+
+## Security
+
+- `security_opt: no-new-privileges:true`
+- `cap_drop: ALL`
+
+## Troubleshooting
+
+**"Cannot connect to LLM" error in UI**
+- Verify llama-server is running before open-webui
+- Check that `OLLAMA_BASE_URL` is set to `http://llama-server:8080`
+- Verify model file exists in `/DATA/AppData/llama-server/models/`
+
+**Slow responses**
+- 7B models on CPU are limited by single-thread performance
+- 3B models recommended for interactive speeds (~15+ tok/s)
+- Close other apps to free RAM
diff --git a/Apps/open-webui/docker-compose.yaml b/Apps/open-webui/docker-compose.yaml
new file mode 100644
index 0000000..8569bb3
--- /dev/null
+++ b/Apps/open-webui/docker-compose.yaml
@@ -0,0 +1,68 @@
+name: open-webui
+
+services:
+  open-webui:
+    image: ghcr.io/open-webui/open-webui:main
+    container_name: open-webui
+    restart: unless-stopped
+    environment:
+      TZ: Europe/Stockholm
+      OLLAMA_BASE_URL: http://llama-server:8080
+      WEBUI_PORT: "3000"
+    ports:
+      - target: 3000
+        published: "3000"
+        protocol: tcp
+    volumes:
+      - type: bind
+        source: /DATA/AppData/$AppID/data
+        target: /app/backend/data
+    deploy:
+      resources:
+        reservations:
+          memory: 2G
+
+    depends_on:
+      - llama-server
+    security_opt:
+      - no-new-privileges:true
+    cap_drop:
+      - ALL
+    x-casaos:
+      envs:
+        - container: OLLAMA_BASE_URL
+          description:
+            en_us: Internal URL to llama-server API (http://llama-server:8080)
+        - container: WEBUI_PORT
+          description:
+            en_us: Web UI listen port inside container
+        - container: TZ
+          description:
+            en_us: Timezone, for example Europe/Stockholm
+      ports:
+        - container: "3000"
+          description:
+            en_us: OpenWebUI web interface port
+      volumes:
+        - container: /app/backend/data
+          description:
+            en_us: OpenWebUI persistent data (chat history, settings)
+
+x-casaos:
+  architectures:
+    - amd64
+    - arm64
+  main: open-webui
+  category: ai
+  author: Joachim Friberg
+  developer: Joachim Friberg
+  icon: https://cdn.simpleicons.org/webui
+  tagline:
+    en_us: Modern chat UI for local LLMs
+  description:
+    en_us: >
+      OpenWebUI provides a modern, feature-rich web interface for interacting with local LLMs.
+      Connect to llama-server or any OpenAI-compatible API. Requires llama-server app to be running first.
+  title:
+    en_us: OpenWebUI
+  index: /
+  port_map: "3000"
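
Both READMEs in this diff describe HTTP endpoints published on the host (8080 for llama-server, 3000 for open-webui) and ask that llama-server be up before open-webui. A minimal smoke-test sketch for that setup might look like the following. It assumes `curl` is installed and that the ports are published as in the compose files; the helper names `http_status` and `describe_status` are illustrative and not part of llama.cpp, Open WebUI, or ZimaOS.

```bash
#!/usr/bin/env bash
# Smoke-test sketch for the two apps added in this diff.
# Helper names are hypothetical; only standard curl options are used.

# Print the HTTP status code for a URL. curl reports 000 when no HTTP
# response arrives (connection refused, timeout, DNS failure).
http_status() {
  curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$1" || true
}

# Turn a status code into a short verdict.
describe_status() {
  case "$1" in
    2??) echo "ok" ;;
    000) echo "unreachable" ;;
    *)   echo "unexpected ($1)" ;;
  esac
}

# Check every URL passed on the command line, in the order given.
for url in "$@"; do
  printf '%s: %s\n' "$url" "$(describe_status "$(http_status "$url")")"
done
```

Saved under a name of your choosing (for example `smoke-test.sh`, a hypothetical filename), it could be run from the host as `bash smoke-test.sh http://localhost:8080/v1/models http://localhost:3000/`. Listing the llama-server URL first mirrors the prerequisite order given in the open-webui README.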