Add llama-server and open-webui apps for local LLM inference

- llama-server: llama.cpp REST API server, 8G memory, port 8080
- open-webui: Chat UI connecting to llama-server, 2G memory, port 3000
- Both include x-casaos metadata for ZimaOS app store
- README with model download instructions and API examples
2026-04-19 22:25:22 +02:00
parent 231aba08b0
commit 0aabfc8a72
4 changed files with 309 additions and 0 deletions
@@ -0,0 +1,86 @@
 # Llama Server
 Local LLM inference server using llama.cpp. Serves GGUF models via OpenAI-compatible REST API.
 ## Purpose
 - **Port**: 8080 (TCP)
 - **Memory**: 8G reservation (7B Q4 models fit in ~6-7GB RAM)
 - **Category**: AI / LLM inference
 CPU-only inference with AVX2/AVX512 auto-detection. No GPU needed.
 ## Model Setup
 llama-server does not bundle models. You must download GGUF files manually.
 SSH into your ZimaOS device and run:
 ```bash
 # Create models directory
 mkdir -p /DATA/AppData/llama-server/models
 # Example: Download Llama 3.2 3B Q4_K_M (~1.8GB)
 curl -L -o /DATA/AppData/llama-server/models/llama-3.2-3b-q4_k_m.gguf \
  "https://huggingface.co/QuantFactory/Llama-3.2-3B-Instruct-GGUF/resolve/main/Llama-3.2-3B-Instruct.Q4_K_M.gguf"
 ```
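 A failed or redirected download can leave an error page saved under the `.gguf` name. GGUF files begin with the four ASCII bytes `GGUF`, so a quick sanity check is possible before pointing the server at the file. The sketch below is only an illustration: `is_gguf` is a hypothetical helper, not part of llama.cpp, and the path is assumed to match the download step above.
 ```bash
 # Hypothetical helper: succeed only if the file starts with the 4-byte GGUF magic.
 is_gguf() {
   [ -f "$1" ] || return 1
   # Read the first 4 bytes; strip NUL bytes so non-GGUF binaries do not trigger shell warnings.
   magic="$(head -c 4 "$1" 2>/dev/null | tr -d '\0')"
   [ "$magic" = "GGUF" ]
 }
 # Example usage with the path from the download step (assumed, adjust to your file).
 MODEL_FILE="/DATA/AppData/llama-server/models/llama-3.2-3b-q4_k_m.gguf"
 if is_gguf "$MODEL_FILE"; then
   echo "looks like a GGUF file"
 else
   echo "missing or not a GGUF file"
 fi
 ```
 This only inspects the header; it does not detect a truncated download.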
 ## Recommended Models for 16GB RAM
 | Model | Size | Quant | RAM Needed | Speed (est.) |
 |-------|------|-------|------------|--------------|
 | Llama 3.2 3B | 1.8GB | Q4_K_M | ~4GB | ~15-20 tok/s |
 | Phi-3.5 Mini 3.8B | ~2.4GB | Q4_K_M | ~4GB | ~15-20 tok/s |
 | Mistral 7B | 4.1GB | Q4_K_M | ~6-7GB | ~8-12 tok/s |
 | Qwen 2.5 7B | 4.4GB | Q4_K_M | ~6-7GB | ~8-12 tok/s |
 For 7B models, close other apps to free RAM. The 8G reservation leaves roughly 1-2GB of headroom above the ~6-7GB a 7B Q4_K_M model needs.
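 To decide between a 3B and a 7B model, it can help to check how much memory is actually available on the device. The following is a minimal sketch that assumes a Linux host exposing `/proc/meminfo`; `mem_available_gib` and `has_ram_for` are hypothetical helper names, and the 7 GiB threshold is taken from the "RAM Needed" column above.
 ```bash
 # Hypothetical helper: print MemAvailable in whole GiB from a meminfo-style file.
 mem_available_gib() {
   awk '/^MemAvailable:/ { printf "%d\n", $2 / 1024 / 1024 }' "${1:-/proc/meminfo}" 2>/dev/null
 }
 # Hypothetical helper: succeed if at least $1 GiB are available (optional $2: meminfo path).
 has_ram_for() {
   avail="$(mem_available_gib "${2:-/proc/meminfo}")"
   [ "${avail:-0}" -ge "$1" ]
 }
 if has_ram_for 7; then
   echo "enough available RAM for a 7B Q4_K_M model"
 else
   echo "consider a 3B model or free up RAM first"
 fi
 ```
 The value is rounded down to whole GiB, so the check errs on the cautious side.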
 ## Environment Variables
 | Variable | Default | Description |
 |----------|---------|-------------|
 | `MODEL` | `llama-3.2-3b-q4_k_m.gguf` | Model filename in `/models` |
 | `CTX_SIZE` | `2048` | Context window size (tokens) |
 | `N_THREADS` | `0` | CPU threads (0 = auto) |
 | `HOST` | `0.0.0.0` | Listen address |
 | `PORT` | `8080` | API port |
 | `MAX_TOKENS` | `512` | Max tokens per response |
 Change `MODEL` to match the filename of your downloaded model. Restart the container after changing it.
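 Since `MODEL` has to match a filename exactly, listing the available files first avoids typos. A small sketch, assuming the host-side models directory from the setup step; `list_models` is a hypothetical helper name.
 ```bash
 # Hypothetical helper: print the basename of every .gguf file in a directory.
 list_models() {
   dir="${1:-/DATA/AppData/llama-server/models}"
   for f in "$dir"/*.gguf; do
     [ -e "$f" ] || continue   # skip the unexpanded pattern when nothing matches
     basename "$f"
   done
 }
 # Each printed name is a valid value for MODEL.
 list_models
 ```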
 ## API Testing
 Once running, test the API:
 ```bash
 # List the models the server exposes
 curl http://localhost:8080/v1/models
 # Chat completions (OpenAI-compatible)
 curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.2-3b-q4_k_m.gguf",
    "messages": [{"role": "user", "content": "Hello, who are you?"}],
    "max_tokens": 128
  }'
 ```
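 Loading a model can take a while after the container starts, so the first request may fail if it is sent too early. One way to handle this is a small retry loop, sketched below. `retry` is a hypothetical helper; the commented usage line assumes llama-server's `/health` endpoint on the default port and has not been verified against this particular image.
 ```bash
 # Hypothetical helper: run a command up to $1 times, sleeping $2 seconds after each failure.
 retry() {
   tries="$1"; delay="$2"; shift 2
   i=1
   while [ "$i" -le "$tries" ]; do
     "$@" && return 0
     i=$((i + 1))
     sleep "$delay"
   done
   return 1
 }
 # Usage (assumed endpoint and port):
 # retry 30 2 curl -fsS -o /dev/null http://localhost:8080/health && echo "server is ready"
 ```
 With `-f`, curl treats HTTP error responses as failures, so the loop keeps polling until the server answers successfully or the attempts run out.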
 ## Volumes
 | Path | Description |
 |------|-------------|
 | `/models` | GGUF model files |
 | `/logs` | Server log output |
 ## Architecture
 - `amd64` (Intel/AMD x86_64)
 - `arm64` (Apple Silicon, ARM servers)
 ## Security
 - `security_opt: no-new-privileges:true`
 - `cap_drop: ALL`
 - CPU-only, no privileged access needed
@@ -0,0 +1,82 @@
 name: llama-server
 services:
  llama-server:
    image: ghcr.io/ggerganov/llama.cpp:server
    container_name: llama-server
    restart: unless-stopped
    environment:
      TZ: Europe/Stockholm
      MODEL: llama-3.2-3b-q4_k_m.gguf
      CTX_SIZE: "2048"
      N_THREADS: "0"
      HOST: 0.0.0.0
      PORT: "8080"
      MAX_TOKENS: "512"
    ports:
      - target: 8080
        published: "8080"
        protocol: tcp
    volumes:
      - type: bind
        source: /DATA/AppData/$AppID/models
        target: /models
      - type: bind
        source: /DATA/AppData/$AppID/logs
        target: /logs
    deploy:
      resources:
        reservations:
          memory: 8G
    security_opt:
      - no-new-privileges:true
    cap_drop:
      - ALL
    x-casaos:
      envs:
        - container: MODEL
          description:
            en_us: Model filename inside /models (e.g. llama-3.2-3b-q4_k_m.gguf). Download GGUF files manually into /models.
        - container: CTX_SIZE
          description:
            en_us: Context window size in tokens
        - container: N_THREADS
          description:
            en_us: CPU threads (0 = auto-detect all cores)
        - container: MAX_TOKENS
          description:
            en_us: Maximum tokens to generate per response
        - container: TZ
          description:
            en_us: Timezone, for example Europe/Stockholm
      ports:
        - container: "8080"
          description:
            en_us: llama.cpp REST API port
      volumes:
        - container: /models
          description:
            en_us: Model GGUF files directory
        - container: /logs
          description:
            en_us: Server log output
 x-casaos:
  architectures:
    - amd64
    - arm64
  main: llama-server
  category: ai
  author: Joachim Friberg
  developer: Joachim Friberg
  icon: https://cdn.simpleicons.org/llama
  tagline:
    en_us: CPU-only LLM inference server with REST API
  description:
    en_us: >
      Local LLM inference server using llama.cpp. Serves GGUF models via OpenAI-compatible REST API.
      CPU-only with AVX2/AVX512 optimization. Requires manual model download.
  title:
    en_us: Llama Server
  index: /
  port_map: "8080"
@@ -0,0 +1,73 @@
 # OpenWebUI
 Modern chat web interface for local LLMs. Connects to llama-server via Docker internal networking.
 ## Purpose
 - **Port**: 3000 (TCP)
 - **Memory**: 2G reservation
 - **Category**: AI / LLM UI
 Requires the **llama-server** app to be running first. Connects to `http://llama-server:8080` internally, which only resolves when both containers are attached to the same Docker network.
 ## Prerequisites
 1. Deploy and start **llama-server** app first
 2. Download a GGUF model into llama-server's `/models` directory
 3. Ensure llama-server container is healthy
 ## Access
 Open in browser:
 ```
 http://<your-zimaos-host>:3000
 ```
 First run may take a moment to initialize.
 ## Environment Variables
 | Variable | Default | Description |
 |----------|---------|-------------|
 | `OLLAMA_BASE_URL` | `http://llama-server:8080` | Internal URL to llama-server API |
 | `WEBUI_PORT` | `3000` | Container listen port |
 | `TZ` | `Europe/Stockholm` | Timezone |
 ## If Connection Fails
 1. Verify llama-server is running: `docker ps | grep llama-server`
 2. Check llama-server logs: `docker logs llama-server`
 3. Ensure llama-server MODEL env matches your downloaded file
 4. From the ZimaOS shell, test the published API port (the `llama-server` hostname only resolves inside the Docker network, not on the host):
   ```bash
   curl http://localhost:8080/v1/models
   ```
 ## Volumes
 | Path | Description |
 |------|-------------|
 | `/app/backend/data` | OpenWebUI persistent data (chat history, settings) |
 ## Architecture
 - `amd64` (Intel/AMD x86_64)
 - `arm64` (Apple Silicon, ARM servers)
 ## Security
 - `security_opt: no-new-privileges:true`
 - `cap_drop: ALL`
 ## Troubleshooting
 **"Cannot connect to LLM" error in UI**
 - Verify llama-server is running before open-webui
 - Check that `OLLAMA_BASE_URL` is set to `http://llama-server:8080`
 - Verify model file exists in `/DATA/AppData/llama-server/models/`
 **Slow responses**
 - 7B models on CPU are limited mainly by memory bandwidth and core count
 - 3B models recommended for interactive speeds (~15+ tok/s)
 - Close other apps to free RAM
@@ -0,0 +1,68 @@
 name: open-webui
 services:
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    restart: unless-stopped
    environment:
      TZ: Europe/Stockholm
      OLLAMA_BASE_URL: http://llama-server:8080
      WEBUI_PORT: "3000"
    ports:
      - target: 3000
        published: "3000"
        protocol: tcp
    volumes:
      - type: bind
        source: /DATA/AppData/$AppID/data
        target: /app/backend/data
    deploy:
      resources:
        reservations:
          memory: 2G
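     # llama-server runs as a separate app (its own compose project), so it cannot be
     # referenced with depends_on here; deploy and start it before this app.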
    security_opt:
      - no-new-privileges:true
    cap_drop:
      - ALL
    x-casaos:
      envs:
        - container: OLLAMA_BASE_URL
          description:
            en_us: Internal URL to llama-server API (http://llama-server:8080)
        - container: WEBUI_PORT
          description:
            en_us: Web UI listen port inside container
        - container: TZ
          description:
            en_us: Timezone, for example Europe/Stockholm
      ports:
        - container: "3000"
          description:
            en_us: OpenWebUI web interface port
      volumes:
        - container: /app/backend/data
          description:
            en_us: OpenWebUI persistent data (chat history, settings)
 x-casaos:
  architectures:
    - amd64
    - arm64
  main: open-webui
  category: ai
  author: Joachim Friberg
  developer: Joachim Friberg
  icon: https://cdn.simpleicons.org/webui
  tagline:
    en_us: Modern chat UI for local LLMs
  description:
    en_us: >
      OpenWebUI provides a modern, feature-rich web interface for interacting with local LLMs.
      Connect to llama-server or any OpenAI-compatible API. Requires llama-server app to be running first.
  title:
    en_us: OpenWebUI
  index: /
  port_map: "3000"