Merge pull request 'Add llama-server and open-webui apps for local LLM inference' (#4) from llama-server+open-webui/initial/local-llm-inference into main

Reviewed-on: phirna/zima-apps#4
2026-04-21 17:56:28 +02:00
5 changed files with 366 additions and 0 deletions
@@ -129,6 +129,61 @@ The "data to collect" section must cover at least:
- logs from the affected containers,
- concrete failure observations (hostname, time, expected vs. actual behavior).
## 11) Release and publishing workflow
### Step 1: Branch
Create a branch using the format from section 8:
`<appname>/<initial|bugfix|update>/<detail>`
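The branch format can be checked before pushing. The following is a minimal sketch, not part of the repository: `valid_branch` is a hypothetical helper, and the allowed character set (lowercase letters, digits, `+`, `.`, `_`, `-`) is an assumption, since section 8 is not shown here.

```bash
#!/usr/bin/env bash
# Hypothetical helper: checks a branch name against
# <appname>/<initial|bugfix|update>/<detail>.
# The permitted characters are an assumption, not taken from section 8.
valid_branch() {
  local re='^[a-z0-9][a-z0-9+._-]*/(initial|bugfix|update)/[a-z0-9][a-z0-9._-]*$'
  [[ "$1" =~ $re ]]
}

# Example use:
#   valid_branch "$(git rev-parse --abbrev-ref HEAD)" || echo "unexpected branch name" >&2
```

The branch merged by this pull request, `llama-server+open-webui/initial/local-llm-inference`, passes this check.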
### Step 2: Verify images (before committing)
Check that all Docker images are available online. The `build-appstore-zip.sh` script verifies this automatically. Run it to check, or use:
```bash
docker manifest inspect <image:tag@sha256:...>
```
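To run the manual check against every image at once, the references can be pulled out of the compose files first. This is a rough sketch, assuming each service declares its image on a single `image: <ref>` line; `list_images` and `check_images` are hypothetical names, and the file paths in the usage comment are an assumed layout.

```bash
#!/usr/bin/env bash
# List the image references declared in the given compose files, one per line.
# Assumes single-line "image: <ref>" entries; quotes around the value are stripped.
list_images() {
  sed -n -E 's/^[[:space:]]*image:[[:space:]]*([^[:space:]#]+).*$/\1/p' "$@" \
    | tr -d "\"'" | sort -u
}

# Query the registry for each reference; returns non-zero if any lookup fails.
check_images() {
  local image status=0
  while IFS= read -r image; do
    if docker manifest inspect "$image" </dev/null >/dev/null 2>&1; then
      echo "ok       $image"
    else
      echo "MISSING  $image"
      status=1
    fi
  done < <(list_images "$@")
  return "$status"
}

# Example use (path layout is an assumption):
#   check_images Apps/*/docker-compose.yml
```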
### Step 3: Validate locally
Run validation before committing:
```bash
./scripts/validate-appstore.sh
```
### Step 4: Commit changes
- Small, reviewable commits.
- Keep app files separate from `dist/` files.
- Commit message: subject line + bullet points.
### Step 5: Build the appstore zip
```bash
./scripts/build-appstore-zip.sh
```
- Creates `dist/phirna-appstore.zip`.
- Automatically verifies that all images are available online.
- Generates a SHA256 checksum.
- With `CI=true` or `--strict-images`, the build fails if an image is missing.
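A consumer of the zip can compare the generated checksum against the file they downloaded. The sketch below assumes GNU `sha256sum` is available; `verify_sha256` is a hypothetical helper, and where the build script stores its checksum is not shown here, so the expected value is passed in by hand.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed only if FILE hashes to EXPECTED (hex SHA256).
verify_sha256() {
  local file=$1 expected=$2 actual
  actual=$(sha256sum "$file" | awk '{print $1}')
  [ -n "$actual" ] && [ "$actual" = "$expected" ]
}

# Example use (the expected value is a placeholder):
#   verify_sha256 dist/phirna-appstore.zip "<expected-hex-digest>" || echo "checksum mismatch" >&2
```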
### Step 6: Commit dist/
Commit `dist/` separately from the app files:
```bash
git add dist/ && git commit -m "Build appstore zip"
```
### Step 7: Push and PR
```bash
git push -u origin <branch>
```
The PR must include:
- Which app IDs are affected.
- Security risk (low/medium/high).
- High-risk settings that are introduced or changed.
## 11) Gitea Bot (mimir)
To create branches, commits, and PRs via the tea CLI:
@@ -0,0 +1,88 @@
# Llama Server
Local LLM inference server using llama.cpp. Serves GGUF models via OpenAI-compatible REST API.
**Image**: `ghcr.io/ggml-org/llama.cpp:server-b8840` (CPU-only, AVX2/AVX512)
## Purpose
- **Port**: 8080 (TCP)
- **Memory**: 8G reservation (7B Q4 models fit in ~6-7GB RAM)
- **Category**: AI / LLM inference
CPU-only inference with AVX2/AVX512 auto-detection. No GPU needed.
## Model Setup
llama-server does not bundle models. You must download GGUF files manually.
SSH into your ZimaOS device and run:
```bash
# Create models directory
mkdir -p /DATA/AppData/llama-server/models
# Example: Download Llama 3.2 3B Q4_K_M (~1.8GB)
curl -L -o /DATA/AppData/llama-server/models/llama-3.2-3b-q4_k_m.gguf \
"https://huggingface.co/QuantFactory/Llama-3.2-3B-Instruct-GGUF/resolve/main/Llama-3.2-3B-Instruct.Q4_K_M.gguf"
```
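A failed or redirected download can leave an HTML error page where the model should be. GGUF files begin with the four ASCII bytes `GGUF`, so a quick sanity check is possible; `is_gguf` below is a hypothetical helper and only inspects the header, not the full file.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed if the file starts with the GGUF magic bytes.
is_gguf() {
  [ "$(head -c 4 "$1" 2>/dev/null)" = "GGUF" ]
}

# Example use:
#   is_gguf /DATA/AppData/llama-server/models/llama-3.2-3b-q4_k_m.gguf \
#     || echo "not a GGUF file, re-download" >&2
```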
## Recommended Models for 16GB RAM
| Model | Size | Quant | RAM Needed | Speed (est.) |
|-------|------|-------|------------|--------------|
| Llama 3.2 3B | 1.8GB | Q4_K_M | ~4GB | ~15-20 tok/s |
| Phi-3.5 Mini 3.8B | 2.2GB | Q4_K_M | ~4GB | ~15-20 tok/s |
| Mistral 7B | 4.1GB | Q4_K_M | ~6-7GB | ~8-12 tok/s |
| Qwen 2.5 7B | 4.4GB | Q4_K_M | ~6-7GB | ~8-12 tok/s |
For 7B models, close other apps to free RAM. The 8G reservation leaves roughly 1-2GB of headroom above the ~6-7GB these models need.
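The "RAM Needed" column is roughly the model file size plus about 2GB. As a rule-of-thumb sketch only (the fixed 2048 MiB allowance for KV cache and runtime overhead is an assumption read off the table, not a measurement), a model can be checked against the memory currently available on a Linux host:

```bash
#!/usr/bin/env bash
# MemAvailable from /proc/meminfo, converted from KiB to MiB (Linux only).
available_mib() {
  awk '/^MemAvailable:/ {print int($2 / 1024)}' /proc/meminfo
}

# Hypothetical heuristic: model size + fixed overhead must fit in available RAM.
# Arguments: model size in MiB, available RAM in MiB, optional overhead in MiB.
model_fits() {
  local model_mib=$1 avail_mib=$2 overhead_mib=${3:-2048}
  [ $((model_mib + overhead_mib)) -le "$avail_mib" ]
}

# Example use (GNU stat assumed):
#   size_mib=$(( $(stat -c %s /DATA/AppData/llama-server/models/llama-3.2-3b-q4_k_m.gguf) / 1048576 ))
#   model_fits "$size_mib" "$(available_mib)" || echo "model may not fit in RAM" >&2
```

Larger context sizes increase KV cache use, so the allowance should grow with `CTX_SIZE`.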
## Environment Variables
| Variable | Default | Description |
|----------|---------|-------------|
| `MODEL` | `llama-3.2-3b-q4_k_m.gguf` | Model filename in `/models` |
| `CTX_SIZE` | `2048` | Context window size (tokens) |
| `N_THREADS` | `0` | CPU threads (0 = auto) |
| `HOST` | `0.0.0.0` | Listen address |
| `PORT` | `8080` | API port |
| `MAX_TOKENS` | `512` | Max tokens per response |
Set `MODEL` to the filename of your downloaded model, then restart the container for the change to take effect.
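How the image consumes these variables is not shown here. As a sketch, assuming they have to be translated into `llama-server` command-line flags (for example by a wrapper entrypoint), the mapping could look like the following. `llama_server_args` is a hypothetical helper; the flags themselves (`--model`, `--ctx-size`, `--threads`, `--host`, `--port`, `--n-predict`) are llama.cpp server options, and mapping `MAX_TOKENS` to `--n-predict` is an assumption.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print llama-server arguments, one per line,
# built from this app's variables with the defaults from the table above.
llama_server_args() {
  local args=(
    --model "/models/${MODEL:-llama-3.2-3b-q4_k_m.gguf}"
    --ctx-size "${CTX_SIZE:-2048}"
    --host "${HOST:-0.0.0.0}"
    --port "${PORT:-8080}"
    --n-predict "${MAX_TOKENS:-512}"
  )
  # 0 means "auto" in the table; omit --threads so the server uses its own default.
  if [ "${N_THREADS:-0}" -gt 0 ]; then
    args+=(--threads "$N_THREADS")
  fi
  printf '%s\n' "${args[@]}"
}

# Example use inside a wrapper script (assumption):
#   mapfile -t args < <(llama_server_args)
#   exec /app/llama-server "${args[@]}"
```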
## API Testing
Once running, test the API:
```bash
# List the models the server exposes
curl http://localhost:8080/v1/models
# Chat completions (OpenAI-compatible)
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama-3.2-3b-q4_k_m.gguf",
"messages": [{"role": "user", "content": "Hello, who are you?"}],
"max_tokens": 128
}'
```
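Loading a model can take a while, so requests sent right after container start may fail. A small polling loop, sketched below, waits until the server answers. It uses the llama.cpp server's `/health` endpoint; `wait_until_ready` and the `PROBE` override (useful for swapping in a different check) are hypothetical names.

```bash
#!/usr/bin/env bash
# Hypothetical helper: poll URL until the probe command succeeds.
# Arguments: url, number of attempts (default 30), delay in seconds (default 2).
# PROBE may be set to replace the default curl invocation.
wait_until_ready() {
  local url=$1 tries=${2:-30} delay=${3:-2} i
  for ((i = 1; i <= tries; i++)); do
    # Intentionally unquoted so the default expands into command + flags.
    if ${PROBE:-curl -fsS -o /dev/null} "$url"; then
      return 0
    fi
    sleep "$delay"
  done
  return 1
}

# Example use:
#   wait_until_ready http://localhost:8080/health && curl http://localhost:8080/v1/models
```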
## Volumes
| Path | Description |
|------|-------------|
| `/models` | GGUF model files |
| `/logs` | Server log output |
## Architecture
- `amd64` (Intel/AMD x86_64)
- `arm64` (Apple Silicon, ARM servers)
## Security
- `security_opt: no-new-privileges:true`
- `cap_drop: ALL`
- CPU-only, no privileged access needed
@@ -0,0 +1,82 @@
name: llama-server
services:
  llama-server:
    image: ghcr.io/ggml-org/llama.cpp:server-b8840@sha256:99d2554c4c8d5339649dde530056cf10771823d7cd983dbd0441da9c419976b1
    container_name: llama-server
    restart: unless-stopped
    environment:
      TZ: Europe/Stockholm
      MODEL: llama-3.2-3b-q4_k_m.gguf
      CTX_SIZE: "2048"
      N_THREADS: "0"
      HOST: 0.0.0.0
      PORT: "8080"
      MAX_TOKENS: "512"
    ports:
      - target: 8080
        published: "8080"
        protocol: tcp
    volumes:
      - type: bind
        source: /DATA/AppData/$AppID/models
        target: /models
      - type: bind
        source: /DATA/AppData/$AppID/logs
        target: /logs
    deploy:
      resources:
        reservations:
          memory: 8G
    security_opt:
      - no-new-privileges:true
    cap_drop:
      - ALL
    x-casaos:
      envs:
        - container: MODEL
          description:
            en_us: Model filename inside /models (e.g. llama-3.2-3b-q4_k_m.gguf). Download GGUF files manually into /models.
        - container: CTX_SIZE
          description:
            en_us: Context window size in tokens
        - container: N_THREADS
          description:
            en_us: CPU threads (0 = auto-detect all cores)
        - container: MAX_TOKENS
          description:
            en_us: Maximum tokens to generate per response
        - container: TZ
          description:
            en_us: Timezone, for example Europe/Stockholm
      ports:
        - container: "8080"
          description:
            en_us: llama.cpp REST API port
      volumes:
        - container: /models
          description:
            en_us: Model GGUF files directory
        - container: /logs
          description:
            en_us: Server log output
x-casaos:
  architectures:
    - amd64
    - arm64
  main: llama-server
  category: ai
  author: Joachim Friberg
  developer: Joachim Friberg
  icon: https://cdn.simpleicons.org/llama
  tagline:
    en_us: CPU-only LLM inference server with REST API
  description:
    en_us: >
      Local LLM inference server using llama.cpp. Serves GGUF models via OpenAI-compatible REST API.
      CPU-only with AVX2/AVX512 optimization. Requires manual model download.
  title:
    en_us: Llama Server
  index: /
  port_map: "8080"
@@ -0,0 +1,73 @@
# OpenWebUI
Modern chat web interface for local LLMs. Connects to llama-server via Docker internal networking.
## Purpose
- **Port**: 3000 (TCP)
- **Memory**: 2G reservation
- **Category**: AI / LLM UI
Requires the **llama-server** app to be running first. Connects to `http://llama-server:8080` internally.
## Prerequisites
1. Deploy and start **llama-server** app first
2. Download a GGUF model into llama-server's `/models` directory
3. Ensure llama-server container is healthy
## Access
Open in browser:
```
http://<your-zimaos-host>:3000
```
First run may take a moment to initialize.
## Environment Variables
| Variable | Default | Description |
|----------|---------|-------------|
| `OLLAMA_BASE_URL` | `http://llama-server:8080` | Internal URL to llama-server API |
| `WEBUI_PORT` | `3000` | Container listen port |
| `TZ` | `Europe/Stockholm` | Timezone |
## If Connection Fails
1. Verify llama-server is running: `docker ps | grep llama-server`
2. Check llama-server logs: `docker logs llama-server`
3. Ensure llama-server MODEL env matches your downloaded file
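4. From inside the open-webui container, test connectivity (the name `llama-server` resolves only between containers on a shared Docker network, not from the ZimaOS host shell):
```bash
docker exec open-webui curl http://llama-server:8080/v1/models
```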
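Container names resolve through Docker's embedded DNS only when both containers are attached to the same user-defined network. A rough sketch for checking that is shown below; `container_networks` and `common_lines` are hypothetical helpers, and the container names are the `container_name` values used by the two apps.

```bash
#!/usr/bin/env bash
# Print the Docker networks a container is attached to, one per line.
container_networks() {
  docker inspect -f '{{range $k, $v := .NetworkSettings.Networks}}{{println $k}}{{end}}' "$1"
}

# Print the lines that appear in both newline-separated lists.
common_lines() {
  comm -12 <(sort -u <<<"$1") <(sort -u <<<"$2")
}

# Example use: empty output means the two containers share no network.
#   common_lines "$(container_networks llama-server)" "$(container_networks open-webui)"
```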
## Volumes
| Path | Description |
|------|-------------|
| `/app/backend/data` | OpenWebUI persistent data (chat history, settings) |
## Architecture
- `amd64` (Intel/AMD x86_64)
- `arm64` (Apple Silicon, ARM servers)
## Security
- `security_opt: no-new-privileges:true`
- `cap_drop: ALL`
## Troubleshooting
**"Cannot connect to LLM" error in UI**
- Verify llama-server is running before open-webui
- Check that `OLLAMA_BASE_URL` is set to `http://llama-server:8080`
- Verify model file exists in `/DATA/AppData/llama-server/models/`
**Slow responses**
- 7B models on CPU are typically limited by memory bandwidth and core count, so expect ~8-12 tok/s at best
- 3B models recommended for interactive speeds (~15+ tok/s)
- Close other apps to free RAM
@@ -0,0 +1,68 @@
name: open-webui
services:
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    restart: unless-stopped
    environment:
      TZ: Europe/Stockholm
      OLLAMA_BASE_URL: http://llama-server:8080
      WEBUI_PORT: "3000"
    ports:
      - target: 3000
        published: "3000"
        protocol: tcp
    volumes:
      - type: bind
        source: /DATA/AppData/$AppID/data
        target: /app/backend/data
    deploy:
      resources:
        reservations:
          memory: 2G
    depends_on:
      - llama-server
    security_opt:
      - no-new-privileges:true
    cap_drop:
      - ALL
    x-casaos:
      envs:
        - container: OLLAMA_BASE_URL
          description:
            en_us: Internal URL to llama-server API (http://llama-server:8080)
        - container: WEBUI_PORT
          description:
            en_us: Web UI listen port inside container
        - container: TZ
          description:
            en_us: Timezone, for example Europe/Stockholm
      ports:
        - container: "3000"
          description:
            en_us: OpenWebUI web interface port
      volumes:
        - container: /app/backend/data
          description:
            en_us: OpenWebUI persistent data (chat history, settings)
x-casaos:
  architectures:
    - amd64
    - arm64
  main: open-webui
  category: ai
  author: Joachim Friberg
  developer: Joachim Friberg
  icon: https://cdn.simpleicons.org/webui
  tagline:
    en_us: Modern chat UI for local LLMs
  description:
    en_us: >
      OpenWebUI provides a modern, feature-rich web interface for interacting with local LLMs.
      Connect to llama-server or any OpenAI-compatible API. Requires llama-server app to be running first.
  title:
    en_us: OpenWebUI
  index: /
  port_map: "3000"