Add llama-server and open-webui apps for local LLM inference
- llama-server: llama.cpp REST API server, 8G memory, port 8080
- open-webui: Chat UI connecting to llama-server, 2G memory, port 3000
- Both include x-casaos metadata for ZimaOS app store
- README with model download instructions and API examples
@@ -0,0 +1,86 @@
# Llama Server

Local LLM inference server using llama.cpp. Serves GGUF models via OpenAI-compatible REST API.

## Purpose

- **Port**: 8080 (TCP)
- **Memory**: 8G reservation (7B Q4 models fit in ~6-7GB RAM)
- **Category**: AI / LLM inference

CPU-only inference with AVX2/AVX512 auto-detection. No GPU needed.

## Model Setup

llama-server does not bundle models. You must download GGUF files manually.

SSH into your ZimaOS device and run:

```bash
# Create models directory
mkdir -p /DATA/AppData/llama-server/models

# Example: Download Llama 3.2 3B Q4_K_M (~1.8GB)
curl -L -o /DATA/AppData/llama-server/models/llama-3.2-3b-q4_k_m.gguf \
  "https://huggingface.co/QuantFactory/Llama-3.2-3B-Instruct-GGUF/resolve/main/Llama-3.2-3B-Instruct.Q4_K_M.gguf"
```
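
An interrupted download, or an HTML error page saved under a `.gguf` name, can make the server fail at startup. GGUF files begin with the four ASCII bytes `GGUF`, so a quick sanity check might look like the sketch below. The `is_gguf` helper name is hypothetical and not part of llama.cpp; the path is the one used in the download step.

```bash
# Hypothetical helper: succeed only if the file starts with the GGUF magic bytes.
is_gguf() {
  # head -c 4 reads the first four bytes; tr drops any NUL bytes so the
  # shell can compare the result as a plain string.
  [ -f "$1" ] && [ "$(head -c 4 "$1" 2>/dev/null | tr -d '\000')" = "GGUF" ]
}

# Usage, with the path from the download step:
model=/DATA/AppData/llama-server/models/llama-3.2-3b-q4_k_m.gguf
if is_gguf "$model"; then
  echo "looks like a GGUF file"
else
  echo "missing or not a GGUF file: $model"
fi
```

This only inspects the header, so it catches wrong or empty files but not a download that was truncated partway through.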

## Recommended Models for 16GB RAM

| Model | Size | Quant | RAM Needed | Speed (est.) |
|-------|------|-------|------------|--------------|
| Llama 3.2 3B | 1.8GB | Q4_K_M | ~4GB | ~15-20 tok/s |
| Phi-3.5 Mini 3.8B | 1.8GB | Q4_K_M | ~4GB | ~15-20 tok/s |
| Mistral 7B | 4.1GB | Q4_K_M | ~6-7GB | ~8-12 tok/s |
| Qwen 2.5 7B | 4.4GB | Q4_K_M | ~6-7GB | ~8-12 tok/s |

For 7B models, close other apps to free RAM. The 8G reservation leaves roughly 1-2GB of headroom above the ~6-7GB these models need.
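
To check whether the host currently has enough free memory for a row in the table, one option is to read `MemAvailable` from `/proc/meminfo`, a standard Linux interface. The sketch below is not part of the app: `mem_available_mb` and `fits_in_ram` are hypothetical helper names, and the 7000 MB threshold is simply the upper "RAM Needed" estimate for a 7B model from the table.

```bash
# Hypothetical helper: print MemAvailable in MB from meminfo-formatted text on stdin.
mem_available_mb() {
  # /proc/meminfo reports the value in kB, e.g. "MemAvailable:   12345678 kB".
  awk '/^MemAvailable:/ { printf "%d\n", $2 / 1024 }'
}

# Hypothetical helper: succeed if the available MB ($1) covers the needed MB ($2).
fits_in_ram() {
  [ "$1" -ge "$2" ]
}

# Usage on a Linux host; 7000 MB is the table's upper estimate for a 7B Q4_K_M model.
if [ -r /proc/meminfo ]; then
  avail=$(mem_available_mb < /proc/meminfo)
  if fits_in_ram "$avail" 7000; then
    echo "enough free RAM for a 7B model (${avail} MB available)"
  else
    echo "only ${avail} MB available; consider a 3B model"
  fi
fi
```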

## Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `MODEL` | `llama-3.2-3b-q4_k_m.gguf` | Model filename in `/models` |
| `CTX_SIZE` | `2048` | Context window size (tokens) |
| `N_THREADS` | `0` | CPU threads (0 = auto) |
| `HOST` | `0.0.0.0` | Listen address |
| `PORT` | `8080` | API port |
| `MAX_TOKENS` | `512` | Max tokens per response |

Change `MODEL` to match your downloaded file. Restart container after changing.
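
The table does not show how the container turns these variables into server options. As a rough sketch, assuming a small wrapper entrypoint is needed, the variables could map onto `llama-server` command-line flags as below. The `build_llama_args` function and the `mistral-7b-q4_k_m.gguf` filename are hypothetical, and treating `N_THREADS=0` as "omit the flag and let the server choose" is an assumption taken from the table's "0 = auto" note rather than from llama.cpp itself.

```bash
# Hypothetical wrapper logic: print the llama-server arguments implied by the
# environment variables in the table, using the table's defaults when unset.
build_llama_args() {
  args="--model /models/${MODEL:-llama-3.2-3b-q4_k_m.gguf}"
  args="$args --ctx-size ${CTX_SIZE:-2048}"
  args="$args --host ${HOST:-0.0.0.0} --port ${PORT:-8080}"
  args="$args --n-predict ${MAX_TOKENS:-512}"
  # Assumption: 0 means "auto", so --threads is passed only for explicit values.
  if [ "${N_THREADS:-0}" -gt 0 ]; then
    args="$args --threads $N_THREADS"
  fi
  printf '%s\n' "$args"
}

# Usage: preview the command line for a (hypothetical) 7B model file and four
# threads. The subshell keeps the assignments from leaking into the caller.
( MODEL=mistral-7b-q4_k_m.gguf; N_THREADS=4; build_llama_args )
```

Printing the arguments instead of executing them makes it easy to inspect the result before wiring it into a container entrypoint.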

## API Testing

Once running, test the API:

```bash
# Check server info
curl http://localhost:8080/v1/models

# Chat completions (OpenAI-compatible)
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.2-3b-q4_k_m.gguf",
    "messages": [{"role": "user", "content": "Hello, who are you?"}],
    "max_tokens": 128
  }'
```

## Volumes

| Path | Description |
|------|-------------|
| `/models` | GGUF model files |
| `/logs` | Server log output |

## Architecture

- `amd64` (Intel/AMD x86_64)
- `arm64` (Apple Silicon, ARM servers)

## Security

- `security_opt: no-new-privileges:true`
- `cap_drop: ALL`
- CPU-only, no privileged access needed
@@ -0,0 +1,82 @@
name: llama-server

services:
  llama-server:
    image: ghcr.io/ggerganov/llama.cpp:server
    container_name: llama-server
    restart: unless-stopped
    environment:
      TZ: Europe/Stockholm
      MODEL: llama-3.2-3b-q4_k_m.gguf
      CTX_SIZE: "2048"
      N_THREADS: "0"
      HOST: 0.0.0.0
      PORT: "8080"
      MAX_TOKENS: "512"
    ports:
      - target: 8080
        published: "8080"
        protocol: tcp
    volumes:
      - type: bind
        source: /DATA/AppData/$AppID/models
        target: /models
      - type: bind
        source: /DATA/AppData/$AppID/logs
        target: /logs
    deploy:
      resources:
        reservations:
          memory: 8G
    security_opt:
      - no-new-privileges:true
    cap_drop:
      - ALL
    x-casaos:
      envs:
        - container: MODEL
          description:
            en_us: Model filename inside /models (e.g. llama-3.2-3b-q4_k_m.gguf). Download GGUF files manually into /models.
        - container: CTX_SIZE
          description:
            en_us: Context window size in tokens
        - container: N_THREADS
          description:
            en_us: CPU threads (0 = auto-detect all cores)
        - container: MAX_TOKENS
          description:
            en_us: Maximum tokens to generate per response
        - container: TZ
          description:
            en_us: Timezone, for example Europe/Stockholm
      ports:
        - container: "8080"
          description:
            en_us: llama.cpp REST API port
      volumes:
        - container: /models
          description:
            en_us: Model GGUF files directory
        - container: /logs
          description:
            en_us: Server log output

x-casaos:
  architectures:
    - amd64
    - arm64
  main: llama-server
  category: ai
  author: Joachim Friberg
  developer: Joachim Friberg
  icon: https://cdn.simpleicons.org/llama
  tagline:
    en_us: CPU-only LLM inference server with REST API
  description:
    en_us: >
      Local LLM inference server using llama.cpp. Serves GGUF models via OpenAI-compatible REST API.
      CPU-only with AVX2/AVX512 optimization. Requires manual model download.
  title:
    en_us: Llama Server
  index: /
  port_map: "8080"
@@ -0,0 +1,73 @@
# OpenWebUI

Modern chat web interface for local LLMs. Connects to llama-server via Docker internal networking.

## Purpose

- **Port**: 3000 (TCP)
- **Memory**: 2G reservation
- **Category**: AI / LLM UI

Requires the **llama-server** app to be running first. Connects to `http://llama-server:8080` internally.

## Prerequisites

1. Deploy and start **llama-server** app first
2. Download a GGUF model into llama-server's `/models` directory
3. Ensure llama-server container is healthy
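
The health check in step 3 can be automated by polling until the server responds. The sketch below is a generic retry helper (the `wait_until` name is hypothetical). The commented usage line assumes llama-server's `/health` endpoint is reachable on the published port 8080 and that `curl` is available on the host.

```bash
# Hypothetical helper: run a command once per second until it succeeds,
# giving up after the given number of attempts.
wait_until() {
  attempts=$1
  shift
  while [ "$attempts" -gt 0 ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    attempts=$((attempts - 1))
    # Skip the pause after the final failed attempt.
    if [ "$attempts" -gt 0 ]; then sleep 1; fi
  done
  return 1
}

# Usage (commented out so pasting the snippet has no side effects):
# wait_until 60 curl -fsS http://localhost:8080/health && echo "llama-server is ready"
```

Loading a multi-gigabyte model from disk can take a while, so a generous attempt count is a reasonable starting point.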

## Access

Open in browser:

```
http://<your-zimaos-host>:3000
```

First run may take a moment to initialize.

## Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `OLLAMA_BASE_URL` | `http://llama-server:8080` | Internal URL to llama-server API |
| `WEBUI_PORT` | `3000` | Container listen port |
| `TZ` | `Europe/Stockholm` | Timezone |
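
## If Connection Fails

1. Verify llama-server is running: `docker ps | grep llama-server`
2. Check llama-server logs: `docker logs llama-server`
3. Ensure llama-server MODEL env matches your downloaded file
4. From the ZimaOS shell, test the API through the published port. The `llama-server` hostname resolves only for containers on the same Docker network, not on the host itself:
   ```bash
   curl http://localhost:8080/v1/models
   ```
5. If llama-server answers but no models appear in the UI, check which API the UI is pointed at. llama-server exposes an OpenAI-compatible API under `/v1` rather than the Ollama API, so an OpenAI-style connection to `http://llama-server:8080/v1` may be needed in place of `OLLAMA_BASE_URL`.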

## Volumes

| Path | Description |
|------|-------------|
| `/app/backend/data` | OpenWebUI persistent data (chat history, settings) |

## Architecture

- `amd64` (Intel/AMD x86_64)
- `arm64` (Apple Silicon, ARM servers)

## Security

- `security_opt: no-new-privileges:true`
- `cap_drop: ALL`

## Troubleshooting

**"Cannot connect to LLM" error in UI**
- Verify llama-server is running before open-webui
- Check that `OLLAMA_BASE_URL` is set to `http://llama-server:8080`
- Verify model file exists in `/DATA/AppData/llama-server/models/`

**Slow responses**
- Token generation on CPU is limited mainly by memory bandwidth and core count, so 7B models run noticeably slower than 3B models
- 3B models recommended for interactive speeds (~15+ tok/s)
- Close other apps to free RAM
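
To compare a measured speed against the ~15 tok/s guideline, divide the number of generated tokens by the elapsed time. A minimal sketch, assuming you have read the token count from the response (for example the `usage.completion_tokens` field of an OpenAI-style reply) and timed the request yourself; `tok_per_sec` is a hypothetical helper name.

```bash
# Hypothetical helper: tokens per second from a token count and elapsed milliseconds.
tok_per_sec() {
  awk -v tokens="$1" -v ms="$2" 'BEGIN {
    # Guard against division by zero for missing or zero timings.
    if (ms <= 0) { print "0.0"; exit }
    printf "%.1f\n", tokens * 1000 / ms
  }'
}

# Example: 128 tokens generated in 8 seconds.
tok_per_sec 128 8000   # prints 16.0
```

Wall-clock timing includes prompt processing, so short replies to long prompts will look slower than the generation rate alone.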
@@ -0,0 +1,68 @@
name: open-webui

services:
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    restart: unless-stopped
    environment:
      TZ: Europe/Stockholm
      OLLAMA_BASE_URL: http://llama-server:8080
      WEBUI_PORT: "3000"
    ports:
      - target: 3000
        published: "3000"
        protocol: tcp
    volumes:
      - type: bind
        source: /DATA/AppData/$AppID/data
        target: /app/backend/data
    deploy:
      resources:
        reservations:
          memory: 2G
    # No depends_on here: llama-server runs as a separate Compose project, and
    # depends_on can only reference services defined in this file.
    security_opt:
      - no-new-privileges:true
    cap_drop:
      - ALL
    x-casaos:
      envs:
        - container: OLLAMA_BASE_URL
          description:
            en_us: Internal URL to llama-server API (http://llama-server:8080)
        - container: WEBUI_PORT
          description:
            en_us: Web UI listen port inside container
        - container: TZ
          description:
            en_us: Timezone, for example Europe/Stockholm
      ports:
        - container: "3000"
          description:
            en_us: OpenWebUI web interface port
      volumes:
        - container: /app/backend/data
          description:
            en_us: OpenWebUI persistent data (chat history, settings)

x-casaos:
  architectures:
    - amd64
    - arm64
  main: open-webui
  category: ai
  author: Joachim Friberg
  developer: Joachim Friberg
  icon: https://cdn.simpleicons.org/webui
  tagline:
    en_us: Modern chat UI for local LLMs
  description:
    en_us: >
      OpenWebUI provides a modern, feature-rich web interface for interacting with local LLMs.
      Connect to llama-server or any OpenAI-compatible API. Requires llama-server app to be running first.
  title:
    en_us: OpenWebUI
  index: /
  port_map: "3000"