Llama Server
Local LLM inference server using llama.cpp. Serves GGUF models via OpenAI-compatible REST API.
Image: ghcr.io/ggml-org/llama.cpp:server-b8840 (CPU-only, AVX2/AVX512)
Purpose
- Port: 8080 (TCP)
- Memory: 8G reservation (7B Q4 models fit in ~6-7GB RAM)
- Category: AI / LLM inference
CPU-only inference with AVX2/AVX512 auto-detection. No GPU needed.
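To see which of these instruction sets a given host actually exposes, the CPU flags can be read from /proc/cpuinfo. The following is a minimal sketch, assuming a Linux host; the helper name `simd_level` is illustrative and not part of llama.cpp. It looks for `avx512f` (the AVX-512 foundation flag) first, then `avx2`.

```shell
# simd_level [FILE]: print the best SIMD level found in a cpuinfo-style file.
# Hypothetical helper; FILE defaults to /proc/cpuinfo (Linux only).
simd_level() {
  cpuinfo="${1:-/proc/cpuinfo}"
  if grep -qw 'avx512f' "$cpuinfo" 2>/dev/null; then
    echo "avx512"
  elif grep -qw 'avx2' "$cpuinfo" 2>/dev/null; then
    echo "avx2"
  else
    # Also reached on hosts without /proc/cpuinfo or without AVX flags.
    echo "none"
  fi
}

simd_level
```

A result of `none` on an x86_64 host suggests the CPU predates AVX2, in which case inference may be noticeably slower.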
Model Setup
llama-server does not bundle models. You must download GGUF files manually.
SSH into your ZimaOS device and run:
# Create models directory
mkdir -p /DATA/AppData/llama-server/models
# Example: Download Llama 3.2 3B Q4_K_M (~1.8GB)
curl -L -o /DATA/AppData/llama-server/models/llama-3.2-3b-q4_k_m.gguf \
"https://huggingface.co/QuantFactory/Llama-3.2-3B-Instruct-GGUF/resolve/main/Llama-3.2-3B-Instruct.Q4_K_M.gguf"
Recommended Models for 16GB RAM
| Model | Size | Quant | RAM Needed | Speed (est.) |
|---|---|---|---|---|
| Llama 3.2 3B | 1.8GB | Q4_K_M | ~4GB | ~15-20 tok/s |
| Phi-3.5 Mini 3.8B | 1.8GB | Q4_K_M | ~4GB | ~15-20 tok/s |
| Mistral 7B | 4.1GB | Q4_K_M | ~6-7GB | ~8-12 tok/s |
| Qwen 2.5 7B | 4.4GB | Q4_K_M | ~6-7GB | ~8-12 tok/s |
For 7B models, close other apps to free RAM. The 8G reservation leaves about 1-2GB of headroom above the ~6-7GB a 7B Q4 model needs.
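To compare the "RAM Needed" column against what a host currently has free, the `MemAvailable` figure in /proc/meminfo can be checked before loading a model. This is a minimal sketch, assuming Linux; the helper name `has_free_ram` is hypothetical, and the threshold is given in whole GiB.

```shell
# has_free_ram NEED_GIB [MEMINFO]: succeed if MemAvailable >= NEED_GIB GiB.
# Hypothetical helper; MEMINFO defaults to /proc/meminfo (values are in kB).
has_free_ram() {
  need_kb=$(( $1 * 1024 * 1024 ))
  avail_kb=$(awk '/^MemAvailable:/ {print $2}' "${2:-/proc/meminfo}" 2>/dev/null)
  # A missing file or field counts as 0 kB available.
  [ "${avail_kb:-0}" -ge "$need_kb" ]
}

# 7 is the upper "RAM Needed" figure for the 7B rows in the table above.
if has_free_ram 7; then
  echo "enough free RAM for a 7B Q4_K_M model"
else
  echo "less than 7 GiB available; free some RAM or pick a 3B model"
fi
```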
Environment Variables
| Variable | Default | Description |
|---|---|---|
| MODEL | llama-3.2-3b-q4_k_m.gguf | Model filename in /models |
| CTX_SIZE | 2048 | Context window size (tokens) |
| N_THREADS | 0 | CPU threads (0 = auto) |
| HOST | 0.0.0.0 | Listen address |
| PORT | 8080 | API port |
| MAX_TOKENS | 512 | Max tokens per response |
Change MODEL to match the filename of your downloaded model, then restart the container for the change to take effect.
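A typo in MODEL or a failed download (for example, an HTML error page saved under a .gguf name) will only show up when the server tries to load the file. Since every GGUF file begins with the four ASCII bytes `GGUF`, a quick check before restarting can catch both cases. The sketch below assumes the host models directory created earlier; the helper name `check_model` is hypothetical.

```shell
# check_model DIR FILE: succeed if DIR/FILE exists and starts with the
# 4-byte "GGUF" magic. Hypothetical helper, not part of llama.cpp.
check_model() {
  path="$1/$2"
  # tr strips NUL bytes so the shell never sees them in the substitution.
  [ -f "$path" ] && [ "$(head -c 4 "$path" 2>/dev/null | tr -d '\000')" = "GGUF" ]
}

models_dir=/DATA/AppData/llama-server/models
model="${MODEL:-llama-3.2-3b-q4_k_m.gguf}"
if check_model "$models_dir" "$model"; then
  echo "$model looks like a GGUF file"
else
  echo "$model is missing from $models_dir or is not a GGUF file"
fi
```

This only inspects the file header; it does not confirm that the download is complete or that the model fits in RAM.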
API Testing
Once running, test the API:
# List the model(s) the server has loaded
curl http://localhost:8080/v1/models
# Chat completions (OpenAI-compatible)
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.2-3b-q4_k_m.gguf",
    "messages": [{"role": "user", "content": "Hello, who are you?"}],
    "max_tokens": 128
  }'
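Right after the container starts, the model may still be loading and the requests above can fail. llama-server exposes a `/health` endpoint that returns an HTTP error while loading and HTTP 200 once ready, so a script can poll it first. A minimal sketch follows; the helper name `wait_for_server` and the default of 30 attempts are illustrative choices.

```shell
# wait_for_server [BASE_URL] [ATTEMPTS]: poll BASE_URL/health about once per
# second until it answers with HTTP 200 or the attempts run out.
# Hypothetical helper; defaults assume the port documented above.
wait_for_server() {
  base_url="${1:-http://localhost:8080}"
  attempts="${2:-30}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl exit non-zero on HTTP errors (e.g. 503 while loading).
    if curl -sf --max-time 2 "$base_url/health" >/dev/null 2>&1; then
      return 0
    fi
    # Pause between attempts, but not after the last one.
    [ "$i" -lt "$attempts" ] && sleep 1
    i=$((i + 1))
  done
  return 1
}

# Example (commented out so sourcing this file does not block):
# wait_for_server http://localhost:8080 60 && curl http://localhost:8080/v1/models
```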
Volumes
| Path | Description |
|---|---|
| /models | GGUF model files |
| /logs | Server log output |
Architecture
- amd64 (Intel/AMD x86_64)
- arm64 (Apple Silicon, ARM servers)
Security
- security_opt: no-new-privileges:true
- cap_drop: ALL
- CPU-only, no privileged access needed
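Taken together, the port, memory reservation, volumes, and security options in this document correspond roughly to the `docker run` flags below. This is a sketch rather than the app's actual deployment definition: the container name, the host-side logs directory, and the assumption that the container reads MODEL as described in the Environment Variables table are all illustrative. The function only prints the command so it can be reviewed before being run.

```shell
# print_docker_run [DATA_DIR]: echo a docker run command matching the
# documented settings. Hypothetical helper.
# Assumptions: container name "llama-server" and DATA_DIR/logs on the host.
print_docker_run() {
  data_dir="${1:-/DATA/AppData/llama-server}"
  echo docker run -d --name llama-server \
    --security-opt no-new-privileges:true \
    --cap-drop ALL \
    --memory-reservation 8g \
    -p 8080:8080 \
    -e MODEL=llama-3.2-3b-q4_k_m.gguf \
    -v "$data_dir/models:/models" \
    -v "$data_dir/logs:/logs" \
    ghcr.io/ggml-org/llama.cpp:server-b8840
}

print_docker_run
```

Dropping all capabilities is workable here because the server listens on port 8080, which does not require privileges to bind.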