# Llama Server

Local LLM inference server using llama.cpp. Serves GGUF models via an OpenAI-compatible REST API.

**Image**: `ghcr.io/ggml-org/llama.cpp:server-b8840` (CPU-only, AVX2/AVX512)

## Purpose

- **Port**: 8080 (TCP)
- **Memory**: 8G reservation (7B Q4 models fit in ~6-7GB RAM)
- **Category**: AI / LLM inference

CPU-only inference; AVX2/AVX512 support is auto-detected on x86_64 CPUs. No GPU needed.

## Model Setup

llama-server does not bundle models. You must download GGUF files manually.

SSH into your ZimaOS device and run:

```bash
# Create models directory
mkdir -p /DATA/AppData/llama-server/models

# Example: Download Llama 3.2 3B Q4_K_M (~1.8GB)
curl -L -o /DATA/AppData/llama-server/models/llama-3.2-3b-q4_k_m.gguf \
  "https://huggingface.co/QuantFactory/Llama-3.2-3B-Instruct-GGUF/resolve/main/Llama-3.2-3B-Instruct.Q4_K_M.gguf"
```

## Recommended Models for 16GB RAM

| Model | File Size | Quant | RAM Needed | Speed (est.) |
|-------|-----------|-------|------------|--------------|
| Llama 3.2 3B | 1.8GB | Q4_K_M | ~4GB | ~15-20 tok/s |
| Phi-3.5 Mini 3.8B | 2.4GB | Q4_K_M | ~4GB | ~15-20 tok/s |
| Mistral 7B | 4.1GB | Q4_K_M | ~6-7GB | ~8-12 tok/s |
| Qwen 2.5 7B | 4.4GB | Q4_K_M | ~6-7GB | ~8-12 tok/s |

For 7B models, close other apps to free RAM. The 8G reservation leaves headroom.

## Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `MODEL` | `llama-3.2-3b-q4_k_m.gguf` | Model filename in `/models` |
| `CTX_SIZE` | `2048` | Context window size (tokens) |
| `N_THREADS` | `0` | CPU threads (0 = auto) |
| `HOST` | `0.0.0.0` | Listen address |
| `PORT` | `8080` | API port |
| `MAX_TOKENS` | `512` | Max tokens per response |

Change `MODEL` to match your downloaded file, then restart the container for the change to take effect.
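When switching `MODEL`, it can help to confirm that the downloaded file really is a GGUF model (a failed download may save an HTML error page under the `.gguf` name) and, after the restart, to wait until the API responds. The following is a minimal sketch, not part of llama.cpp or this app: the function names `is_gguf` and `wait_for_server` and the `MODELS_DIR` and `BASE_URL` variables are illustrative assumptions. It relies only on the fact that GGUF files begin with the four ASCII bytes `GGUF`, and on the `/v1/models` endpoint this README already uses.

```bash
#!/usr/bin/env bash
# Sketch: sanity-check a model file before pointing MODEL at it, then wait
# for the API after a restart. All names below are illustrative assumptions.

MODELS_DIR="${MODELS_DIR:-/DATA/AppData/llama-server/models}"
BASE_URL="${BASE_URL:-http://localhost:8080}"

# Return 0 if the file exists and starts with the 4-byte GGUF magic ("GGUF").
# An HTML error page saved by a failed download will not pass this check.
is_gguf() {
  [ -f "$1" ] || return 1
  [ "$(head -c 4 "$1" 2>/dev/null)" = "GGUF" ]
}

# Poll /v1/models until it answers successfully.
# $1 = maximum attempts (default 30), $2 = seconds between attempts (default 2).
wait_for_server() {
  local retries="${1:-30}"
  local delay="${2:-2}"
  local attempt=0
  while [ "$attempt" -lt "$retries" ]; do
    curl -sf -o /dev/null "$BASE_URL/v1/models" && return 0
    attempt=$((attempt + 1))
    sleep "$delay"
  done
  return 1
}

# Usage, run by hand on the device:
#   is_gguf "$MODELS_DIR/llama-3.2-3b-q4_k_m.gguf" && echo "model file looks valid"
#   wait_for_server 30 2 && echo "server is ready"
```

The magic-byte check is deliberately shallow: it catches wrong or truncated-to-nothing downloads, but it does not verify that the file is complete or that the model fits in RAM.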
## API Testing

Once running, test the API:

```bash
# Check server info
curl http://localhost:8080/v1/models

# Chat completions (OpenAI-compatible)
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.2-3b-q4_k_m.gguf",
    "messages": [{"role": "user", "content": "Hello, who are you?"}],
    "max_tokens": 128
  }'
```

## Volumes

| Path | Description |
|------|-------------|
| `/models` | GGUF model files |
| `/logs` | Server log output |

## Architecture

- `amd64` (Intel/AMD x86_64)
- `arm64` (Apple Silicon, ARM servers)

## Security

- `security_opt: no-new-privileges:true`
- `cap_drop: ALL`
- CPU-only, no privileged access needed
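To confirm that the hardening options listed under Security are actually applied to a running container, the container's host configuration can be inspected. This is a hedged sketch that assumes the app runs under Docker and that the container is named `llama-server`; both are assumptions, so check `docker ps` for the real name. The helper name `check_hardening` is hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: check that a container drops all capabilities and sets
# no-new-privileges. The default container name is an assumption.

check_hardening() {
  local name="${1:-llama-server}"
  local caps opts

  # Declare first, assign second, so a failing `docker inspect` is not masked.
  caps="$(docker inspect --format '{{json .HostConfig.CapDrop}}' "$name")" || return 2
  opts="$(docker inspect --format '{{json .HostConfig.SecurityOpt}}' "$name")" || return 2

  case "$caps" in
    *'"ALL"'*) echo "cap_drop: ALL is set" ;;
    *) echo "cap_drop: ALL is NOT set (got $caps)"; return 1 ;;
  esac

  case "$opts" in
    *no-new-privileges*) echo "no-new-privileges is set" ;;
    *) echo "no-new-privileges is NOT set (got $opts)"; return 1 ;;
  esac
}

# Usage, run by hand on the device:
#   check_hardening llama-server
```

The function returns 0 when both options are present, 1 when one is missing, and 2 when the container cannot be inspected at all.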