# Llama Server

Local LLM inference server using llama.cpp. Serves GGUF models via an OpenAI-compatible REST API.

**Image**: `ghcr.io/ggml-org/llama.cpp:server-b8840` (CPU-only, AVX2/AVX512)

## Purpose

- **Port**: 8080 (TCP)
- **Memory**: 8G reservation (7B Q4 models fit in ~6-7GB RAM)
- **Category**: AI / LLM inference

CPU-only inference with AVX2/AVX512 auto-detection. No GPU needed.
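
To see which of these instruction sets the host CPU advertises, a small check along the following lines may help. It is only a sketch: it assumes a Linux x86_64 host whose `/proc/cpuinfo` has a `flags` line, the function name `simd_level` is made up for illustration, and it is not how llama.cpp itself performs detection.

```bash
# simd_level [CPUINFO]: print the widest AVX level found in the CPU flags:
# "avx512", "avx2", or "none". CPUINFO defaults to /proc/cpuinfo.
simd_level() {
  # Take the first "flags" line; tolerate files without one (e.g. on ARM).
  flags=$(grep -m1 -E '^flags' "${1:-/proc/cpuinfo}" 2>/dev/null || true)
  # Pad with spaces so each flag can be matched as a whole word.
  case " $flags " in
    *" avx512f "*) echo avx512 ;;
    *" avx2 "*)    echo avx2 ;;
    *)             echo none ;;
  esac
}
```

Running `simd_level` with no argument on the device reports the level for the local CPU.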

## Model Setup

llama-server does not bundle models. You must download GGUF files manually.

SSH into your ZimaOS device and run:

```bash
# Create models directory
mkdir -p /DATA/AppData/llama-server/models

# Example: Download Llama 3.2 3B Q4_K_M (~1.8GB)
curl -L -o /DATA/AppData/llama-server/models/llama-3.2-3b-q4_k_m.gguf \
  "https://huggingface.co/QuantFactory/Llama-3.2-3B-Instruct-GGUF/resolve/main/Llama-3.2-3B-Instruct.Q4_K_M.gguf"
```
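
A failed download can leave an HTML error page or a truncated file under the `.gguf` name. GGUF files begin with the four ASCII bytes `GGUF`, so a quick sanity check is possible before starting the server. The helper name `is_gguf` below is hypothetical, and the check only confirms the header, not that the file is complete.

```bash
# is_gguf FILE: succeed if FILE exists and starts with the 4-byte magic "GGUF".
is_gguf() {
  [ -f "$1" ] || return 1
  # Read only the first four bytes and compare them with the magic string.
  [ "$(head -c 4 "$1" 2>/dev/null)" = "GGUF" ]
}
```

For example, `is_gguf /DATA/AppData/llama-server/models/llama-3.2-3b-q4_k_m.gguf && echo "header looks like GGUF"`.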

## Recommended Models for 16GB RAM

| Model | File Size | Quant | RAM Needed | Speed (est.) |
|-------|-----------|-------|------------|--------------|
| Llama 3.2 3B | 1.8GB | Q4_K_M | ~4GB | ~15-20 tok/s |
| Phi-3.5 Mini 3B | 1.8GB | Q4_K_M | ~4GB | ~15-20 tok/s |
| Mistral 7B | 4.1GB | Q4_K_M | ~6-7GB | ~8-12 tok/s |
| Qwen 2.5 7B | 4.4GB | Q4_K_M | ~6-7GB | ~8-12 tok/s |

For 7B models, close other apps to free RAM. The 8G reservation leaves headroom above the ~6-7GB a 7B Q4 model needs.
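
Before loading a larger model, it can be worth comparing the "RAM Needed" column against what the device currently has free. The sketch below assumes a Linux host that exposes `MemAvailable` in `/proc/meminfo`; the function names `mem_available_gb` and `model_fits` are illustrative, not part of llama.cpp or ZimaOS.

```bash
# mem_available_gb [MEMINFO]: print MemAvailable in whole GiB, rounded down.
# MEMINFO defaults to /proc/meminfo, where the value is reported in kB.
mem_available_gb() {
  awk '/^MemAvailable:/ { printf "%d\n", $2 / 1024 / 1024 }' "${1:-/proc/meminfo}"
}

# model_fits NEEDED_GB [MEMINFO]: succeed if available memory covers NEEDED_GB.
model_fits() {
  avail=$(mem_available_gb "${2:-/proc/meminfo}")
  [ -n "$avail" ] && [ "$avail" -ge "$1" ]
}
```

For a 7B Q4 model, `model_fits 7 || echo "free up RAM first"` gives a rough go/no-go signal.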

## Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `MODEL` | `llama-3.2-3b-q4_k_m.gguf` | Model filename in `/models` |
| `CTX_SIZE` | `2048` | Context window size (tokens) |
| `N_THREADS` | `0` | CPU threads (0 = auto) |
| `HOST` | `0.0.0.0` | Listen address |
| `PORT` | `8080` | API port |
| `MAX_TOKENS` | `512` | Max tokens per response |

Change `MODEL` to match your downloaded file, then restart the container for the change to take effect.
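
How these variables reach the server is not shown here. One plausible arrangement is a wrapper that translates them into `llama-server` command-line flags, sketched below. The flags used (`--model`, `--ctx-size`, `--threads`, `--host`, `--port`, `--n-predict`) are existing llama-server options, but the mapping itself, the `MAX_TOKENS` to `--n-predict` pairing, and the function name `build_llama_args` are assumptions rather than this app's actual entrypoint.

```bash
# build_llama_args: print llama-server arguments, one per line, derived from
# the environment variables in the table above (assumed mapping).
build_llama_args() {
  # Reuse the function's positional parameters as the argument list.
  set -- \
    --model "/models/${MODEL:-llama-3.2-3b-q4_k_m.gguf}" \
    --ctx-size "${CTX_SIZE:-2048}" \
    --host "${HOST:-0.0.0.0}" \
    --port "${PORT:-8080}" \
    --n-predict "${MAX_TOKENS:-512}"
  # 0 means "auto": omit --threads so llama-server picks its own default.
  if [ "${N_THREADS:-0}" -gt 0 ]; then
    set -- "$@" --threads "$N_THREADS"
  fi
  printf '%s\n' "$@"
}
```

Printing the arguments instead of running the server keeps the sketch easy to inspect; a real wrapper would more likely finish by handing the same list to the `llama-server` binary.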

## API Testing

Once running, test the API:

```bash
# List the models the server exposes
curl http://localhost:8080/v1/models

# Chat completions (OpenAI-compatible)
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.2-3b-q4_k_m.gguf",
    "messages": [{"role": "user", "content": "Hello, who are you?"}],
    "max_tokens": 128
  }'
```
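
Right after a restart the server may still be loading the model, so the requests above can fail for a short while. llama-server provides a `/health` endpoint that can be polled until it answers successfully. The retry helper below is a generic sketch; the name `wait_until_ready` and the attempt and delay values in the usage note are arbitrary choices.

```bash
# wait_until_ready ATTEMPTS DELAY CMD...: run CMD until it succeeds, sleeping
# DELAY seconds between tries; fail after ATTEMPTS unsuccessful tries.
wait_until_ready() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    # Any command works as the probe; its output is discarded.
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    # Pause before the next try, but not after the final one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}
```

For example, `wait_until_ready 30 2 curl -sf http://localhost:8080/health` polls for up to about a minute before giving up.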

## Volumes

| Path | Description |
|------|-------------|
| `/models` | GGUF model files |
| `/logs` | Server log output |

## Architecture

- `amd64` (Intel/AMD x86_64)
- `arm64` (Apple Silicon, ARM servers)

## Security

- `security_opt: no-new-privileges:true`
- `cap_drop: ALL`
- CPU-only, no privileged access needed