# Skinbase Vision Stack — Usage Guide
This document explains how to run and use the Skinbase Vision Stack (Gateway + CLIP, BLIP, YOLO, Qdrant, Card Renderer, Maturity, and optional LLM services).
## Overview

- Services: `gateway`, `clip`, `blip`, `yolo`, `qdrant`, `qdrant-svc`, `card-renderer`, `maturity`, `llm` (FastAPI each, except `qdrant`; `llm` is a thin FastAPI shim that manages an internal `llama-server` process).
- Gateway is the public API endpoint; the other services are internal.
## Model overview

- **CLIP**: Contrastive Language–Image Pretraining — maps images and text into a shared embedding space. Used for zero-shot image tagging, similarity search, and returning ranked tags with confidence scores.
- **BLIP**: Bootstrapping Language-Image Pre-training — a vision–language model for image captioning and multimodal generation. BLIP produces human-readable captions (multiple `variants` supported) and can be tuned with `max_length`.
- **YOLO**: You Only Look Once — a family of real-time object-detection models. YOLO returns detected objects with `class`, `confidence`, and `bbox` (bounding box coordinates); use `conf` to filter low-confidence detections.
- **Qdrant**: High-performance vector similarity search engine. Stores CLIP image embeddings and enables reverse image search (find similar images). The `qdrant-svc` wrapper auto-embeds images via CLIP before upserting.
- **Card Renderer**: Generates branded social-card images (e.g. Open Graph previews) from artwork images. Applies smart center-weighted cropping, gradient overlays, title/username/tag text, and an optional logo. Returns binary image bytes (WebP by default). Template: `nova-artwork-v1`.
- **Maturity**: Dedicated NSFW/maturity classifier. Accepts an image and returns a normalized safety signal including `maturity_label` (safe/mature), `confidence`, raw `score`, optional sublabels (e.g. `nsfw`), and an `action_hint` (`safe`, `review`, `flag_high`) designed for Nova moderation workflows. Powered by `Falconsai/nsfw_image_detection` (ViT-based, Hugging Face). Thresholds are configurable via environment variables.
- **LLM**: Internal text-generation service backed by `llama.cpp` and a GGUF Qwen3 model. Exposed through the gateway for non-streaming chat completions and model discovery. Intended for Nova workflows such as creator bios, metadata suggestions, moderation helper text, and other short internal generation tasks.
## Prerequisites

- Docker Desktop (with `docker compose`) or a compatible Docker environment.
- Recommended: at least 8 GB RAM for CPU-only; more for model memory or GPU use.
## Start the stack

Before starting the stack, create a `.env` file for runtime secrets and environment overrides. Minimum example:

```
API_KEY=your_api_key_here
HUGGINGFACE_TOKEN=your_huggingface_token_here
```

Notes:

- `API_KEY` protects gateway endpoints.
- `HUGGINGFACE_TOKEN` is required if the configured BLIP model requires Hugging Face authentication.
- Startup uses container healthchecks, so initial boot can take longer while models download and warm up.
Optional maturity configuration (can be added to `.env` to override defaults):

```
MATURITY_MODEL=Falconsai/nsfw_image_detection
MATURITY_THRESHOLD_MATURE=0.80
MATURITY_THRESHOLD_REVIEW=0.60
MATURITY_ENABLED=true
```

- `MATURITY_THRESHOLD_MATURE`: score above this → `mature` + `flag_high` (default `0.80`).
- `MATURITY_THRESHOLD_REVIEW`: score above this but below the mature threshold → `mature` + `review` (default `0.60`).
- `MATURITY_ENABLED`: set to `false` to disable maturity endpoints at the gateway without removing the service.
Optional LLM configuration:

```
LLM_URL=http://llm:8080
LLM_ENABLED=false
LLM_TIMEOUT=120
LLM_DEFAULT_MODEL=qwen3-1.7b-instruct-q4_k_m
LLM_MAX_TOKENS_DEFAULT=256
LLM_MAX_TOKENS_HARD_LIMIT=1024
LLM_MAX_REQUEST_BYTES=65536

# Local llm profile only
MODEL_PATH=/models/Qwen3-1.7B-Instruct-Q4_K_M.gguf
LLM_CONTEXT_SIZE=4096
LLM_THREADS=4
LLM_GPU_LAYERS=0
LLM_EXTRA_ARGS=
```
Run from repository root:

```sh
docker compose up -d --build
```

That starts the default vision stack only. To also start the local LLM service:

```sh
docker compose --profile llm up -d --build
```

Before enabling the `llm` profile, provision the GGUF model described in `models/qwen3/README.md` and set `LLM_ENABLED=true` in `.env`.

For small production hosts, the preferred setup is usually to keep the gateway local and point `LLM_URL` at a separate private LLM host:

```
LLM_ENABLED=true
LLM_URL=http://private-llm-host:8080
```

Stop:

```sh
docker compose down
```

View logs:

```sh
docker compose logs -f
docker compose logs -f gateway
```
## Health

Check the gateway health endpoint:

```sh
curl https://vision.klevze.net/health
```

Check LLM-specific gateway health:

```sh
curl -H "X-API-Key: <your-api-key>" https://vision.klevze.net/ai/health
```
## LLM smoke test checklist

Use this sequence on a machine with Docker available, after you have mounted the GGUF model and set `LLM_ENABLED=true` in `.env`.

- Start the gateway with the `llm` profile.

  ```sh
  docker compose --profile llm up -d --build gateway llm
  ```

- Confirm the LLM service came up cleanly.

  ```sh
  docker compose ps llm
  docker compose logs --tail=100 llm
  ```

- Check the repo-owned internal health endpoint.

  ```sh
  curl http://127.0.0.1:8080/health
  ```

  Expected fields: `status`, `model`, `context_size`, `threads`.

- Confirm the gateway sees the LLM backend.

  ```sh
  curl -H "X-API-Key: <your-api-key>" http://127.0.0.1:8003/health
  curl -H "X-API-Key: <your-api-key>" http://127.0.0.1:8003/ai/health
  ```

- Verify model discovery.

  ```sh
  curl -H "X-API-Key: <your-api-key>" http://127.0.0.1:8003/v1/models
  curl -H "X-API-Key: <your-api-key>" http://127.0.0.1:8003/ai/models
  ```

- Run a small chat request through the gateway.

  ```sh
  curl -X POST http://127.0.0.1:8003/v1/chat/completions \
    -H "X-API-Key: <your-api-key>" \
    -H "Content-Type: application/json" \
    -d '{
      "messages": [
        {"role": "system", "content": "You are a concise assistant for Skinbase Nova."},
        {"role": "user", "content": "Write one short admin help sentence about reviewing wallpaper metadata."}
      ],
      "max_tokens": 60,
      "stream": false
    }'
  ```

- If startup or health fails, inspect the relevant logs.

  ```sh
  docker compose logs --tail=200 llm
  docker compose logs --tail=200 gateway
  ```
## Universal analyze (ALL)

Analyze an image by URL (gateway aggregates CLIP, BLIP, YOLO):

```sh
curl -X POST https://vision.klevze.net/analyze/all \
  -H "X-API-Key: <your-api-key>" \
  -H "Content-Type: application/json" \
  -d '{"url":"https://files.skinbase.org/img/aa/bb/cc/md.webp","limit":5}'
```

File upload (multipart):

```sh
curl -X POST https://vision.klevze.net/analyze/all/file \
  -H "X-API-Key: <your-api-key>" \
  -F "file=@/path/to/image.webp" \
  -F "limit=5"
```
Parameters:

- `limit`: optional integer to limit returned tag/caption items.
## Individual services (via gateway)

These endpoints call the specific service through the gateway.

### CLIP — tags

URL request:

```sh
curl -X POST https://vision.klevze.net/analyze/clip \
  -H "X-API-Key: <your-api-key>" \
  -H "Content-Type: application/json" \
  -d '{"url":"https://files.skinbase.org/img/aa/bb/cc/md.webp","limit":5}'
```

File upload:

```sh
curl -X POST https://vision.klevze.net/analyze/clip/file \
  -H "X-API-Key: <your-api-key>" \
  -F "file=@/path/to/image.webp" \
  -F "limit=5"
```

Return: JSON list of tags with confidence scores.
### BLIP — captioning

URL request:

```sh
curl -X POST https://vision.klevze.net/analyze/blip \
  -H "X-API-Key: <your-api-key>" \
  -H "Content-Type: application/json" \
  -d '{"url":"https://files.skinbase.org/img/aa/bb/cc/md.webp","variants":3}'
```

File upload:

```sh
curl -X POST https://vision.klevze.net/analyze/blip/file \
  -H "X-API-Key: <your-api-key>" \
  -F "file=@/path/to/image.webp" \
  -F "variants=3" \
  -F "max_length=60"
```

Parameters:

- `variants`: number of caption variants to return.
- `max_length`: optional maximum caption length.

Return: one or more caption strings (optionally with scores).
### YOLO — object detection

URL request:

```sh
curl -X POST https://vision.klevze.net/analyze/yolo \
  -H "X-API-Key: <your-api-key>" \
  -H "Content-Type: application/json" \
  -d '{"url":"https://files.skinbase.org/img/aa/bb/cc/md.webp","conf":0.25}'
```

File upload:

```sh
curl -X POST https://vision.klevze.net/analyze/yolo/file \
  -H "X-API-Key: <your-api-key>" \
  -F "file=@/path/to/image.webp" \
  -F "conf=0.25"
```

Parameters:

- `conf`: confidence threshold (0.0–1.0).

Return: detected objects with `class`, `confidence`, and `bbox` (bounding box coordinates).
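The `conf` parameter filters server-side; if you need a stricter cut afterwards, detections can also be filtered client-side. A small sketch (the `{"class", "confidence", "bbox"}` shape follows the fields listed above; the helper name is ours):

```python
def filter_detections(detections: list[dict], min_conf: float = 0.5) -> list[dict]:
    """Keep detections at or above min_conf, highest confidence first."""
    kept = [d for d in detections if d["confidence"] >= min_conf]
    return sorted(kept, key=lambda d: d["confidence"], reverse=True)
```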
### Maturity — NSFW / maturity analysis

Analyzes an image for mature or NSFW content and returns a structured signal intended for Nova moderation workflows.

URL request:

```sh
curl -X POST https://vision.klevze.net/analyze/maturity \
  -H "X-API-Key: <your-api-key>" \
  -H "Content-Type: application/json" \
  -d '{"url":"https://files.skinbase.org/img/aa/bb/cc/md.webp"}'
```

File upload:

```sh
curl -X POST https://vision.klevze.net/analyze/maturity/file \
  -H "X-API-Key: <your-api-key>" \
  -F "file=@/path/to/image.webp"
```

Example response:

```json
{
  "maturity_label": "mature",
  "confidence": 0.94,
  "score": 0.94,
  "labels": ["nsfw"],
  "model": "Falconsai/nsfw_image_detection",
  "threshold_used": 0.80,
  "analysis_time_ms": 183.0,
  "source": "maturity-service",
  "action_hint": "flag_high",
  "advisory": "High-confidence mature content detected"
}
```
Response fields:

| Field | Type | Description |
|---|---|---|
| `maturity_label` | string | `safe` or `mature` |
| `confidence` | float | Confidence in the label decision (0–1). For `safe`, this is `1 - score`. |
| `score` | float | Raw NSFW probability from the model (0–1). |
| `labels` | array | Sublabels when mature: currently `["nsfw"]`. Empty for safe results. |
| `model` | string | Model identifier / Hugging Face model ID. |
| `threshold_used` | float | The threshold value that determined the label. |
| `analysis_time_ms` | float | Inference time in milliseconds. |
| `source` | string | Always `maturity-service`. |
| `action_hint` | string | `safe`, `review`, or `flag_high`. Use this in Nova to drive blur/queue/flag decisions. |
| `advisory` | string | Short human-readable explanation. |
`action_hint` decision logic:

- `flag_high`: score ≥ `MATURITY_THRESHOLD_MATURE` (default 0.80) — high-confidence mature, flag for moderation.
- `review`: score ≥ `MATURITY_THRESHOLD_REVIEW` (default 0.60) but below the mature threshold — possibly mature, queue for human review.
- `safe`: score below both thresholds — content appears safe.

If the maturity service is unavailable, the gateway returns a 502 or 503 error. Nova must not treat a gateway failure as a safe result — retry or queue for later processing.
## LLM / Chat endpoints

The gateway validates requests, clamps `max_tokens` to configured limits, rejects oversized payloads, and normalizes downstream failures into JSON under an `error` key.

### OpenAI-style chat completions

```sh
curl -X POST https://vision.klevze.net/v1/chat/completions \
  -H "X-API-Key: <your-api-key>" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "system", "content": "You are a concise assistant for Skinbase Nova."},
      {"role": "user", "content": "Write a short biography for a creator known for sci-fi environments."}
    ],
    "temperature": 0.7,
    "max_tokens": 220,
    "stream": false
  }'
```
Supported request fields:

- `messages` (required)
- `temperature`
- `max_tokens`
- `stream` (`false` only in v1)
- `top_p`
- `stop`
- `presence_penalty`
- `frequency_penalty`

Validation rules:

- At least one message is required.
- Roles must be `system`, `user`, or `assistant`.
- Empty message content is rejected.
- Oversized request bodies return `413`.
- `max_tokens` is clamped to `LLM_MAX_TOKENS_HARD_LIMIT`.
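A sketch of how these rules compose, useful for pre-validating payloads before they hit the gateway (function and constant names are ours; the limits mirror the example `.env` values above, and the gateway's actual implementation may differ):

```python
HARD_LIMIT = 1024          # LLM_MAX_TOKENS_HARD_LIMIT
DEFAULT_MAX_TOKENS = 256   # LLM_MAX_TOKENS_DEFAULT
ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_chat_request(payload: dict) -> dict:
    """Apply the validation rules above; raises ValueError on bad input."""
    messages = payload.get("messages") or []
    if not messages:
        raise ValueError("at least one message is required")
    for m in messages:
        if m.get("role") not in ALLOWED_ROLES:
            raise ValueError(f"unsupported role: {m.get('role')!r}")
        if not m.get("content"):
            raise ValueError("empty message content is rejected")
    # Clamp max_tokens to the hard limit rather than rejecting the request.
    requested = payload.get("max_tokens", DEFAULT_MAX_TOKENS)
    payload["max_tokens"] = min(int(requested), HARD_LIMIT)
    return payload
```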
### Project-friendly chat response

```sh
curl -X POST https://vision.klevze.net/ai/chat \
  -H "X-API-Key: <your-api-key>" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "system", "content": "You are a helpful metadata assistant."},
      {"role": "user", "content": "Suggest five tags for a fantasy castle wallpaper."}
    ]
  }'
```

Example response:

```json
{
  "model": "qwen3-1.7b-instruct-q4_k_m",
  "content": "fantasy castle, moonlit fortress, medieval towers, epic landscape, digital painting",
  "finish_reason": "stop",
  "usage": {
    "prompt_tokens": 48,
    "completion_tokens": 19,
    "total_tokens": 67
  }
}
```
### Model discovery

```sh
curl -H "X-API-Key: <your-api-key>" https://vision.klevze.net/v1/models
curl -H "X-API-Key: <your-api-key>" https://vision.klevze.net/ai/models
```

### Failure modes

- `401`: missing or invalid API key
- `413`: request body exceeds `LLM_MAX_REQUEST_BYTES`
- `422`: validation failure or unsupported streaming request
- `503`: LLM disabled or upstream unavailable
- `504`: upstream timeout
## Vector DB (Qdrant)

Use the Qdrant gateway endpoints to store image embeddings and find visually similar images. Embeddings are generated automatically by the CLIP service.

Qdrant point IDs must be either an unsigned integer or a UUID string. If you send any other string value, the wrapper may replace it with a generated UUID and store the original value in metadata as `_original_id`.
### Upsert (store) an image by URL

```sh
curl -X POST https://vision.klevze.net/vectors/upsert \
  -H "X-API-Key: <your-api-key>" \
  -H "Content-Type: application/json" \
  -d '{"url":"https://files.skinbase.org/img/aa/bb/cc/md.webp","id":"550e8400-e29b-41d4-a716-446655440000","metadata":{"category":"wallpaper","source":"upload"}}'
```

Parameters:

- `url` (required): image URL to embed and store.
- `id` (optional): point ID. Use an unsigned integer or UUID string. If omitted, a UUID is auto-generated.
- `metadata` (optional): arbitrary key-value payload stored alongside the vector.
- `collection` (optional): target collection name (defaults to `images`).
### Upsert by file upload

```sh
curl -X POST https://vision.klevze.net/vectors/upsert/file \
  -H "X-API-Key: <your-api-key>" \
  -F "file=@/path/to/image.webp" \
  -F 'id=550e8400-e29b-41d4-a716-446655440001' \
  -F 'metadata_json={"category":"photo"}'
```

### Upsert a pre-computed vector

```sh
curl -X POST https://vision.klevze.net/vectors/upsert/vector \
  -H "X-API-Key: <your-api-key>" \
  -H "Content-Type: application/json" \
  -d '{"vector":[0.1,0.2,...],"id":"550e8400-e29b-41d4-a716-446655440002","metadata":{"custom":"data"}}'
```
### Search similar images by URL

```sh
curl -X POST https://vision.klevze.net/vectors/search \
  -H "X-API-Key: <your-api-key>" \
  -H "Content-Type: application/json" \
  -d '{"url":"https://files.skinbase.org/img/aa/bb/cc/md.webp","limit":5}'
```

Parameters:

- `url` (required): query image URL.
- `limit` (optional, default 5): number of results.
- `score_threshold` (optional): minimum cosine similarity (0.0–1.0).
- `filter_metadata` (optional): filter results by payload fields, e.g. `{"is_public":true,"category_id":3}`.
- `collection` (optional): collection to search.
- `hnsw_ef` (optional, int): override the HNSW `ef` parameter at query time. Higher = better recall, slightly more latency.
- `exact` (optional, bool, default false): brute-force exact search. Avoid on large collections.
- `indexed_only` (optional, bool, default false): restrict search to fully indexed segments only. Useful during bulk ingest.

Return: list of `{"id", "score", "metadata"}` sorted by similarity.
### Search by file upload

```sh
curl -X POST https://vision.klevze.net/vectors/search/file \
  -H "X-API-Key: <your-api-key>" \
  -F "file=@/path/to/image.webp" \
  -F "limit=5" \
  -F 'filter_metadata_json={"is_public":true}'
```

All URL search parameters are available as form fields; use `filter_metadata_json` (JSON string) for filters.
### Search by pre-computed vector

```sh
curl -X POST https://vision.klevze.net/vectors/search/vector \
  -H "X-API-Key: <your-api-key>" \
  -H "Content-Type: application/json" \
  -d '{"vector":[0.1,0.2,...],"limit":5,"hnsw_ef":128}'
```
### Collection management

List all collections:

```sh
curl -H "X-API-Key: <your-api-key>" https://vision.klevze.net/vectors/collections
```

Get collection info:

```sh
curl -H "X-API-Key: <your-api-key>" https://vision.klevze.net/vectors/collections/images
```

Create a custom collection:

```sh
curl -X POST https://vision.klevze.net/vectors/collections \
  -H "X-API-Key: <your-api-key>" \
  -H "Content-Type: application/json" \
  -d '{"name":"my_collection","vector_dim":512,"distance":"cosine"}'
```

Delete a collection:

```sh
curl -H "X-API-Key: <your-api-key>" -X DELETE https://vision.klevze.net/vectors/collections/my_collection
```
### Full diagnostic inspect

Returns HNSW config, optimizer config, quantization, segment count, payload index coverage percentages, and RAM footprint estimate for every collection.

```sh
curl -H "X-API-Key: <your-api-key>" https://vision.klevze.net/vectors/inspect
```
### Payload index management

Payload indexes are critical for fast filtered vector search. Always create indexes for fields used in `filter_metadata` filters.

```sh
# List existing indexes
curl -H "X-API-Key: <your-api-key>" https://vision.klevze.net/vectors/collections/images/indexes

# Create a single index
curl -X POST https://vision.klevze.net/vectors/collections/images/indexes \
  -H "X-API-Key: <your-api-key>" \
  -H "Content-Type: application/json" \
  -d '{"field":"is_public","type":"bool"}'

# Ensure multiple indexes exist (idempotent — safe to run multiple times)
curl -X POST https://vision.klevze.net/vectors/collections/images/ensure-indexes \
  -H "X-API-Key: <your-api-key>" \
  -H "Content-Type: application/json" \
  -d '{"fields":[{"field":"is_public","type":"bool"},{"field":"is_deleted","type":"bool"},{"field":"category_id","type":"integer"},{"field":"user_id","type":"keyword"}]}'
```

Supported index types: `keyword`, `integer`, `float`, `bool`, `geo`, `datetime`, `text`, `uuid`.
### Collection configuration (HNSW / optimizer / quantization)

Updates HNSW, optimizer, or scalar quantization settings on an existing collection without data loss. HNSW graph and segment changes apply to newly created segments.

```sh
curl -X POST https://vision.klevze.net/vectors/collections/images/configure \
  -H "X-API-Key: <your-api-key>" \
  -H "Content-Type: application/json" \
  -d '{
    "hnsw_m": 16,
    "hnsw_ef_construct": 200,
    "hnsw_on_disk": false,
    "indexing_threshold": 20000,
    "default_segment_number": 4,
    "quantization_type": "int8",
    "quantization_quantile": 0.99,
    "quantization_always_ram": true
  }'
```

Parameters:

- `hnsw_m` (int, 4–64): edges per node in the HNSW graph.
- `hnsw_ef_construct` (int, 10–1000): `ef` during index construction.
- `hnsw_on_disk` (bool): store the HNSW graph on disk (saves RAM, slightly slower queries).
- `indexing_threshold` (int): minimum vector changes before a segment is indexed.
- `default_segment_number` (int, 1–32): target segment count for parallelism.
- `quantization_type` (string, `"int8"` or null): enable scalar quantization (~4× RAM reduction).
- `quantization_quantile` (float, 0.5–1.0, default 0.99): calibration quantile.
- `quantization_always_ram` (bool, default true): keep quantized vectors in RAM.
### Delete points

```sh
curl -X POST https://vision.klevze.net/vectors/delete \
  -H "X-API-Key: <your-api-key>" \
  -H "Content-Type: application/json" \
  -d '{"ids":["550e8400-e29b-41d4-a716-446655440000","550e8400-e29b-41d4-a716-446655440001"]}'
```

### Get a point by ID

```sh
curl -H "X-API-Key: <your-api-key>" https://vision.klevze.net/vectors/points/550e8400-e29b-41d4-a716-446655440000
```

### Get a point by original application ID

If the wrapper had to replace your string `id` with a generated UUID, the original value is preserved in metadata as `_original_id`.

```sh
curl -H "X-API-Key: <your-api-key>" https://vision.klevze.net/vectors/points/by-original-id/img-001
```
## Card Renderer

The card renderer generates branded social-card images from artwork photos. It applies smart center-weighted cropping, a gradient overlay, title/subtitle/username/category text, optional tags, and an optional logo.

Default output: 1200×630 WebP (`nova-artwork-v1` template).

### List available templates

```sh
curl -H "X-API-Key: <your-api-key>" https://vision.klevze.net/cards/templates
```
### Render a card from a URL

```sh
curl -X POST https://vision.klevze.net/cards/render \
  -H "X-API-Key: <your-api-key>" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://files.skinbase.org/img/aa/bb/cc/md.webp",
    "title": "Artwork Title",
    "subtitle": "Optional subtitle",
    "username": "@artist",
    "category": "Digital Art",
    "tags": ["surreal", "landscape"],
    "template": "nova-artwork-v1",
    "width": 1200,
    "height": 630,
    "output": "webp",
    "quality": 90,
    "show_logo": true
  }'
```

Returns binary image bytes with `Content-Type: image/webp`.
### Render a card from a file upload

```sh
curl -X POST https://vision.klevze.net/cards/render/file \
  -H "X-API-Key: <your-api-key>" \
  -F "file=@/path/to/image.webp" \
  -F "title=Artwork Title" \
  -F "username=@artist" \
  -F "template=nova-artwork-v1" \
  -F "show_logo=true"
```

Returns binary image bytes.
### Get card layout metadata (no image rendered)

```sh
curl -X POST https://vision.klevze.net/cards/render/meta \
  -H "X-API-Key: <your-api-key>" \
  -H "Content-Type: application/json" \
  -d '{"url":"https://files.skinbase.org/img/aa/bb/cc/md.webp","title":"Artwork Title"}'
```

Returns crop coordinates and layout data without producing an image.
## Request/Response notes

- For URL requests use `Content-Type: application/json`.
- For uploads use `multipart/form-data` with a `file` field.
- Most gateway endpoints require the `X-API-Key` header.
- Remote image URLs must resolve to public hosts and return an image content type.
- The gateway aggregates and normalizes outputs for `/analyze/all`.
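The public-host rule can be pre-checked before submitting a URL, which avoids a round trip for obviously rejectable inputs. A stdlib-only sketch (the function name is ours; it only catches literal IPs, `localhost`, and bad schemes, whereas the gateway may additionally resolve DNS and verify the response content type):

```python
import ipaddress
from urllib.parse import urlparse


def looks_publicly_fetchable(url: str) -> bool:
    """Cheap client-side pre-check mirroring the documented URL rules."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    if parsed.hostname == "localhost":
        return False
    try:
        ip = ipaddress.ip_address(parsed.hostname)
    except ValueError:
        # A DNS hostname, not a literal IP; deeper checks happen server-side.
        return True
    return not (ip.is_private or ip.is_loopback or ip.is_link_local)
```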
## Running a single service

To run only one service via docker compose:

```sh
docker compose up -d --build clip
```

Or run locally (Python env) from the service folder:

```sh
# inside clip/ or blip/ or yolo/
uvicorn main:app --host 0.0.0.0 --port 8000
```
## Production tips
- Add authentication (API keys or OAuth) at the gateway.
- Add rate-limiting and per-client quotas.
- Keep model services on an internal Docker network.
- For GPU: enable NVIDIA runtime and update service Dockerfiles / compose profiles.
## Troubleshooting

- Service fails to start: check `docker compose logs <service>` for model load errors.
- BLIP startup error about Hugging Face auth: set `HUGGINGFACE_TOKEN` in `.env` and rebuild `blip`.
- Qdrant upsert error about invalid point ID: use a UUID or unsigned integer for `id`, or omit it and use the returned generated `id`.
- Image URL rejected before download: the URL may point to localhost, a private IP, a non-`http`/`https` scheme, or a non-image content type.
- High memory / OOM: increase host memory or reduce model footprint; consider GPUs.
- Slow startup: model weights load on service startup — expect extra time. The maturity service (`start_period: 90s`) may take longer on first boot as it downloads the classifier weights (~330 MB). Mount `~/.cache/huggingface` as a volume to persist across rebuilds.
- Maturity endpoint returns `503`: `MATURITY_ENABLED` is set to `false` in environment configuration.
- Maturity endpoint returns `502`: the maturity container is unhealthy or still starting up; wait and retry.
## Extending

- Swap or update models in each service by editing that service's `main.py`.
- Add request validation, timeouts, and retries in the gateway to improve robustness.
## Files of interest

- `docker-compose.yml` — composition and service definitions.
- `gateway/` — gateway FastAPI server.
- `clip/`, `blip/`, `yolo/` — service implementations and Dockerfiles.
- `maturity/` — NSFW/maturity classifier service (ViT-based, Hugging Face `Falconsai/nsfw_image_detection`).
- `qdrant/` — Qdrant API wrapper service (FastAPI).
- `card-renderer/` — card rendering service (FastAPI).
- `common/` — shared helpers (e.g., image I/O).