SkinbaseNova/docs/feed-rollout-runbook.md
2026-02-14 15:14:12 +01:00

Feed Rollout Runbook (clip-cosine-v2, prod set 1)

Scope

  • Candidate: clip-cosine-v2 with weights w1=0.52, w2=0.23, w3=0.15, w4=0.10
  • Baseline: clip-cosine-v1
  • Rollout gates: 10% -> 50% -> 100%
  • Temporary policy: save_rate is informational only until save-event schema reliability is confirmed in production.

Pre-flight checks

  1. Confirm config values:
    • DISCOVERY_ROLLOUT_ENABLED=true
    • DISCOVERY_ROLLOUT_BASELINE_ALGO_VERSION=clip-cosine-v1
    • DISCOVERY_ROLLOUT_CANDIDATE_ALGO_VERSION=clip-cosine-v2
    • DISCOVERY_ROLLOUT_ACTIVE_GATE=g10
    • DISCOVERY_FORCE_ALGO_VERSION is empty
  2. Confirm candidate weights are active in config/discovery.php and env overrides.
  3. Confirm ingestion health for discovery events:
    • event_id populated for all new events
    • favorite and download events present in user_discovery_events
  4. Run daily aggregation:
    • php artisan analytics:aggregate-feed --date=YYYY-MM-DD
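Step 1 of the pre-flight checks can be automated with a small comparison against the expected values. The sketch below is illustrative only: `preflight_mismatches` is a hypothetical helper, and it compares raw strings, which is an assumption about how the values are stored in the environment.

```python
# Expected pre-flight values, copied from the list above.
EXPECTED = {
    "DISCOVERY_ROLLOUT_ENABLED": "true",
    "DISCOVERY_ROLLOUT_BASELINE_ALGO_VERSION": "clip-cosine-v1",
    "DISCOVERY_ROLLOUT_CANDIDATE_ALGO_VERSION": "clip-cosine-v2",
    "DISCOVERY_ROLLOUT_ACTIVE_GATE": "g10",
    "DISCOVERY_FORCE_ALGO_VERSION": "",
}


def preflight_mismatches(env):
    """Return {key: (expected, actual)} for every value that differs.

    A missing key is treated as an empty string, so an unset
    DISCOVERY_FORCE_ALGO_VERSION counts as "empty".
    """
    mismatches = {}
    for key, expected in EXPECTED.items():
        actual = env.get(key, "").strip()
        if actual != expected:
            mismatches[key] = (expected, actual)
    return mismatches
```

Calling `preflight_mismatches(dict(os.environ))` on the application host should return an empty dict before Gate 1 starts. It does not replace step 2, since weights set in config/discovery.php are not visible from the environment alone.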

Gate progression

Gate 1: 10%

  • Set: DISCOVERY_ROLLOUT_ACTIVE_GATE=g10
  • Observe for at least 2 days (preferably 3), and judge results only once the minimum sample volume is reached.
  • Required checks:
    • CTR delta vs baseline
    • Long-dwell-share delta vs baseline
    • Diversity concentration delta vs baseline
    • Save-rate trend (informational only)

Promote to 50% only if no rollback trigger fires and no persistent warning trend is present.

Gate 2: 50%

  • Set: DISCOVERY_ROLLOUT_ACTIVE_GATE=g50
  • Observe for 3-5 days with stable daily traffic.
  • Apply the same checks and thresholds as in Gate 1.

Promote to 100% only with at least 2 consecutive healthy days.

Gate 3: 100%

  • Set: DISCOVERY_ROLLOUT_ACTIVE_GATE=g100
  • Keep baseline available for rapid rollback via force toggle.
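The promotion rules above can be summarized as a small decision function. This is a sketch under stated assumptions: `next_gate` and its parameters are hypothetical names, the minimum observation days are taken as the lower bounds of the "2-3" and "3-5" day ranges, and the "no persistent warning trend" condition is assumed to apply at both gates.

```python
GATE_ORDER = ["g10", "g50", "g100"]

# Lower bounds of the observation windows (assumption: "2-3" -> 2, "3-5" -> 3).
MIN_OBSERVATION_DAYS = {"g10": 2, "g50": 3}


def next_gate(current, days_observed, rollback_fired,
              persistent_warning, consecutive_healthy_days):
    """Return the gate to run next; returning `current` means "hold"."""
    if current == "g100":
        return current  # already fully rolled out
    if rollback_fired or persistent_warning:
        return current  # never promote; rollback is handled separately
    if days_observed < MIN_OBSERVATION_DAYS[current]:
        return current  # observation window not complete yet
    if current == "g50" and consecutive_healthy_days < 2:
        return current  # 100% requires 2 consecutive healthy days
    return GATE_ORDER[GATE_ORDER.index(current) + 1]
```

The function only decides whether promotion is allowed. A fired rollback trigger still has to be acted on through the rollback actions, and the minimum sample volume is not modeled because the runbook does not give a number for it.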

Monitoring thresholds (candidate vs baseline)

  • CTR:
    • Warning: drop >= 3%
    • Rollback: drop >= 5% (or >= 10% in a single severe window)
  • Long dwell share ((dwell_30_120 + dwell_120_plus) / clicks):
    • Warning: drop >= 4%
    • Rollback: drop >= 8% (or >= 12% in a single severe window)
  • Diversity concentration (e.g. top-author/top-category share, near-duplicate concentration):
    • Warning: rise >= 10%
    • Rollback: rise >= 15%
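As an illustration, the thresholds above can be applied mechanically to one baseline/candidate pair. The sketch assumes the percentages are relative changes against the baseline (the runbook does not say whether they are relative or absolute), and it omits the single-severe-window thresholds because the window is not defined here. All function names are hypothetical.

```python
# metric: (adverse direction, warning %, rollback %)
THRESHOLDS = {
    "ctr": ("drop", 3.0, 5.0),
    "long_dwell_share": ("drop", 4.0, 8.0),
    "diversity_concentration": ("rise", 10.0, 15.0),
}


def long_dwell_share(dwell_30_120, dwell_120_plus, clicks):
    """(dwell_30_120 + dwell_120_plus) / clicks, or 0.0 when there are no clicks."""
    if clicks == 0:
        return 0.0
    return (dwell_30_120 + dwell_120_plus) / clicks


def relative_change_pct(baseline, candidate):
    """Relative change of candidate vs baseline, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (candidate - baseline) / baseline * 100.0


def classify(metric, baseline, candidate):
    """Return "ok", "warning", or "rollback" for one metric."""
    direction, warning, rollback = THRESHOLDS[metric]
    change = relative_change_pct(baseline, candidate)
    # Express the change so that larger always means worse.
    adverse = -change if direction == "drop" else change
    if adverse >= rollback:
        return "rollback"
    if adverse >= warning:
        return "warning"
    return "ok"
```

For example, a baseline CTR of 0.10 against a candidate CTR of 0.096 is a 4% relative drop, which `classify("ctr", 0.10, 0.096)` reports as a warning.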

Rollback actions

Immediate rollback (fastest)

  • Set DISCOVERY_FORCE_ALGO_VERSION=clip-cosine-v1
  • Reload config/cache as needed in your deployment flow.
  • Verify feed responses show meta.algo_version=clip-cosine-v1.
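The verification step can be scripted against a handful of sampled feed responses. This sketch assumes the response body is JSON with a top-level `meta` object containing `algo_version`; the exact response shape and the helper names are assumptions, and how the responses are fetched is left out.

```python
import json


def algo_version_of(response_body):
    """Extract meta.algo_version from a JSON feed response body, or None."""
    payload = json.loads(response_body)
    if not isinstance(payload, dict):
        return None
    meta = payload.get("meta")
    if not isinstance(meta, dict):
        return None
    return meta.get("algo_version")


def rollback_verified(response_bodies, expected="clip-cosine-v1"):
    """True only if at least one response was sampled and all report `expected`."""
    versions = [algo_version_of(body) for body in response_bodies]
    return bool(versions) and all(v == expected for v in versions)
```

Sampling several responses rather than one guards against a stale cache on a single node still serving the candidate.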

Standard rollback

  • Set DISCOVERY_ROLLOUT_ACTIVE_GATE=g10 (or disable rollout)
  • Keep candidate enabled only for controlled validation traffic.

Save-event schema note and fix

Issue class observed in mixed environments: save-event writes can fail when the discovery event schema differs from what the code expects (e.g., meta vs. metadata column drift, or a required event_id).

Implemented fix path:

  • Ingestion now always writes event_id and inserts schema-aware metadata (meta if present, otherwise metadata if present).
  • Keep DISCOVERY_EVAL_SAVE_RATE_INFORMATIONAL=true until production confirms stable save-event ingestion.
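The schema-aware write described above can be pictured as follows. This is not the actual ingestion code: the helper names are hypothetical, and generating a UUID when no event_id is supplied is an assumption about how "always writes event_id" is achieved.

```python
import uuid


def metadata_column(table_columns):
    """Pick the metadata column: meta if present, otherwise metadata, else None."""
    if "meta" in table_columns:
        return "meta"
    if "metadata" in table_columns:
        return "metadata"
    return None


def build_event_row(table_columns, event_type, payload, event_id=None):
    """Build an insertable row that always carries a non-empty event_id."""
    row = {
        "event_id": event_id or str(uuid.uuid4()),  # assumption: UUID fallback
        "event_type": event_type,
    }
    column = metadata_column(table_columns)
    if column is not None:
        row[column] = payload  # skipped entirely if neither column exists
    return row
```

With this shape, the same code path works against both schema variants, and a table with neither column still receives the event without its metadata.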

Validation query examples:

  • Save events by day:
    • SELECT event_date, COUNT(*) FROM user_discovery_events WHERE event_type IN ('favorite','download') GROUP BY event_date ORDER BY event_date DESC;
  • Null/empty event id check:
    • SELECT COUNT(*) FROM user_discovery_events WHERE event_id IS NULL OR event_id = '';
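To show what the two queries return, the sketch below runs them unchanged against an in-memory SQLite table. The three-column schema and the sample rows (including the `click` event type) are made up for the demonstration; the production table will have more columns and may run on a different database engine.

```python
import sqlite3

SAVE_EVENTS_BY_DAY = (
    "SELECT event_date, COUNT(*) FROM user_discovery_events "
    "WHERE event_type IN ('favorite','download') "
    "GROUP BY event_date ORDER BY event_date DESC;"
)
MISSING_EVENT_ID = (
    "SELECT COUNT(*) FROM user_discovery_events "
    "WHERE event_id IS NULL OR event_id = '';"
)


def demo_connection():
    """Create a throwaway table with hypothetical sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE user_discovery_events "
        "(event_id TEXT, event_type TEXT, event_date TEXT)"
    )
    conn.executemany(
        "INSERT INTO user_discovery_events VALUES (?, ?, ?)",
        [
            ("e1", "favorite", "2026-02-13"),
            ("e2", "download", "2026-02-13"),
            ("e3", "click", "2026-02-13"),
            ("", "favorite", "2026-02-14"),   # empty event_id
            (None, "click", "2026-02-14"),    # NULL event_id
        ],
    )
    return conn


def run_validation(conn):
    """Return (save events per day, count of rows with missing event_id)."""
    by_day = conn.execute(SAVE_EVENTS_BY_DAY).fetchall()
    missing = conn.execute(MISSING_EVENT_ID).fetchone()[0]
    return by_day, missing
```

On the sample data, the first query counts only favorite and download rows per day, and the second flags both the empty-string and the NULL event_id. In production the second count should be zero for events written after the fix.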

Daily operator checklist

  1. Run feed aggregation for the previous day.
  2. Run the evaluation and comparison commands:
    • php artisan analytics:evaluate-feed-weights --from=YYYY-MM-DD --to=YYYY-MM-DD --json
    • php artisan analytics:compare-feed-ab clip-cosine-v1 clip-cosine-v2 --from=YYYY-MM-DD --to=YYYY-MM-DD --json
  3. Record deltas for CTR, long_dwell_share, diversity concentration.
  4. Record save_rate as informational only.
  5. Decide: hold, promote gate, or rollback.
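Steps 1 and 2 lend themselves to a small wrapper that fills in the previous day's date. The sketch below only builds the argument lists for the artisan commands named in this runbook. `daily_commands` is a hypothetical helper, and passing the same date to `--from` and `--to` assumes a single-day window is what the commands expect.

```python
from datetime import date, timedelta


def daily_commands(today, baseline="clip-cosine-v1", candidate="clip-cosine-v2"):
    """Build the aggregation, evaluation, and comparison commands for yesterday."""
    day = (today - timedelta(days=1)).isoformat()  # YYYY-MM-DD
    return [
        ["php", "artisan", "analytics:aggregate-feed", f"--date={day}"],
        ["php", "artisan", "analytics:evaluate-feed-weights",
         f"--from={day}", f"--to={day}", "--json"],
        ["php", "artisan", "analytics:compare-feed-ab", baseline, candidate,
         f"--from={day}", f"--to={day}", "--json"],
    ]
```

Each list can be passed to `subprocess.run(cmd, check=True, capture_output=True)` in order, so that a failed aggregation stops the run before evaluation starts.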

First 24h verification checklist

  1. Confirm rollout activation and gate state:
    • DISCOVERY_ROLLOUT_ENABLED=true
    • DISCOVERY_ROLLOUT_ACTIVE_GATE=g10
    • DISCOVERY_FORCE_ALGO_VERSION is empty
  2. Verify both algos are receiving traffic in analytics:
    • candidate (clip-cosine-v2) should be near 10% share (allow normal variance)
    • baseline (clip-cosine-v1) remains dominant
  3. Run aggregation/evaluation at least twice in the first day (midday + end-of-day):
    • php artisan analytics:aggregate-feed --date=YYYY-MM-DD
    • php artisan analytics:evaluate-feed-weights --from=YYYY-MM-DD --to=YYYY-MM-DD --json
    • php artisan analytics:compare-feed-ab clip-cosine-v1 clip-cosine-v2 --from=YYYY-MM-DD --to=YYYY-MM-DD --json
  4. Check guardrails:
    • CTR drop < rollback threshold
    • long_dwell_share drop < rollback threshold
    • diversity concentration rise < rollback threshold
  5. Check save-event ingestion health:
    • save events (favorite, download) are arriving in user_discovery_events
    • event_id is always populated
  6. If any rollback trigger is breached, apply the emergency rollback preset immediately.
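For the traffic-share check in this checklist, "normal variance" can be made concrete with a simple binomial tolerance. This is one possible reading, not a rule from the runbook: it assumes each counted unit is assigned to the candidate independently with probability 0.10, and the 3-standard-error band (`z=3.0`) is an arbitrary illustrative choice.

```python
import math


def share_within_variance(candidate_count, total_count, expected_share=0.10, z=3.0):
    """True if the observed candidate share is within z standard errors of expected."""
    if total_count <= 0:
        return False  # no traffic observed, nothing to confirm
    observed = candidate_count / total_count
    std_err = math.sqrt(expected_share * (1.0 - expected_share) / total_count)
    return abs(observed - expected_share) <= z * std_err
```

With 10,000 counted requests the band is roughly 10% ± 0.9 percentage points. If assignment is per user rather than per request, the counts should be of users, otherwise heavy users make the real variance larger than this estimate.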