Runbooks
Operational procedures for common and emergency situations.
Quota exhaustion
Symptoms: workflows fail with 403, gh api calls return empty, validate-config
skips with "quota too low".
Check current state:
curl -sf -H "Authorization: token $SYNC_TOKEN" \
"https://api.github.com/rate_limit" | \
python3 -c "
import sys, json
from datetime import datetime, timezone
d = json.load(sys.stdin)
core = d['resources']['core']
reset = datetime.fromtimestamp(core['reset'], tz=timezone.utc)
now = datetime.now(tz=timezone.utc)
eta = max(0, int((reset - now).total_seconds()))
print(f'Remaining : {core[\"remaining\"]}/{core[\"limit\"]}')
print(f'Reset at : {reset.strftime(\"%H:%M:%S UTC\")}')
print(f'ETA : {eta//60}m {eta%60}s')
"
Recovery:
- Wait for the reset (up to 1 hour). The reset time is shown above.
- While waiting, continue with local work: config validation, ShellCheck, pytest, and file edits need no quota.
- After reset, trigger pre-flush-prep.yml to clear the queue and restart the mirror chain cleanly.
Prevention: quota-reserve.yml cancels low-priority runs at < 1000 remaining.
If exhaustion is recurring, check config/workflow-cost-profiles.yml for
unexpectedly expensive workflows and consider raising MIN_QUOTA thresholds.
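
The threshold logic described above can be sketched in Python. This is a minimal illustration, not the code in quota-reserve.yml: the function name `quota_status` and the returned keys are hypothetical, and the only values taken from this runbook are the 1000-remaining threshold and the `resources.core` fields of the `/rate_limit` response.

```python
from datetime import datetime, timezone

MIN_REMAINING = 1000  # threshold this runbook quotes for quota-reserve.yml


def quota_status(payload, now, min_remaining=MIN_REMAINING):
    """Summarise the 'core' bucket of a /rate_limit response (hypothetical helper)."""
    core = payload["resources"]["core"]
    reset = datetime.fromtimestamp(core["reset"], tz=timezone.utc)
    # Clamp at zero in case the reset time has already passed.
    wait_seconds = max(0, int((reset - now).total_seconds()))
    return {
        "remaining": core["remaining"],
        "limit": core["limit"],
        "low": core["remaining"] < min_remaining,
        "wait_seconds": wait_seconds,
    }
```

Passing `now` explicitly keeps the function easy to test; a caller would normally supply `datetime.now(tz=timezone.utc)` and the JSON decoded from the curl call shown earlier.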
Queue pile-up
Symptoms: many workflows stuck in "queued" state, runners appear busy but nothing is completing.
Check:
# Via GitHub CLI (requires quota)
gh run list --repo Interested-Deving-1896/fork-sync-all --status queued --limit 50
Recovery:
- Trigger queue-manager.yml manually; it deduplicates and evicts runs queued > 25 minutes.
- If the queue is severely backed up, trigger pre-flush-prep.yml with skip_merge_prs=true and skip_cleanup=true; Step 1 aggressively clears stale runs before dispatching the flush.
- As a last resort, trigger critical-deploy.yml; it performs an aggressive queue clear and dispatches with priority.
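
To judge how bad a pile-up is before triggering anything, the 25-minute rule can be applied to the output of `gh run list`. The sketch below is an assumption about how one might do that by hand, not the logic inside queue-manager.yml. It expects the list produced by adding `--json databaseId,createdAt,status,workflowName` to the command above; `stale_queued_runs` is a hypothetical name.

```python
from datetime import datetime, timedelta, timezone

MAX_QUEUED = timedelta(minutes=25)  # eviction age quoted in this runbook


def stale_queued_runs(runs, now, max_age=MAX_QUEUED):
    """Return runs still queued for longer than max_age, oldest first."""
    stale = []
    for run in runs:
        if run.get("status") != "queued":
            continue
        # gh emits ISO 8601 UTC timestamps with a trailing "Z".
        created = datetime.fromisoformat(run["createdAt"].replace("Z", "+00:00"))
        if now - created > max_age:
            stale.append(run)
    # Same-format ISO timestamps sort chronologically as plain strings.
    return sorted(stale, key=lambda r: r["createdAt"])
```

A long result suggests going straight to pre-flush-prep.yml rather than waiting for queue-manager.yml to catch up.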
Token expiry
Symptoms: token-health.yml opens an issue labelled token-monitor, or
workflows fail with 401.
Check expiry:
bash scripts/token-monitor.sh
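
If the script is unavailable, expiry can also be read from the `github-authentication-token-expiration` response header that GitHub returns for tokens with an expiry date. The sketch below only parses that header value. It assumes the `YYYY-MM-DD HH:MM:SS UTC` format, and `days_until_expiry` is a hypothetical helper, not something token-monitor.sh is known to use.

```python
from datetime import datetime, timezone


def days_until_expiry(header_value, now):
    """Whole days left before the token expires (negative once expired)."""
    # Assumed header format: "2025-03-01 12:00:00 UTC"
    expires = datetime.strptime(header_value, "%Y-%m-%d %H:%M:%S UTC")
    expires = expires.replace(tzinfo=timezone.utc)
    return (expires - now).days
```

Tokens created without an expiry do not carry this header, so a missing header is not an error by itself.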
Rotate a token:
- Generate a new PAT at https://github.com/settings/tokens
- Go to rotate-token.yml → Run workflow
- Select the secret name from the dropdown
- Paste the new token value
- Leave validate checked
- Update the expiry date in the AGENTS.md token rotation table
For OSP org secrets (MIRROR_TOKEN, ORG_MIRROR_OSP_TO_OOC), see the
Token Rotation section in AGENTS.md — these
require a separate PAT with admin:org on OpenOS-Project-OSP.
Mirror chain broken
Symptoms: repos in OSP or OOC are behind I-D-1896 by more than one cycle, or GitLab mirrors show stale commits.
Diagnose:
# Check GitLab sync status (requires quota)
gh workflow run check-gitlab-sync.yml --repo Interested-Deving-1896/fork-sync-all
Recovery by leg:
| Broken leg | Fix |
|---|---|
| I-D-1896 → OSP | Trigger mirror-to-osp.yml manually |
| OSP → OOC | Trigger mirror-osp-to-ooc.yaml manually |
| OSP → GitLab | Trigger mirror-osp-to-gitlab.yml manually |
| GitLab → I-D-1896 | Trigger sync-from-gitlab.yml manually |
For a full chain reset, trigger full-chain-flush.yml directly (or via
pre-flush-prep.yml for a clean pre-flight first).
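
The table above maps directly onto a lookup. The sketch below builds the `gh workflow run` command for a given leg, in the same form as the diagnose command. It assumes every mirror workflow lives in Interested-Deving-1896/fork-sync-all, which this runbook does not state explicitly; `LEG_WORKFLOWS` and `recovery_command` are hypothetical names.

```python
# Assumption: all mirror workflows are dispatched from this repository.
REPO = "Interested-Deving-1896/fork-sync-all"

# Workflow file names copied from the recovery table.
LEG_WORKFLOWS = {
    "I-D-1896 -> OSP": "mirror-to-osp.yml",
    "OSP -> OOC": "mirror-osp-to-ooc.yaml",
    "OSP -> GitLab": "mirror-osp-to-gitlab.yml",
    "GitLab -> I-D-1896": "sync-from-gitlab.yml",
}


def recovery_command(leg, repo=REPO):
    """Build the gh argv that triggers the mirror workflow for one leg."""
    try:
        workflow = LEG_WORKFLOWS[leg]
    except KeyError:
        raise ValueError(f"unknown leg: {leg!r}") from None
    return ["gh", "workflow", "run", workflow, "--repo", repo]
```

Returning an argv list rather than a string lets the caller pass it to `subprocess.run` without shell quoting concerns.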
Config validation failure
Symptoms: validate-config.yml fails on push, blocking the flush.
Run locally to see the error:
python3 scripts/validate-gitlab-subgroups.py config/gitlab-subgroups.yml
python3 scripts/validate-registered-imports.py registered-imports.json
python3 scripts/validate-cost-profiles.py config/workflow-cost-profiles.yml
python3 scripts/validate-priority-tiers.py config/workflow-priority-tiers.yml
python3 scripts/validate-template-config.py
python3 scripts/validate-workflow-guards.py
Common causes:
- Duplicate repo name in gitlab-subgroups.yml
- Duplicate source_url or target_name in registered-imports.json
- Workflow added to .github/workflows/ but not registered in workflow-priority-tiers.yml or workflow-sync.yml
- Duplicate name in workflow-priority-tiers.yml
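
Most of these causes are duplicate keys, which are quick to spot before running the validators. The sketch below illustrates the idea and is not taken from the validator scripts. It assumes the file decodes to a list of objects carrying `source_url` and `target_name` keys, which may not match the real layout of registered-imports.json; `duplicate_values` is a hypothetical helper.

```python
from collections import Counter


def duplicate_values(entries, key):
    """Return the values of `key` that occur in more than one entry, sorted."""
    # Entries lacking the key are skipped rather than treated as duplicates.
    counts = Counter(entry[key] for entry in entries if key in entry)
    return sorted(value for value, n in counts.items() if n > 1)
```

Calling it once per key (`source_url`, then `target_name`) mirrors the two duplicate checks listed above.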
Vendor component agnostic check failure
Symptoms: enforce-agnostic-vendor.yml fails on a PR touching vendor/.
Run locally:
bash scripts/check-vendor-agnostic.sh vendor
The output shows the exact file, line, and category of violation. Fix by:
- Removing the hardcoded fallback value (set to empty string)
- Moving the value to a CI variable / repo var
- Adding # check-vendor-agnostic: ignore if the value is genuinely deployment-agnostic (rare; document why)
README render failure
Symptoms: validate-readme-render.yml fails on a PR.
Run locally:
bash scripts/check-readme-render.sh README.md
# Also run the self-test to verify the checker itself is working:
bash scripts/tests/test-check-readme-render-mobile.sh
Common causes: unclosed fences, leaked log lines, bare [text] links without
URLs, raw angle brackets, broken tables, missing H1.
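
Unclosed fences are the easiest of these causes to check by hand. The sketch below is a deliberately simple approximation and not the logic of check-readme-render.sh: it counts lines that open or close a backtick fence and flags an odd total. It ignores tilde fences and fences nested inside longer fences; `has_unclosed_fence` is a hypothetical name.

```python
# Built programmatically so the marker never appears literally in this example.
FENCE = "`" * 3


def has_unclosed_fence(markdown):
    """True if the number of backtick fence lines is odd."""
    fence_lines = [
        line for line in markdown.splitlines()
        if line.lstrip().startswith(FENCE)
    ]
    return len(fence_lines) % 2 == 1
```

An odd count means at least one fence never closes, which typically causes everything after it to render as code.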
OTA delivery failure
Symptoms: ota-release.yml fails for one or more forks, or a fork's
ota-self-update.yml fails.
For a single fork:
- Check the fork's ota-self-update.yml run logs for the specific error
- Common causes: fork has diverged significantly, pinned_sha is stale, or the fork's .ota/config.yml has an invalid field
- To reset: update pinned_sha in the fork's .ota/config.yml to the current upstream HEAD SHA, then re-trigger ota-self-update.yml
To skip a fork temporarily:
Set disabled: true in its config/ota-registry.yml entry.
To re-deliver to all forks:
Push a new semver tag to fork-sync-all — ota-release.yml triggers automatically.
Incident response checklist
For any production incident affecting the mirror chain:
- Check quota: if exhausted, wait for reset before doing anything else
- Check queue: trigger queue-manager.yml to clear pile-ups
- Identify the broken leg: use check-gitlab-sync.yml and manual inspection
- Fix the specific leg: trigger the relevant mirror workflow directly
- Validate config: run all validators locally before triggering a flush
- Run pre-flush-prep: let it clean up and restart the chain
- Monitor: watch the first few workflow runs after recovery for secondary failures