Troubleshooting
Common problems and how to diagnose them. Start with the health page at https://{your-domain}/health — it covers most of what could be wrong in one place.
First-Line Diagnostics
Before diving into a specific problem, run these three commands. They catch the majority of issues in under a minute.

# 1. Health check — queue workers, Claude CLI, MCP servers, repo fetchability
docker exec yak php artisan yak:healthcheck

# 2. Recent application logs
docker logs yak --tail 100

# 3. Queue status
docker exec yak php artisan queue:monitor yak-claude,default

If the health check is green and logs are clean, the problem is usually in the external service (Slack app, Linear webhook, GitHub App installation) rather than Yak itself.
PR Review Not Posting
Symptoms: a PR opened on a repo with pr_review_enabled = true but no Yak review comment appears.
Checklist
- GitHub webhook reaching Yak? Check webhook delivery in GitHub App settings. pull_request.opened and pull_request.synchronize must be subscribed.
- Repo active and enabled? Go to /repos/{id}/edit and confirm Active and PR Review are both on.
- PR a draft, or authored by yak-bot? Both are intentionally skipped. Convert the draft to ready-for-review to trigger.
- Task dispatched but failed? Look in /tasks?tab=reviews for a failed row. Common failure modes:
  - Sandbox checkout failure — the PR’s head wasn’t fetchable (force-pushed, branch deleted). Inspect the task activity log.
  - Claude output didn’t contain a valid JSON block — usually means Claude failed the review instead of producing findings. The raw output lives in the task’s result_summary (see the tinker sketch after this list).
- Path filters too aggressive? If the PR only touches excluded paths, the review is filtered out. Check pr_review_path_excludes on the repo.
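To pull that raw output from the console, a quick tinker query works — a sketch using the YakTask model and the columns named elsewhere on this page:

docker exec yak php artisan tinker
# >>> Most recent failed task, with its error and raw Claude output:
# >>> App\Models\YakTask::where('status', 'failed')->latest()->first(['id', 'error_log', 'result_summary']);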
Resolution
Re-run manually with the Re-run review button on the TaskDetail page. If the same task keeps failing, lower the prompt’s max_findings_per_review or check the tasks-review prompt at /prompts for custom edits.
Task Stuck In running
Symptoms: a task’s status on the dashboard is running and hasn’t moved for several minutes beyond the expected Claude Code duration (typically 2–10 minutes).
Likely Causes
- Queue worker crashed mid-task. Supervisord will restart the worker, but the task row remains in running until a human resets it.
- Claude CLI hung on network I/O: an MCP server call, docker-compose bringing up a service, or a very long test suite.
- Budget exhausted silently. The task hit --max-budget-usd and the job didn’t update status cleanly.
Diagnosis
# Is the Claude queue worker actually running?
docker exec yak ps aux | grep "queue:work"

# Any recent errors in the logs?
docker logs yak --tail 200 | grep -i error

# What does the task's own timeline say? Look at its detail page:
# https://{your-domain}/tasks/{id}
# The Debug section at the bottom has session ID, cost, turns, full Claude output.

Resolution
If the worker is running but the task is stale, it will eventually be killed by the 600-second timeout on the yak-claude queue. To manually fail a stuck task:
docker exec yak php artisan tinker
# >>> App\Models\YakTask::find($id)->update(['status' => 'failed', 'error_log' => 'Manual reset']);

Manually resetting a task does not restart it. If you want Yak to try again, create a new task with the same description.
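To list every task currently sitting in running past a reasonable cutoff, a sketch using the same model (the 30-minute threshold is arbitrary; updated_at is the standard Laravel timestamp):

docker exec yak php artisan tinker
# >>> App\Models\YakTask::where('status', 'running')
# >>>     ->where('updated_at', '<', now()->subMinutes(30))
# >>>     ->get(['id', 'status', 'updated_at']);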
Setup Task Fails For A Repo
Symptoms: you add a new repo via the dashboard or Ansible, and the repo’s setup_status goes from running to failed.
Common Causes
- Missing system dependencies — the repo’s dev environment needs a tool that isn’t in the Yak Docker image (e.g., a specific Node version, or pg_config).
- Docker-in-Docker issues — the repo’s docker-compose.yml references services that need privileged mode or specific network configuration.
- Private package registry auth — the repo uses a private npm/composer registry and the token isn’t configured. See Agent Environment Variables below.
- CLAUDE.md missing or misleading — Claude couldn’t figure out the correct setup commands from the README alone.
Diagnosis
- Open the setup task’s detail page: https://{your-domain}/tasks/{setup_task_id}
- Expand the Debug section at the bottom
- Read the full Claude Code output — it reports exactly what command failed and why
The task detail page shows the Claude session as collapsible CI-style steps. Each turn includes the tool type, a description, duration, and the full terminal output.
Resolution
Fix the underlying issue, then re-run:
docker exec yak php artisan yak:setup-repo {slug}

Or click Re-run Setup on the repo’s edit page.
If the issue is CLAUDE.md coverage, update the CLAUDE.md file in the target repo with the specific commands Yak got wrong. See Repositories → CLAUDE.md.
Webhooks Not Arriving
Symptoms: you @yak in Slack (or assign a Linear issue to Yak, or trigger a Sentry alert) and nothing happens. No task appears on the dashboard.
Checklist
- Is the channel actually enabled? The webhook endpoint only exists if the channel’s credentials are present in the vault and Ansible has been re-run since they were added.

  docker exec yak php artisan route:list | grep webhooks

  Only enabled channels appear. If /webhooks/slack is missing, Slack is not enabled.

- Can the external service reach your server? Test from the internet:

  curl -I https://{your-domain}/webhooks/slack
  # Expect: 200 (OK) or 401 (signature required), NOT 404 or connection refused

- Check Caddy / nginx logs for the incoming request:

  docker exec yak tail -f /var/log/caddy/access.log

- Check signing secrets match. Slack rejects with 401 if slack_signing_secret doesn’t match the app’s signing secret. Linear and Sentry do the same with their respective webhook secrets. (See the signature check after this list.)

- UFW rules — port 443 must be open for inbound HTTPS:

  ssh root@{server} ufw status
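If you suspect a signing-secret mismatch, you can recompute Slack’s signature by hand: Slack HMAC-SHA256-signs the string v0:{timestamp}:{raw body} with the app’s signing secret and sends the result as the X-Slack-Signature header. A sketch with placeholder values:

# Values captured from one webhook delivery (placeholders):
TS="1700000000"                   # X-Slack-Request-Timestamp header
BODY='{"type":"event_callback"}'  # raw request body, byte-for-byte
SECRET="your-signing-secret"      # must equal slack_signing_secret in the vault

printf 'v0:%s:%s' "$TS" "$BODY" | openssl dgst -sha256 -hmac "$SECRET" -hex
# Prefix the hex digest with "v0=" and compare it to X-Slack-Signature.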
Per-Channel Gotchas
- Slack — channel history scope is required for thread reply matching. If clarification replies don’t route to the right task, verify channels:history is in the bot scopes.
- Linear — the webhook must subscribe to Agent session events so delegation events come through, and the OAuth connection must be active; if the linear_oauth_connections.installer_user_id column is null, re-authorize the app from Yak’s settings. The install requires workspace admin approval — a non-admin install will appear to succeed, but agent session events will not arrive.
- Sentry — alerts must be tagged yak-eligible. Alerts without the tag are ignored even if they hit the webhook.
- GitHub — the App must be installed on the target org and must have webhook events for check_suite.completed and pull_request.closed.
Claude CLI Errors
Section titled “Claude CLI Errors”CLI Not Found Or Not Responding
docker exec yak claude --version
docker exec yak claude -p "Say hello" --output-format json

If the first command fails, the CLI isn’t installed in the container — rebuild the Docker image. If the second command hangs or errors, the CLI is installed but can’t reach Anthropic — check network connectivity.
Authentication Failures (Token Expired)
Symptoms: tasks fail with error_log mentioning auth, 401, or “token expired”. The health check posts an alert to Slack.
Claude Code authenticates via an interactive claude login session token stored in /home/yak/.claude/. When the token expires, every Claude Code job fails gracefully with an auth error — tasks are marked failed, a notification goes to the source, and the health check raises an alert.
Resolution:
docker exec -it yak claude login

Follow the browser-based OAuth flow. The new session token persists in the mounted volume and takes effect immediately. No restart is needed.
MCP Server Connection Issues
docker exec yak cat /home/yak/mcp-config.json

This shows which MCP servers are currently configured. If a server you expect is missing, the corresponding channel isn’t enabled — re-run Ansible with the channel’s credentials set.
If a server is configured but Claude can’t reach it, check:
- Network connectivity from the Yak container
- Credentials (GITHUB_PAT, LINEAR_API_KEY, SENTRY_AUTH_TOKEN) are set in the container env
- The MCP server URL is not blocked by any firewall or proxy
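For reference, Claude Code MCP config files use the mcpServers shape. The actual entries in /home/yak/mcp-config.json depend on which channels are enabled; a GitHub entry might look roughly like this (package and env names are illustrative, not necessarily what Yak ships):

{
  "mcpServers": {
    "github": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-github"],
      "env": { "GITHUB_PERSONAL_ACCESS_TOKEN": "ghp_..." }
    }
  }
}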
CI Integration Issues
Section titled “CI Integration Issues”PRs Not Being Created
Symptoms: a task’s status is awaiting_ci and never advances even though CI actually ran.
- CI result not reaching Yak. For GitHub Actions, check Caddy/nginx logs for inbound requests to /webhooks/ci/github. For Drone, check the scheduler/default worker logs — CI results are polled by yak:poll-drone-ci every minute, not pushed.
- Wrong CI system. The repo’s ci_system must match which CI is authoritative for that repo. A GitHub Actions webhook for a repo configured as drone is silently dropped.
- Branch name mismatch. The task’s branch_name must match what was pushed. Look at the task’s Debug section for the actual branch name. (See the remote check after this list.)
- GitHub App permissions. The App needs Checks: Read and Pull requests: Read & Write to receive check suite events and create PRs.
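To rule out a branch-name mismatch quickly, confirm that the branch the task thinks it pushed actually exists on the remote (substitute the org/repo and the branch_name from the Debug section):

git ls-remote https://github.com/{org}/{repo}.git "refs/heads/{branch_name}"
# No output means the branch does not exist on the remote.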
CI Keeps Failing On The Same Issue
Retries are capped at one. After two failed attempts, the task is marked failed and a human has to take over. If you see this pattern repeatedly for a specific repo:
- The repo’s CLAUDE.md likely needs a rule that would have prevented the class of mistake
- The Yak system prompt may need tuning for your team’s conventions (see Prompting → Customizing the System Prompt)
High Costs
Open https://{your-domain}/costs. The cost dashboard shows daily totals, per-source breakdown, and the 30-day trend.
Routing Layer Spike (Haiku/Sonnet)
If routing-layer costs are climbing, the cause is almost always one of:
- Sentry alert storm — a noisy alert rule is creating many tasks. Check the /tasks page filtered by source: sentry for a cluster of similar tasks. Tighten the alert rule in Sentry, or raise the min_events threshold in config/yak.php (see the sketch after this list).
- Slack bot being over-mentioned — a user is pasting long threads that hit @yak. Check the task list filtered by source: slack.
- Failed webhook retries — some services retry webhook delivery on 5xx responses, creating duplicate routing calls. The UNIQUE(external_id, repo) constraint deduplicates tasks, but not routing analysis.
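The surrounding structure of config/yak.php isn’t shown here, but a Laravel config entry of this kind typically looks like the following — the min_events key name comes from this page, while the nesting and value are assumed:

// config/yak.php — illustrative fragment only
return [
    // ...
    'sentry' => [
        // Ignore Sentry alerts until the issue has accumulated this many events.
        'min_events' => 50,
    ],
];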
Implementation Layer (Claude Code)
Implementation cost is covered by the Claude Max subscription, not billed per token. The cost dashboard shows Claude Code usage for monitoring but it does not affect your bill.
Budget Enforcement
The EnsureDailyBudget job middleware fails new Claude Code jobs gracefully once the daily routing-layer budget is exceeded. Raise the limit via:
# In ansible/vault/secrets.yml or the Yak container env:
YAK_DAILY_BUDGET=100

Re-run Ansible or restart the container.
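For context, Laravel job middleware wraps a job’s handle() call and can fail the job before it runs. A minimal sketch of the pattern — not Yak’s actual EnsureDailyBudget code; the cost-counter cache key is hypothetical and the job is assumed to use InteractsWithQueue:

<?php

use Illuminate\Support\Facades\Cache;

class EnsureDailyBudget
{
    // Laravel invokes job middleware with the job instance and a $next closure.
    public function handle(object $job, callable $next): void
    {
        $limit = (float) env('YAK_DAILY_BUDGET', 100);             // env var shown above
        $spent = (float) Cache::get('yak:routing_cost_today', 0); // hypothetical counter

        if ($spent >= $limit) {
            // Mark the job failed instead of letting it spend more.
            $job->fail(new \RuntimeException('Daily routing budget exceeded'));
            return;
        }

        $next($job);
    }
}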
Health Check Failures
The /health page (and the scheduled yak:healthcheck command) runs these checks every 15 minutes:
| Check | If failing |
|---|---|
| Queue worker running | Supervisord crash — docker restart yak |
| Last task completed within N hours | No traffic, or workers hung on a stuck task |
| All repos fetchable | Git auth issue — re-run yak:refresh-repos manually, check SSH key |
| Claude CLI responding | See Claude CLI errors above |
| Claude CLI authenticated | Token expired — run docker exec -it yak claude login |
| Enabled channel MCP servers reachable | Network issue or external service down |
Failed health checks post to Slack if the Slack channel is enabled. If Slack isn’t available, check the health page manually or set up external monitoring against /health.
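A bare-bones external probe can be a cron job on another host. This assumes /health returns a non-2xx status when a check fails — verify that against your instance before relying on it (the alert command is a placeholder):

# crontab entry on a separate machine:
*/15 * * * * curl -fsS --max-time 30 https://{your-domain}/health > /dev/null || /usr/local/bin/alert.sh "Yak health check failed"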
Agent Environment Variables Not Visible
Symptoms: the agent can’t find a token that the repo needs at build time (e.g. npm install fails with 401 on a private registry).
Each task runs in its own Incus sandbox container. Sandboxes start from the base template snapshot and have no access to the yak app’s environment. Only variables explicitly listed in agent_extra_env are pushed into the sandbox.
Resolution
Add the token to agent_extra_env in your Ansible vault:
agent_extra_env:
  NODE_AUTH_TOKEN: "ghp_..."

Redeploy and re-run the affected repo’s setup task — the new env vars are baked into the next snapshot.
To verify the var is set inside a running sandbox:
incus exec task-<id> -- printenv NODE_AUTH_TOKEN

Private Docker Images Fail to Pull
Symptoms: setup or task execution fails with docker pull errors like unauthorized or denied: requested access to the resource is denied when the repo’s docker-compose.yml references images from a private registry (ghcr.io, a self-hosted registry, etc.).
Sandboxes start with no Docker authentication. Without credentials, the in-container Docker daemon can only pull public images. Base images shared across services typically live in private registries, so the first docker-compose up in setup fails and the whole snapshot never materialises.
Resolution
Add registry credentials to docker_registries in your Ansible vault:
docker_registries:
  ghcr.io:
    username: "your-github-username"
    password: "ghp_..."  # PAT with `read:packages` scope
  registry.example.com:
    username: "deploy"
    password: "..."

Redeploy. Ansible renders these into ~/.docker/config.json on the host, bind-mounts the file into the Yak container, and IncusSandboxManager pushes it into every sandbox at /home/yak/.docker/config.json. docker pull, docker-compose up, and BuildKit pick it up automatically — no docker login call needed inside the sandbox.
Re-run the affected repo’s setup task so the snapshot picks up the newly-cached images.
To verify the file landed inside a running sandbox:
incus exec task-<id> -- cat /home/yak/.docker/config.json

Credentials live on the host as 0600 and inside the sandbox as 0600 owned by yak:yak, so they aren’t readable by other processes the agent spawns.
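You can also exercise the credentials directly by pulling a known private image inside the sandbox (substitute an image your team actually uses):

incus exec task-<id> -- docker pull ghcr.io/{org}/{image}:latest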
Emergency: Kill Everything And Restart
If Yak is in a bad state and you can’t figure out what’s wrong:
# Stop the container gracefully (in-flight tasks finish first)
docker stop yak

# Start it again
docker start yak

# Verify
docker exec yak php artisan yak:healthcheck

MariaDB runs as a separate container (yak-mariadb) and is unaffected by Yak container restarts. Repo clones and the Claude session token persist via mounted volumes. Nothing is lost.
If supervisord itself is wedged, restart the container outright:
docker restart yak

This is safe — the queues are MariaDB-backed and any in-flight jobs will be retried on the next worker boot (with the caveat that tasks mid-claude -p session may be left in running and need manual reset per the earlier section).
MariaDB Issues
Section titled “MariaDB Issues”Container Not Starting
docker logs yak-mariadb --tail 50

Common causes: data directory permissions, port 3306 already in use, or corrupted InnoDB tablespace.
Connection Refused From Yak
Verify both containers are on the same Docker network:
docker network inspect yak

Both yak and yak-mariadb should appear. If not, re-run Ansible.
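If one container is missing and you need a stopgap before re-running Ansible, you can attach it manually (assuming the network is named yak as above):

docker network connect yak yak-mariadb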
Resetting The Database
docker stop yak-mariadb
rm -rf /home/yak/mariadb-data/*
docker start yak-mariadb

# Wait for init, then re-run migrations
docker exec yak php artisan migrate --force

Branch deployments
Section titled “Branch deployments”A PR didn’t get a preview URL
Check, in order:
- Is deployments_enabled true on the repository row? If not, the webhook is a no-op.
- Did DeployBranchJob fail? Look at the branch_deployments row. If status = failed, failure_reason has the cause. (See the sketch after this list.)
- Is SetupYakJob done for this repo? If current_template_version = 0, there is no versioned snapshot to clone from. Run setup first.
- Is the yak-deployments queue being processed? supervisorctl status on the Yak host.
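To read the row from the console, a tinker sketch — the BranchDeployment model name is a guess from the branch_deployments table, so adjust if the class is named differently:

docker exec yak php artisan tinker
# >>> App\Models\BranchDeployment::latest()->first(['id', 'status', 'failure_reason', 'template_version']);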
Preview shows a loading shim that never resolves
- Check the deployment detail page for the failure reason and which phase stalled.
- Check the Incus host: is incus list showing the container? Is it RUNNING or STOPPED?
- Look at the sandbox’s internal logs: incus exec deploy-<id> -- journalctl -xe, or run docker compose logs inside the sandbox.
Template snapshots accumulating on disk
Check the hourly deployments:gc-template-snapshots schedule: is it running? (php artisan schedule:list lists it.) Any recent warnings in storage/logs/laravel.log?
Manual cleanup of a specific version: incus snapshot delete yak-tpl-<repo>/ready-v<n>, but only if no branch_deployments row has template_version = n.
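To confirm no deployment still references a version before deleting its snapshot, query the table directly (assumes the database is named yak; adjust credentials as needed):

docker exec -it yak-mariadb mariadb -u root -p yak \
  -e "SELECT COUNT(*) FROM branch_deployments WHERE template_version = <n>;"
# A count of 0 means the snapshot for version n is safe to delete.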
“At capacity” responses on a wake attempt
Every running deployment was active within the last 5 minutes; eviction was refused. Either wait for someone to stop using a preview, bump YAK_DEPLOYMENTS_RUNNING_CAP, or incus stop one manually.
Collecting Diagnostics For A Bug Report
When filing an issue, collect:
# Version
docker exec yak git rev-parse HEAD

# Health check output
docker exec yak php artisan yak:healthcheck

# Recent logs
docker logs yak --tail 500 > yak.log

# The affected task's full dashboard page (screenshot or save as HTML)
# https://{your-domain}/tasks/{id}

File at https://github.com/geocodio/yak/issues/new/choose — include the version, health output, the task ID, and which channel was involved.