Troubleshooting

Common problems and how to diagnose them. Start with the health page at https://{your-domain}/health — it covers most of what could be wrong in one place.

Before diving into a specific problem, run these three commands. They catch the majority of issues in under a minute.

Terminal window
# 1. Health check — queue workers, Claude CLI, MCP servers, repo fetchability
docker exec yak php artisan yak:healthcheck
# 2. Recent application logs
docker logs yak --tail 100
# 3. Queue status
docker exec yak php artisan queue:monitor yak-claude,default

If the health check is green and logs are clean, the problem is usually in the external service (Slack app, Linear webhook, GitHub App installation) rather than Yak itself.

Symptoms: a PR opened on a repo with pr_review_enabled = true but no Yak review comment appears.

  1. GitHub webhook reaching Yak? Check webhook delivery in GitHub App settings. pull_request.opened and pull_request.synchronize must be subscribed.
  2. Repo active and enabled? Go to /repos/{id}/edit and confirm Active and PR Review are both on.
  3. PR a draft, or authored by yak-bot? Both are intentionally skipped. Convert the draft to ready-for-review to trigger.
  4. Task dispatched but failed? Look in /tasks?tab=reviews for a failed row. Common failure modes:
    • Sandbox checkout failure — the PR’s head wasn’t fetchable (force-pushed, branch deleted). Inspect the task activity log.
    • Claude output didn’t contain a valid JSON block — usually means Claude failed the review instead of producing findings. The raw output lives in the task’s result_summary.
  5. Path filters too aggressive? If the PR only touches excluded paths, the review is filtered out. Check pr_review_path_excludes on the repo.

Re-run manually with the Re-run review button on the task detail page. If the same task keeps failing, lower the prompt’s max_findings_per_review or check the tasks-review prompt at /prompts for custom edits.
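
If the dashboard row alone doesn’t explain the failure, you can pull the raw Claude output for that task directly. A minimal sketch, using the YakTask model and result_summary column referenced above, where $id is the failing task’s ID:

Terminal window
docker exec yak php artisan tinker
# >>> App\Models\YakTask::find($id)->result_summary;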

Symptoms: a task’s status on the dashboard is running and hasn’t moved for several minutes beyond the expected Claude Code duration (typically 2–10 minutes).

  1. Queue worker crashed mid-task. Supervisord will restart the worker, but the task row remains in running until a human resets it.
  2. Claude CLI hung on network I/O. MCP server call, docker-compose bringing up a service, or a very long test suite.
  3. Budget exhausted silently. The task hit --max-budget-usd and the job didn’t update status cleanly.
Terminal window
# Is the Claude queue worker actually running?
docker exec yak ps aux | grep "queue:work"
# Any recent errors in the logs?
docker logs yak --tail 200 | grep -i error
# What does the task's own timeline say? Look at its detail page:
# https://{your-domain}/tasks/{id}
# The Debug section at the bottom has session ID, cost, turns, full Claude output.
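
You can also query directly for tasks that have sat in running longer than the queue timeout. A minimal sketch, assuming the standard Laravel updated_at timestamp on the YakTask model:

Terminal window
docker exec yak php artisan tinker
# >>> App\Models\YakTask::where('status', 'running')->where('updated_at', '<', now()->subMinutes(15))->get(['id', 'updated_at']);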

If the worker is running but the task is stale, it will eventually time out on the 600s yak-claude queue timeout. To manually fail a stuck task:

Terminal window
docker exec yak php artisan tinker
# >>> App\Models\YakTask::find($id)->update(['status' => 'failed', 'error_log' => 'Manual reset']);

Manually resetting a task does not restart it. If you want Yak to try again, create a new task with the same description.

Symptoms: you add a new repo via the dashboard or Ansible, and the repo’s setup_status goes from running to failed.

  • Missing system dependencies — the repo’s dev environment needs a tool that isn’t in the Yak Docker image (e.g., a specific Node version, or pg_config).
  • Docker-in-Docker issues — the repo’s docker-compose.yml references services that need privileged mode or specific network configuration.
  • Private package registry auth — the repo uses a private npm/composer registry and the token isn’t configured. See Agent Environment Variables below.
  • CLAUDE.md missing or misleading — Claude couldn’t figure out the correct setup commands from the README alone.

To find the exact failure:

  1. Open the setup task’s detail page: https://{your-domain}/tasks/{setup_task_id}
  2. Expand the Debug section at the bottom
  3. Read the full Claude Code output — it reports exactly what command failed and why

The task detail page shows the Claude session as collapsible CI-style steps. Each turn includes the tool type, a description, duration, and the full terminal output.

Fix the underlying issue, then re-run:

Terminal window
docker exec yak php artisan yak:setup-repo {slug}

Or click Re-run Setup on the repo’s edit page.

If the issue is CLAUDE.md coverage, update the CLAUDE.md file in the target repo with the specific commands Yak got wrong. See Repositories → CLAUDE.md.

Symptoms: you @yak in Slack (or assign a Linear issue to Yak, or trigger a Sentry alert) and nothing happens. No task appears on the dashboard.

  1. Is the channel actually enabled? The webhook endpoint only exists if the channel’s credentials are present in the vault and Ansible has been re-run since they were added.

    Terminal window
    docker exec yak php artisan route:list | grep webhooks

    Only enabled channels appear. If /webhooks/slack is missing, Slack is not enabled.

  2. Can the external service reach your server? Test from the internet:

    Terminal window
    curl -I https://{your-domain}/webhooks/slack
    # Expect: 200 (OK) or 401 (signature required), NOT 404 or connection refused
  3. Check Caddy / nginx logs for the incoming request:

    Terminal window
    docker exec yak tail -f /var/log/caddy/access.log
  4. Check signing secrets match. Slack rejects with 401 if slack_signing_secret doesn’t match the app’s signing secret. Linear and Sentry do the same with their respective webhook secrets.

  5. UFW rules — port 443 must be open for inbound HTTPS:

    Terminal window
    ssh root@{server} ufw status
  • Slack — channel history scope is required for thread reply matching. If clarification replies don’t route to the right task, verify channels:history is in the bot scopes.
  • Linear — the webhook must subscribe to Agent session events so delegation events come through, and the OAuth connection must be active; if the linear_oauth_connections.installer_user_id column is null, re-authorize the app from Yak’s settings. The install requires workspace admin approval — a non-admin install will appear to succeed but agent session events will not arrive.
  • Sentry — alerts must be tagged yak-eligible. Alerts without the tag are ignored even if they hit the webhook.
  • GitHub — the App must be installed on the target org and must have webhook events for check_suite.completed and pull_request.closed.
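
If you suspect a signing-secret mismatch (step 4 above), you can recompute Slack’s expected signature from a captured delivery. A sketch using Slack’s documented v0 signing scheme; the file name and header values are placeholders you substitute from the captured request:

Terminal window
ts="<value of the X-Slack-Request-Timestamp header>"
body="$(cat captured_body.json)"
printf 'v0:%s:%s' "$ts" "$body" | openssl dgst -sha256 -hmac "<your slack_signing_secret>"
# The hex digest should equal the X-Slack-Signature header without its "v0=" prefix.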

Claude CLI errors

Terminal window
docker exec yak claude --version
docker exec yak claude -p "Say hello" --output-format json

If the first command fails, the CLI isn’t installed in the container — rebuild the Docker image. If the second command hangs or errors, the CLI is installed but can’t reach Anthropic — check network connectivity.
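
To separate a network problem from a CLI problem, check whether the container can reach Anthropic at all. A minimal sketch; any HTTP response proves DNS and TLS egress work, and the status code itself does not matter here:

Terminal window
docker exec yak curl -sI https://api.anthropic.com | head -1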

Symptoms: tasks fail with error_log mentioning auth, 401, or “token expired”. The health check posts an alert to Slack.

Claude Code authenticates via an interactive claude login session token stored in /home/yak/.claude/. When the token expires, every Claude Code job fails gracefully with an auth error — tasks are marked failed, a notification goes to the source, and the health check raises an alert.

Resolution:

Terminal window
docker exec -it yak claude login

Follow the browser-based OAuth flow. The new session token persists in the mounted volume and takes effect immediately. No restart is needed.
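
To confirm the credentials actually persisted in the mounted volume, list the directory (a sketch; the exact file names under /home/yak/.claude/ vary between CLI versions):

Terminal window
docker exec yak ls -la /home/yak/.claude/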

If Claude can’t reach an MCP server, start with the current configuration:

Terminal window
docker exec yak cat /home/yak/mcp-config.json

This shows which MCP servers are currently configured. If a server you expect is missing, the corresponding channel isn’t enabled — re-run Ansible with the channel’s credentials set.

If a server is configured but Claude can’t reach it, check:

  • Network connectivity from the Yak container
  • Credentials (GITHUB_PAT, LINEAR_API_KEY, SENTRY_AUTH_TOKEN) are set in the container env
  • The MCP server URL is not blocked by any firewall or proxy
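
Two quick checks from inside the container. A sketch; {mcp-server-url} is whatever URL mcp-config.json lists for the server in question:

Terminal window
# Are the credentials actually present in the environment?
docker exec yak printenv GITHUB_PAT LINEAR_API_KEY SENTRY_AUTH_TOKEN
# Can the container reach the server at all?
docker exec yak curl -sI {mcp-server-url} | head -1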

Symptoms: a task’s status is awaiting_ci and never advances even though CI actually ran.

  1. CI result not reaching Yak. For GitHub Actions, check Caddy/nginx logs for inbound requests to /webhooks/ci/github. For Drone, check the scheduler/default worker logs — CI results are polled by yak:poll-drone-ci every minute, not pushed.
  2. Wrong CI system. The repo’s ci_system must match which CI is authoritative for that repo. A GitHub Actions webhook for a repo configured as drone is silently dropped.
  3. Branch name mismatch. The task’s branch_name must match what was pushed. Look at the task’s Debug section for the actual branch name.
  4. GitHub App permissions. The App needs Checks: Read and Pull requests: Read & Write to receive check suite events and create PRs.
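
For step 1, you can confirm whether a GitHub Actions result ever arrived, or trigger the Drone poll immediately instead of waiting for the schedule. A sketch; it assumes yak:poll-drone-ci can be invoked directly like the other artisan commands on this page:

Terminal window
# Did anything hit the CI webhook?
docker exec yak grep /webhooks/ci/github /var/log/caddy/access.log | tail -20
# Poll Drone right now
docker exec yak php artisan yak:poll-drone-ci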

Retries are capped at one. After two failed attempts, the task is marked failed and a human has to take over. If you see this pattern repeatedly for a specific repo:

Open https://{your-domain}/costs. The cost dashboard shows daily totals, per-source breakdown, and the 30-day trend.

If routing-layer costs are climbing, the cause is almost always one of:

  • Sentry alert storm — a noisy alert rule is creating many tasks. Check the /tasks page filtered by source: sentry for a cluster of similar tasks. Tighten the alert rule in Sentry, or raise the min_events threshold in config/yak.php.
  • Slack bot being over-mentioned — a user is pasting long threads that hit @yak. Check the task list filtered by source: slack.
  • Failed webhook retries — some services retry webhook delivery on 5xx responses, creating duplicate routing calls. The UNIQUE(external_id, repo) constraint deduplicates tasks, but not routing analysis.
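
To see which source is generating the volume without clicking through the dashboard, a tinker query works. A sketch; it assumes the source attribute the task list filters on is a column on YakTask:

Terminal window
docker exec yak php artisan tinker
# >>> App\Models\YakTask::where('created_at', '>', now()->subDay())->get()->groupBy('source')->map->count();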

Implementation cost is covered by the Claude Max subscription, not billed per token. The cost dashboard shows Claude Code usage for monitoring but it does not affect your bill.

The EnsureDailyBudget job middleware fails new Claude Code jobs gracefully once the daily routing-layer budget is exceeded. Raise the limit via:

Terminal window
# In ansible/vault/secrets.yml or the Yak container env:
YAK_DAILY_BUDGET=100

Re-run Ansible or restart the container.

The /health page (and the scheduled yak:healthcheck command) runs the following checks every 15 minutes. For each check, the most likely cause when it fails:

  • Queue worker running: Supervisord crash — docker restart yak
  • Last task completed within N hours: no traffic, or workers hung on a stuck task
  • All repos fetchable: Git auth issue — re-run yak:refresh-repos manually, check SSH key
  • Claude CLI responding: see Claude CLI errors above
  • Claude CLI authenticated: token expired — run docker exec -it yak claude login
  • Enabled channel MCP servers reachable: network issue or external service down

Failed health checks post to Slack if the Slack channel is enabled. If Slack isn’t available, check the health page manually or set up external monitoring against /health.
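
If you do point an external monitor at /health, a simple probe from cron can work, provided your instance returns a failing HTTP status when a check fails (verify that against your setup first; this is only a sketch):

Terminal window
curl -fsS https://{your-domain}/health > /dev/null || echo "Yak health check failing"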

Agent Environment Variables

Symptoms: the agent can’t find a token that the repo needs at build time (e.g. npm install fails with 401 on a private registry).

Each task runs in its own Incus sandbox container. Sandboxes start from the base template snapshot and have no access to the yak app’s environment. Only variables explicitly listed in agent_extra_env are pushed into the sandbox.

Add the token to agent_extra_env in your Ansible vault:

agent_extra_env:
  NODE_AUTH_TOKEN: "ghp_..."

Redeploy and re-run the affected repo’s setup task — the new env vars are baked into the next snapshot.

To verify the var is set inside a running sandbox:

Terminal window
incus exec task-<id> -- printenv NODE_AUTH_TOKEN

Symptoms: setup or task execution fails with docker pull errors like unauthorized or denied: requested access to the resource is denied when the repo’s docker-compose.yml references images from a private registry (ghcr.io, a self-hosted registry, etc.).

Sandboxes start with no Docker authentication. Without credentials, the in-container Docker daemon can only pull public images. Base images shared across services typically live in private registries, so the first docker-compose up in setup fails and the whole snapshot never materialises.

Add registry credentials to docker_registries in your Ansible vault:

docker_registries:
  ghcr.io:
    username: "your-github-username"
    password: "ghp_..." # PAT with `read:packages` scope
  registry.example.com:
    username: "deploy"
    password: "..."

Redeploy. Ansible renders these into ~/.docker/config.json on the host, bind-mounts the file into the Yak container, and IncusSandboxManager pushes it into every sandbox at /home/yak/.docker/config.json. docker pull, docker-compose up, and BuildKit pick it up automatically — no docker login call needed inside the sandbox.

Re-run the affected repo’s setup task so the snapshot picks up the newly-cached images.

To verify the file landed inside a running sandbox:

Terminal window
incus exec task-<id> -- cat /home/yak/.docker/config.json

Credentials live on the host as 0600 and inside the sandbox as 0600 owned by yak:yak, so they aren’t readable by other processes the agent spawns.

If Yak is in a bad state and you can’t figure out what’s wrong:

Terminal window
# Stop the container gracefully (in-flight tasks finish first)
docker stop yak
# Start it again
docker start yak
# Verify
docker exec yak php artisan yak:healthcheck

MariaDB runs as a separate container (yak-mariadb) and is unaffected by Yak container restarts. Repo clones and the Claude session token persist via mounted volumes. Nothing is lost.

If supervisord itself is wedged, restart the container outright:

Terminal window
docker restart yak

This is safe — the queues are MariaDB-backed and any in-flight jobs will be retried on the next worker boot (with the caveat that tasks that were in the middle of a claude -p session may be left in running and need a manual reset, as described above).

If the yak-mariadb container won’t start, check its logs:

Terminal window
docker logs yak-mariadb --tail 50

Common causes: data directory permissions, port 3306 already in use, or corrupted InnoDB tablespace.
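
Two of those causes are quick to rule out from the host. A sketch; the data directory path is the one used in the reset commands below:

Terminal window
# Anything else already listening on 3306?
ss -ltnp | grep 3306
# Ownership and permissions of the MariaDB data directory
ls -ld /home/yak/mariadb-data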

If Yak can’t connect to MariaDB, verify both containers are on the same Docker network:

Terminal window
docker network inspect yak

Both yak and yak-mariadb should appear. If not, re-run Ansible.

As a last resort, wipe the data directory and re-initialise the database. This destroys all Yak data:

Terminal window
docker stop yak-mariadb
rm -rf /home/yak/mariadb-data/*
docker start yak-mariadb
# Wait for init, then re-run migrations
docker exec yak php artisan migrate --force

If a branch deployment never starts, check, in order:

  1. Is deployments_enabled true on the repository row? If not, the webhook is a no-op.
  2. Did DeployBranchJob fail? Look at the branch_deployments row. If status = failed, failure_reason has the cause.
  3. Is SetupYakJob done for this repo? If current_template_version = 0, there is no versioned snapshot to clone from. Run setup first.
  4. Is the yak-deployments queue being processed? supervisorctl status on the Yak host.
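
For steps 2 and 3, the latest branch_deployments row can be read from tinker. A sketch; the BranchDeployment model name is a guess derived from the table name:

Terminal window
docker exec yak php artisan tinker
# >>> App\Models\BranchDeployment::latest()->first(['status', 'failure_reason', 'template_version']);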

Preview shows a loading shim that never resolves

  • Check the deployment detail page for the failure reason and which phase stalled.
  • Check the Incus host: is incus list showing the container? Is it RUNNING or STOPPED?
  • Look at the sandbox’s internal logs: incus exec deploy-<id> -- journalctl -xe or run docker compose logs inside the sandbox.

If old template snapshots are piling up, check the hourly deployments:gc-template-snapshots schedule: is it running? (php artisan schedule:list lists it.) Any recent warnings in storage/logs/laravel.log?
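
A quick way to check both (a sketch; the log path is relative to the Laravel app root inside the container):

Terminal window
docker exec yak php artisan schedule:list | grep gc-template-snapshots
docker exec yak tail -50 storage/logs/laravel.log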

Manual cleanup of a specific version: incus snapshot delete yak-tpl-<repo>/ready-v<n>, but only if no branch_deployments row has template_version = n.

“At capacity” responses on a wake attempt


Every running deployment was active within the last 5 minutes; eviction was refused. Either wait for someone to stop using a preview, bump YAK_DEPLOYMENTS_RUNNING_CAP, or incus stop one manually.
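
To stop one manually, list the deployment sandboxes and pick one to stop. A sketch; the deploy-<id> container names follow the pattern used in the preview troubleshooting above:

Terminal window
incus list | grep deploy-
incus stop deploy-<id>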

When filing an issue, collect:

Terminal window
# Version
docker exec yak git rev-parse HEAD
# Health check output
docker exec yak php artisan yak:healthcheck
# Recent logs
docker logs yak --tail 500 > yak.log
# The affected task's full dashboard page (screenshot or save as HTML)
# https://{your-domain}/tasks/{id}

File at https://github.com/geocodio/yak/issues/new/choose — include the version, health output, the task ID, and which channel was involved.