Troubleshooting

Common problems and how to diagnose them. Start with the health page at https://{your-domain}/health — it covers most of what could be wrong in one place.

Before diving into a specific problem, run these three commands. They catch the majority of issues in under a minute.

Terminal window
# 1. Health check — queue workers, Claude CLI, MCP servers, repo fetchability
docker exec yak php artisan yak:healthcheck
# 2. Recent application logs
docker logs yak --tail 100
# 3. Queue status
docker exec yak php artisan queue:monitor yak-claude,default

If the health check is green and logs are clean, the problem is usually in the external service (Slack app, Linear webhook, GitHub App installation) rather than Yak itself.

Symptoms: a PR opened on a repo with pr_review_enabled = true but no Yak review comment appears.

  1. GitHub webhook reaching Yak? Check webhook delivery in GitHub App settings. pull_request.opened and pull_request.synchronize must be subscribed.
  2. Repo active and enabled? Go to /repos/{id}/edit and confirm Active and PR Review are both on.
  3. PR a draft, or authored by yak-bot? Both are intentionally skipped. Convert the draft to ready-for-review to trigger.
  4. Task dispatched but failed? Look in /tasks?tab=reviews for a failed row. Common failure modes:
    • Sandbox checkout failure — the PR’s head wasn’t fetchable (force-pushed, branch deleted). Inspect the task activity log.
    • Claude output didn’t contain a valid JSON block — usually means Claude failed the review instead of producing findings. The raw output lives in the task’s result_summary.
  5. Path filters too aggressive? If the PR only touches excluded paths, the review is filtered out. Check pr_review_path_excludes on the repo.

Re-run manually with the Re-run review button on the task detail page. If the same task keeps failing, lower the prompt’s max_findings_per_review or check the tasks-review prompt at /prompts for custom edits.
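
If the dashboard row alone doesn’t explain the failure, you can pull the raw Claude output for that task directly. A minimal sketch, using the YakTask model and result_summary column referenced above, where $id is the failing task’s ID:

Terminal window
docker exec yak php artisan tinker
# >>> App\Models\YakTask::find($id)->result_summary;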

Symptoms: a task’s status on the dashboard is running and hasn’t moved for several minutes beyond the expected Claude Code duration (typically 2–10 minutes).

  1. Queue worker crashed mid-task. Supervisord will restart the worker, but the task row remains in running until a human resets it.
  2. Claude CLI hung on network I/O. MCP server call, docker-compose bringing up a service, or a very long test suite.
  3. Budget exhausted silently. The task hit --max-budget-usd and the job didn’t update status cleanly.
Terminal window
# Is the Claude queue worker actually running?
docker exec yak ps aux | grep "queue:work"
# Any recent errors in the logs?
docker logs yak --tail 200 | grep -i error
# What does the task's own timeline say? Look at its detail page:
# https://{your-domain}/tasks/{id}
# The Debug section at the bottom has session ID, cost, turns, full Claude output.
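
You can also query directly for tasks that have sat in running longer than the queue timeout. A minimal sketch, assuming the standard Laravel updated_at timestamp on the YakTask model:

Terminal window
docker exec yak php artisan tinker
# >>> App\Models\YakTask::where('status', 'running')->where('updated_at', '<', now()->subMinutes(15))->get(['id', 'updated_at']);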

If the worker is running but the task is stale, it will eventually time out on the 600s yak-claude queue timeout. To manually fail a stuck task:

Terminal window
docker exec yak php artisan tinker
# >>> App\Models\YakTask::find($id)->update(['status' => 'failed', 'error_log' => 'Manual reset']);

Manually resetting a task does not restart it. If you want Yak to try again, create a new task with the same description.

Symptoms: you add a new repo via the dashboard or Ansible, and the repo’s setup_status goes from running to failed.

  • Missing system dependencies — the repo’s dev environment needs a tool that isn’t in the Yak Docker image (e.g., a specific Node version, or pg_config).
  • Docker-in-Docker issues — the repo’s docker-compose.yml references services that need privileged mode or specific network configuration.
  • Private package registry auth — the repo uses a private npm/composer registry and the token isn’t configured. See Agent Environment Variables below.
  • CLAUDE.md missing or misleading — Claude couldn’t figure out the correct setup commands from the README alone.

To find the exact failure:

  1. Open the setup task’s detail page: https://{your-domain}/tasks/{setup_task_id}
  2. Expand the Debug section at the bottom
  3. Read the full Claude Code output — it reports exactly what command failed and why

The task detail page shows the Claude session as collapsible CI-style steps. Each turn includes the tool type, a description, duration, and the full terminal output.

Fix the underlying issue, then re-run:

Terminal window
docker exec yak php artisan yak:setup-repo {slug}

Or click Re-run Setup on the repo’s edit page.

If the issue is CLAUDE.md coverage, update the CLAUDE.md file in the target repo with the specific commands Yak got wrong. See Repositories → CLAUDE.md.

Symptoms: you @yak in Slack (or assign a Linear issue to Yak, or trigger a Sentry alert) and nothing happens. No task appears on the dashboard.

  1. Is the channel actually enabled? The webhook endpoint only exists if the channel’s credentials are present in the vault and Ansible has been re-run since they were added.

    Terminal window
    docker exec yak php artisan route:list | grep webhooks

    Only enabled channels appear. If /webhooks/slack is missing, Slack is not enabled.

  2. Can the external service reach your server? Test from the internet:

    Terminal window
    curl -I https://{your-domain}/webhooks/slack
    # Expect: 200 (OK) or 401 (signature required), NOT 404 or connection refused
  3. Check Caddy / nginx logs for the incoming request:

    Terminal window
    docker exec yak tail -f /var/log/caddy/access.log
  4. Check signing secrets match. Slack rejects with 401 if slack_signing_secret doesn’t match the app’s signing secret. Linear and Sentry do the same with their respective webhook secrets.

  5. UFW rules — port 443 must be open for inbound HTTPS:

    Terminal window
    ssh root@{server} ufw status
  • Slack — channel history scope is required for thread reply matching. If clarification replies don’t route to the right task, verify channels:history is in the bot scopes.
  • Linear — the webhook must subscribe to Agent session events so delegation events come through, and the OAuth connection must be active; if the linear_oauth_connections.installer_user_id column is null, re-authorize the app from Yak’s settings. The install requires workspace admin approval — a non-admin install will appear to succeed but agent session events will not arrive.
  • Sentry — alerts must be tagged yak-eligible. Alerts without the tag are ignored even if they hit the webhook.
  • GitHub — the App must be installed on the target org and must have webhook events for check_suite.completed and pull_request.closed.
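
If you suspect a signing-secret mismatch (step 4 above), you can recompute Slack’s expected signature from a captured delivery. A sketch using Slack’s documented v0 signing scheme; the file name and header values are placeholders you substitute from the captured request:

Terminal window
ts="<value of the X-Slack-Request-Timestamp header>"
body="$(cat captured_body.json)"
printf 'v0:%s:%s' "$ts" "$body" | openssl dgst -sha256 -hmac "<your slack_signing_secret>"
# The hex digest should equal the X-Slack-Signature header without its "v0=" prefix.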

Claude CLI errors

Terminal window
docker exec yak claude --version
docker exec yak claude -p "Say hello" --output-format json

If the first command fails, the CLI isn’t installed in the container — rebuild the Docker image. If the second command hangs or errors, the CLI is installed but can’t reach Anthropic — check network connectivity.
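
To separate a network problem from a CLI problem, check whether the container can reach Anthropic at all. A minimal sketch; any HTTP response proves DNS and TLS egress work, and the status code itself does not matter here:

Terminal window
docker exec yak curl -sI https://api.anthropic.com | head -1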

Symptoms: tasks fail with error_log mentioning auth, 401, or “token expired”. The health check posts an alert to Slack.

Claude Code authenticates via an interactive claude login session token stored in /home/yak/.claude/. When the token expires, every Claude Code job fails gracefully with an auth error — tasks are marked failed, a notification goes to the source, and the health check raises an alert.

Resolution:

Terminal window
docker exec -it yak claude login

Follow the browser-based OAuth flow. The new session token persists in the mounted volume and takes effect immediately. No restart is needed.
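
To confirm the credentials actually persisted in the mounted volume, list the directory (a sketch; the exact file names under /home/yak/.claude/ vary between CLI versions):

Terminal window
docker exec yak ls -la /home/yak/.claude/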

If Claude can’t reach an MCP server, start with the current configuration:

Terminal window
docker exec yak cat /home/yak/mcp-config.json

This shows which MCP servers are currently configured. If a server you expect is missing, the corresponding channel isn’t enabled — re-run Ansible with the channel’s credentials set.

If a server is configured but Claude can’t reach it, check:

  • Network connectivity from the Yak container
  • Credentials (GITHUB_PAT, LINEAR_API_KEY, SENTRY_AUTH_TOKEN) are set in the container env
  • The MCP server URL is not blocked by any firewall or proxy
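
Two quick checks from inside the container. A sketch; {mcp-server-url} is whatever URL mcp-config.json lists for the server in question:

Terminal window
# Are the credentials actually present in the environment?
docker exec yak printenv GITHUB_PAT LINEAR_API_KEY SENTRY_AUTH_TOKEN
# Can the container reach the server at all?
docker exec yak curl -sI {mcp-server-url} | head -1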

Symptoms: a task’s status is awaiting_ci and never advances even though CI actually ran.

  1. CI result not reaching Yak. For GitHub Actions, check Caddy/nginx logs for inbound requests to /webhooks/ci/github. For Drone, check the scheduler/default worker logs — CI results are polled by yak:poll-drone-ci every minute, not pushed.
  2. Wrong CI system. The repo’s ci_system must match which CI is authoritative for that repo. A GitHub Actions webhook for a repo configured as drone is silently dropped.
  3. Branch name mismatch. The task’s branch_name must match what was pushed. Look at the task’s Debug section for the actual branch name.
  4. GitHub App permissions. The App needs Checks: Read and Pull requests: Read & Write to receive check suite events and create PRs.
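
For step 1, you can confirm whether a GitHub Actions result ever arrived, or trigger the Drone poll immediately instead of waiting for the schedule. A sketch; it assumes yak:poll-drone-ci can be invoked directly like the other artisan commands on this page:

Terminal window
# Did anything hit the CI webhook?
docker exec yak grep /webhooks/ci/github /var/log/caddy/access.log | tail -20
# Poll Drone right now
docker exec yak php artisan yak:poll-drone-ci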

Retries are capped at one. After two failed attempts, the task is marked failed and a human has to take over. If you see this pattern repeatedly for a specific repo:

Open https://{your-domain}/costs. The cost dashboard shows daily totals, per-source breakdown, and the 30-day trend.

If routing-layer costs are climbing, the cause is almost always one of:

  • Sentry alert storm — a noisy alert rule is creating many tasks. Check the /tasks page filtered by source: sentry for a cluster of similar tasks. Tighten the alert rule in Sentry, or raise the min_events threshold in config/yak.php.
  • Slack bot being over-mentioned — a user is pasting long threads that hit @yak. Check the task list filtered by source: slack.
  • Failed webhook retries — some services retry webhook delivery on 5xx responses, creating duplicate routing calls. The UNIQUE(external_id, repo) constraint deduplicates tasks, but not routing analysis.
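
To see which source is generating the volume without clicking through the dashboard, a tinker query works. A sketch; it assumes the source attribute the task list filters on is a column on YakTask:

Terminal window
docker exec yak php artisan tinker
# >>> App\Models\YakTask::where('created_at', '>', now()->subDay())->get()->groupBy('source')->map->count();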

Implementation cost is covered by the Claude Max subscription, not billed per token. The cost dashboard shows Claude Code usage for monitoring but it does not affect your bill.

The EnsureDailyBudget job middleware fails new Claude Code jobs gracefully once the daily routing-layer budget is exceeded. Raise the limit via:

Terminal window
# In ansible/vault/secrets.yml or the Yak container env:
YAK_DAILY_BUDGET=100

Re-run Ansible or restart the container.

The /health page (and the scheduled yak:healthcheck command) runs the following checks every 15 minutes. For each check, the most likely cause when it fails:

  • Queue worker running: Supervisord crash — docker restart yak
  • Last task completed within N hours: no traffic, or workers hung on a stuck task
  • All repos fetchable: Git auth issue — re-run yak:refresh-repos manually, check SSH key
  • Claude CLI responding: see Claude CLI errors above
  • Claude CLI authenticated: token expired — run docker exec -it yak claude login
  • Enabled channel MCP servers reachable: network issue or external service down

Failed health checks post to Slack if the Slack channel is enabled. If Slack isn’t available, check the health page manually or set up external monitoring against /health.
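
If you do point an external monitor at /health, a simple probe from cron can work, provided your instance returns a failing HTTP status when a check fails (verify that against your setup first; this is only a sketch):

Terminal window
curl -fsS https://{your-domain}/health > /dev/null || echo "Yak health check failing"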

Agent Environment Variables

Symptoms: the agent can’t find a token that the repo needs at build time (e.g. npm install fails with 401 on a private registry).

Each task runs in its own Incus sandbox container. Sandboxes start from the base template snapshot and have no access to the yak app’s environment. Only variables explicitly listed in agent_extra_env are pushed into the sandbox.

Add the token to agent_extra_env in your Ansible vault:

agent_extra_env:
  NODE_AUTH_TOKEN: "ghp_..."

Redeploy and re-run the affected repo’s setup task — the new env vars are baked into the next snapshot.

To verify the var is set inside a running sandbox:

Terminal window
incus exec task-<id> -- printenv NODE_AUTH_TOKEN

Symptoms: setup or task execution fails with docker pull errors like unauthorized or denied: requested access to the resource is denied when the repo’s docker-compose.yml references images from a private registry (ghcr.io, a self-hosted registry, etc.).

Sandboxes start with no Docker authentication. Without credentials, the in-container Docker daemon can only pull public images. Base images shared across services typically live in private registries, so the first docker-compose up in setup fails and the whole snapshot never materialises.

Add registry credentials to docker_registries in your Ansible vault:

docker_registries:
  ghcr.io:
    username: "your-github-username"
    password: "ghp_..." # PAT with `read:packages` scope
  registry.example.com:
    username: "deploy"
    password: "..."

Redeploy. Ansible renders these into ~/.docker/config.json on the host, bind-mounts the file into the Yak container, and IncusSandboxManager pushes it into every sandbox at /home/yak/.docker/config.json. docker pull, docker-compose up, and BuildKit pick it up automatically — no docker login call needed inside the sandbox.

Re-run the affected repo’s setup task so the snapshot picks up the newly-cached images.

To verify the file landed inside a running sandbox:

Terminal window
incus exec task-<id> -- cat /home/yak/.docker/config.json

Credentials live on the host as 0600 and inside the sandbox as 0600 owned by yak:yak, so they aren’t readable by other processes the agent spawns.

If Yak is in a bad state and you can’t figure out what’s wrong:

Terminal window
# Stop the container gracefully (in-flight tasks finish first)
docker stop yak
# Start it again
docker start yak
# Verify
docker exec yak php artisan yak:healthcheck

MariaDB runs as a separate container (yak-mariadb) and is unaffected by Yak container restarts. Repo clones and the Claude session token persist via mounted volumes. Nothing is lost.

If supervisord itself is wedged, restart the container outright:

Terminal window
docker restart yak

This is safe — the queues are MariaDB-backed and any in-flight jobs will be retried on the next worker boot (with the caveat that tasks that were in the middle of a claude -p session may be left in running and need a manual reset, as described above).

If the yak-mariadb container won’t start, check its logs:

Terminal window
docker logs yak-mariadb --tail 50

Common causes: data directory permissions, port 3306 already in use, or corrupted InnoDB tablespace.
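
Two of those causes are quick to rule out from the host. A sketch; the data directory path is the one used in the reset commands below:

Terminal window
# Anything else already listening on 3306?
ss -ltnp | grep 3306
# Ownership and permissions of the MariaDB data directory
ls -ld /home/yak/mariadb-data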

If Yak can’t connect to MariaDB, verify both containers are on the same Docker network:

Terminal window
docker network inspect yak

Both yak and yak-mariadb should appear. If not, re-run Ansible.

As a last resort, wipe the data directory and re-initialise the database. This destroys all Yak data:

Terminal window
docker stop yak-mariadb
rm -rf /home/yak/mariadb-data/*
docker start yak-mariadb
# Wait for init, then re-run migrations
docker exec yak php artisan migrate --force

If a branch deployment never starts, check, in order:

  1. Is deployments_enabled true on the repository row? If not, the webhook is a no-op.
  2. Did DeployBranchJob fail? Look at the branch_deployments row. If status = failed, failure_reason has the cause.
  3. Is SetupYakJob done for this repo? If current_template_version = 0, there is no versioned snapshot to clone from. Run setup first.
  4. Is the yak-deployments queue being processed? supervisorctl status on the Yak host.
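
For steps 2 and 3, the latest branch_deployments row can be read from tinker. A sketch; the BranchDeployment model name is a guess derived from the table name:

Terminal window
docker exec yak php artisan tinker
# >>> App\Models\BranchDeployment::latest()->first(['status', 'failure_reason', 'template_version']);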

Preview shows a loading shim that never resolves

  • Check the deployment detail page for the failure reason and which phase stalled.
  • Check the Incus host: is incus list showing the container? Is it RUNNING or STOPPED?
  • Look at the sandbox’s internal logs: incus exec deploy-<id> -- journalctl -xe or run docker compose logs inside the sandbox.

If old template snapshots are piling up, check the hourly deployments:gc-template-snapshots schedule: is it running? (php artisan schedule:list lists it.) Any recent warnings in storage/logs/laravel.log?
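
A quick way to check both (a sketch; the log path is relative to the Laravel app root inside the container):

Terminal window
docker exec yak php artisan schedule:list | grep gc-template-snapshots
docker exec yak tail -50 storage/logs/laravel.log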

Manual cleanup of a specific version: incus snapshot delete yak-tpl-<repo>/ready-v<n>, but only if no branch_deployments row has template_version = n.

“At capacity” responses on a wake attempt


Every running deployment was active within the last 5 minutes; eviction was refused. Either wait for someone to stop using a preview, bump YAK_DEPLOYMENTS_RUNNING_CAP, or incus stop one manually.
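
To stop one manually, list the deployment sandboxes and pick one to stop. A sketch; the deploy-<id> container names follow the pattern used in the preview troubleshooting above:

Terminal window
incus list | grep deploy-
incus stop deploy-<id>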

When filing an issue, collect:

Terminal window
# Version
docker exec yak git rev-parse HEAD
# Health check output
docker exec yak php artisan yak:healthcheck
# Recent logs
docker logs yak --tail 500 > yak.log
# The affected task's full dashboard page (screenshot or save as HTML)
# https://{your-domain}/tasks/{id}

File at https://github.com/geocodio/yak/issues/new/choose — include the version, health output, the task ID, and which channel was involved.