Troubleshooting Guide

Agent Can’t Connect to Server

Check that server_url includes a scheme (http:// or https://)
Verify that the firewall allows outbound traffic on the server port
Validate API key: curl -H "X-API-Key: $KEY" $URL/health
For remote agents, ensure the Agent API path is reachable through Traefik
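
The scheme check in the first item can be automated before the agent starts. The following is a minimal sketch using only the Go standard library; validateServerURL is a hypothetical helper, not a function from the agent codebase.

```go
package main

import (
	"fmt"
	"net/url"
)

// validateServerURL is a hypothetical pre-flight check: it rejects values
// that lack an http:// or https:// scheme, or that have no host.
func validateServerURL(raw string) error {
	u, err := url.Parse(raw)
	if err != nil {
		return fmt.Errorf("server_url %q is not a valid URL: %w", raw, err)
	}
	if u.Scheme != "http" && u.Scheme != "https" {
		return fmt.Errorf("server_url %q must start with http:// or https://", raw)
	}
	if u.Host == "" {
		return fmt.Errorf("server_url %q has no host", raw)
	}
	return nil
}
```

A bare host:port value can parse without error, so the sketch checks the scheme explicitly instead of relying on the parser to reject it.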

Jobs Stuck in “in_progress” or “queued”

TaskEngine handles timeouts automatically (30 seconds default)
Jobs auto-reset on agent disconnect via orphan detection
Jobs stuck in queued indicate no agent is claiming work — check agent heartbeats
Jobs stuck in queued for a specific organization: verify the organization has a healthy agent deployed. After the org-scoping fix, agents only claim jobs for their own organization. The Client UI dashboard and job pages show a warning when no active agents are detected.
Manual reset:

UPDATE gather_jobs SET status = 'queued' WHERE status = 'in_progress' AND updated_at < NOW() - INTERVAL '5 minutes';
UPDATE scrape_jobs SET status = 'queued' WHERE status = 'in_progress' AND updated_at < NOW() - INTERVAL '5 minutes';

GitHub Rate Limiting

Scenario	Rate Limit
Without token	60 requests/hour
With token	5,000 requests/hour

Set the GITHUB_TOKEN environment variable to increase limits.
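
To confirm which limit applies, GitHub's REST API exposes a rate_limit endpoint. The sketch below builds that request and reads the core limit from the response body. It is independent of how this project reads GITHUB_TOKEN; the helper names are illustrative only.

```go
package main

import (
	"encoding/json"
	"net/http"
)

// newRateLimitRequest builds a request for GitHub's rate-limit endpoint.
// An empty token produces an unauthenticated request.
func newRateLimitRequest(token string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, "https://api.github.com/rate_limit", nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Accept", "application/vnd.github+json")
	if token != "" {
		req.Header.Set("Authorization", "Bearer "+token)
	}
	return req, nil
}

// coreLimit extracts the core REST limit and remaining count from the
// JSON body returned by the endpoint.
func coreLimit(body []byte) (limit, remaining int, err error) {
	var payload struct {
		Resources struct {
			Core struct {
				Limit     int `json:"limit"`
				Remaining int `json:"remaining"`
			} `json:"core"`
		} `json:"resources"`
	}
	if err = json.Unmarshal(body, &payload); err != nil {
		return 0, 0, err
	}
	return payload.Resources.Core.Limit, payload.Resources.Core.Remaining, nil
}
```

Send the request with http.DefaultClient.Do and pass the body to coreLimit. A limit of 60 means the token is not being picked up.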

Migration Errors

Check Postgres is running: docker ps | grep postgres
Verify database credentials in .env file
Check migration files for syntax errors
Review server logs: docker compose logs api

UI Not Updating

Run bazel run //toolchains/templ -- generate after .templ file changes
Rebuild after CSS changes (CSS is handled by the Hugo build)
Clear browser cache
Check for templ compilation errors in the build output

Notification Delivery Failures

Check channel configuration (URL, headers, auth)
Review delivery logs: GET /notification-deliveries?status=failed
Check dead letter queue: GET /notification-deliveries/dead
Retry failed deliveries: POST /notification-deliveries/{id}/retry
Verify NOTIFICATION_ALLOW_PRIVATE_URLS if targeting internal endpoints
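
When a delivery to an internal endpoint fails, it helps to confirm that the channel URL really points at a non-public address. The sketch below assumes that NOTIFICATION_ALLOW_PRIVATE_URLS gates loopback, private, and link-local targets. The server's actual rules are not documented here, and targetsPrivateAddress is a hypothetical helper.

```go
package main

import (
	"fmt"
	"net/netip"
	"net/url"
)

// targetsPrivateAddress reports whether a channel URL uses a literal IP in a
// loopback, private, or link-local range. Hostnames other than "localhost"
// are not resolved; they return an error so the caller can resolve them first.
func targetsPrivateAddress(rawURL string) (bool, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return false, err
	}
	host := u.Hostname() // strips the port and any IPv6 brackets
	if host == "localhost" {
		return true, nil
	}
	addr, err := netip.ParseAddr(host)
	if err != nil {
		return false, fmt.Errorf("host %q is not an IP literal; resolve it first", host)
	}
	return addr.IsLoopback() || addr.IsPrivate() || addr.IsLinkLocalUnicast(), nil
}
```

If this returns true for a failing channel, check whether NOTIFICATION_ALLOW_PRIVATE_URLS is enabled on the service that sends notifications.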

Database Connection Issues

Verify PG_* environment variables are set correctly
Check PostgreSQL health: pg_isready -U $PG_USER -d $PG_DBNAME
Ensure PG_SSLMODE matches your PostgreSQL configuration
Check Docker network connectivity between services
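
The first check can be scripted. This sketch reports which variables are unset or blank. Only PG_USER, PG_DBNAME, and PG_SSLMODE are named in this guide, so the full list of required PG_* variables is an assumption the caller must supply.

```go
package main

import "strings"

// missingEnv returns the names in required whose values are empty or
// whitespace-only. getenv is injected so the check can be exercised without
// touching the real environment; pass os.Getenv in practice.
func missingEnv(required []string, getenv func(string) string) []string {
	var missing []string
	for _, name := range required {
		if strings.TrimSpace(getenv(name)) == "" {
			missing = append(missing, name)
		}
	}
	return missing
}
```

For example, missingEnv([]string{"PG_USER", "PG_DBNAME", "PG_SSLMODE"}, os.Getenv) returns the subset that still needs a value.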

Traefik Routing Issues

Public routes: Check docker/traefik/dynamic-public.yml
Internal routes: Check docker/traefik/dynamic-internal.yml
Verify service names match between compose and traefik configs
Check Traefik dashboard at localhost:8082 for routing errors

Redirected to /login After Successful Login

Symptom: The user logs in successfully (middleware passes) but is redirected back to /login when accessing protected pages.

This is almost never a cookie/browser/Traefik issue. If auth middleware (RequireAuth, RequireGlobalAdmin) passes, the redirect comes from the handler itself.

Debugging approach — read code, don’t add logging:

Identify which handler serves the page (check internal/handlers/ui.go)
Read the handler — most call getAPIClient(c, cfg) as their first action
Read getAPIClient() in internal/handlers/form_helpers.go — it returns nil, nil when no auth credentials match
If client == nil && err == nil, the handler redirects to login — this is the redirect source

Common causes:

Auth Mode	Why getAPIClient fails	Fix
Internal UI + Supabase	Has JWT session but `orgID = 0` (internal UI skips org). JWT path requires `orgID > 0`.	Use `NewAPIClientWithJWTInternal` when `cfg.IsInternal` and session exists
Client UI + API key	No API key cookie set (user used Supabase login)	Ensure JWT path runs before API key fallback
Client UI + Supabase, no org	User has no org memberships yet	RequireOnboarded should redirect to `/onboarding` before handler runs
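
The nil, nil contract and the first row of the table can be reproduced in a few lines. This is a simplified sketch, not the real code: session and apiClient are stand-in types, and the real getAPIClient takes (c, cfg) rather than a session value.

```go
package main

// session is a stand-in for the credentials a handler can see.
type session struct {
	JWT    string // Supabase session token, empty if none
	OrgID  int    // 0 when no organization is selected (e.g. Internal UI)
	APIKey string // API key cookie value, empty if none
}

// apiClient is a stand-in for the real API client.
type apiClient struct{ authMode string }

// getAPIClient mirrors the documented contract: (nil, nil) means
// "no auth credentials matched", which is not reported as an error.
func getAPIClient(s session) (*apiClient, error) {
	if s.JWT != "" && s.OrgID > 0 { // JWT path requires an organization
		return &apiClient{authMode: "jwt"}, nil
	}
	if s.APIKey != "" { // API key fallback
		return &apiClient{authMode: "api-key"}, nil
	}
	return nil, nil
}

// pageHandler returns the outcome a handler would produce for a session.
func pageHandler(s session) string {
	client, err := getAPIClient(s)
	if err != nil {
		return "error"
	}
	if client == nil {
		return "redirect:/login" // the redirect source described above
	}
	return "render:" + client.authMode
}
```

A session with a valid JWT but OrgID 0 falls through both branches and is redirected, even though authentication itself succeeded.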

Key files to check in order:

File	What to verify
`internal/handlers/form_helpers.go:getAPIClient()`	Which auth path runs? All paths covered?
`internal/services/api_client.go`	What headers does the API client send?
`pkg/api/middleware_auth.go`	Does the API server accept the request? Is org required?
`internal/config/config.go`	Is `IsInternal` set correctly? Are paths correct?

Rule of thumb: If auth middleware logs show success but the page redirects, the problem is in the handler-level API client creation, not in authentication itself. Always read the handler code before adding debug logging.

Maintenance Mode & Outage Page

Enabling Planned Maintenance

Set MAINTENANCE_MODE=true on the clientui service and restart:

# In go/planekeeper/docker/docker-compose.yml or .env
MAINTENANCE_MODE=true

docker compose -f docker/go/planekeeper/docker/docker-compose.yml restart clientui

All protected Client UI routes will show a “Scheduled Maintenance” page (HTTP 503). The Internal UI is unaffected.

Disabling Maintenance Mode

Set MAINTENANCE_MODE=false (or remove it) and restart:

MAINTENANCE_MODE=false
docker compose -f docker/go/planekeeper/docker/docker-compose.yml restart clientui

Maintenance Announcement Banner

Use the Internal UI Maintenance page to set an announcement banner that appears as a modal on all Client UI pages. The banner is cached, so changes can take up to 60 seconds to appear. Clear it from the same page after maintenance completes.

If the Internal UI is unavailable, set the banner directly via SQL:

INSERT INTO settings (organization_id, key, value, description)
VALUES (0, 'maintenance_banner', 'warn:Database upgrade in progress — estimated duration: 60 minutes', 'Public maintenance announcement')
ON CONFLICT (organization_id, key) DO UPDATE SET value = EXCLUDED.value;

To clear:

UPDATE settings SET value = '' WHERE organization_id = 0 AND key = 'maintenance_banner';

Automatic Outage Detection

The Client UI background health checker pings the API’s /health endpoint every 10 seconds. After 2 consecutive failures (20 seconds), protected routes automatically show a “Service Temporarily Unavailable” page. Recovery is immediate on the first successful health check.
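
The detection rule amounts to a consecutive-failure counter. The sketch below illustrates that logic only; the type and method names are hypothetical and the real health checker may be structured differently.

```go
package main

// outageDetector tracks consecutive health-check failures.
type outageDetector struct {
	threshold int // failures in a row before the outage page is shown
	failures  int // current run of consecutive failures
}

// observe records one health-check result and reports whether protected
// routes should show the outage page after this check.
func (d *outageDetector) observe(healthy bool) bool {
	if healthy {
		d.failures = 0 // recovery is immediate on the first success
		return false
	}
	d.failures++
	return d.failures >= d.threshold
}
```

With a threshold of 2 and a 10-second interval, a single failed check never shows the outage page, and one successful check resets the counter.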

Permissions Not Working As Expected

The permission system uses Casbin v3 with a SyncedCachedEnforcer that caches enforcement results in memory. Most permission issues trace to either stale cache entries or missing role assignments.

Diagnostic Queries

-- What roles does a user have in an org?
SELECT * FROM casbin_rule WHERE ptype = 'g' AND v0 = 'user:<USER_ID>' AND v2 = 'org:<ORG_ID>';

-- What policy rules exist for a role?
SELECT * FROM casbin_rule WHERE ptype = 'p' AND v0 = 'role:<ROLE_KEY>';

-- What deny rules exist?
SELECT * FROM casbin_rule WHERE ptype = 'p' AND v4 = 'deny';

-- Is a user a superadmin?
SELECT * FROM casbin_rule WHERE ptype = 'g2' AND v0 = 'user:<USER_ID>';

Deny Rules Not Being Enforced

Symptom: A custom role has Deny statements in its policy document, but the user can still perform the denied actions.

Cause 1 — Role not assigned to user (most common): Creating a role with deny rules does not automatically apply those rules to anyone. The user must be explicitly assigned to that role via the Members page. Casbin evaluates rules per-role — a deny rule on role:deny-custom has no effect on a user who only has role:admin.

The check: Verify the user has a g rule for the deny role:

SELECT * FROM casbin_rule WHERE ptype = 'g' AND v0 = 'user:<USER_ID>' AND v2 = 'org:<ORG_ID>';

If only role:admin appears, the deny role is not assigned. Assign it via the Members page — users can have multiple roles, and the effective permissions are the union of all roles with deny-wins semantics.

Cause 2 — Stale enforcement cache: The SyncedCachedEnforcer caches Enforce() results keyed by request parameters (sub, dom, obj, act). When new policy rules are added via AddPolicies(), Casbin attempts to invalidate related cache entries but uses policy-rule-format keys instead of request-parameter-format keys — so stale “allow” results persist until they expire.

The fix: Cache entries expire after 30 seconds (SetExpireTime in pkg/permissions/enforcer.go). Wait 30 seconds and retry. If running a version without the SetExpireTime fix, restart the API server to clear the in-memory cache.
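
Why a stale "allow" survives until expiry is easier to see with a toy cache. This sketch is not Casbin's cache implementation. It only models a result cache keyed by request parameters with a fixed time-to-live, and all names in it are illustrative.

```go
package main

import "time"

// defaultTTL matches the 30-second expiry described above.
const defaultTTL = 30 * time.Second

// at converts seconds into a timestamp so examples avoid wall-clock time.
func at(seconds int64) time.Time { return time.Unix(seconds, 0) }

type entry struct {
	allowed  bool
	storedAt time.Time
}

// resultCache stores enforcement decisions keyed by request parameters.
type resultCache struct {
	ttl     time.Duration
	entries map[string]entry
}

func newResultCache(ttl time.Duration) *resultCache {
	return &resultCache{ttl: ttl, entries: make(map[string]entry)}
}

// put records a decision made at time now.
func (c *resultCache) put(key string, allowed bool, now time.Time) {
	c.entries[key] = entry{allowed: allowed, storedAt: now}
}

// get returns the cached decision, or ok=false if it is missing or expired.
func (c *resultCache) get(key string, now time.Time) (allowed, ok bool) {
	e, found := c.entries[key]
	if !found || now.Sub(e.storedAt) >= c.ttl {
		delete(c.entries, key)
		return false, false
	}
	return e.allowed, true
}
```

An "allow" stored at 0s is still served at 29s even if a deny rule was added at 1s. From 30s onward the lookup misses and the request is evaluated against the current policy.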

New User Has No Permissions After Creating an Org

Symptom: A user creates a new organization during onboarding, but all UI elements are grayed out (permission-gated). Logging out and back in does not help.

Root cause: The Casbin SyncedCachedEnforcer has a Go embedding bug — StartAutoLoadPolicy (defined on SyncedEnforcer) dispatches LoadPolicy() to SyncedEnforcer.LoadPolicy() instead of SyncedCachedEnforcer.LoadPolicy(). This means auto-reload fetches new policies from the database but does NOT clear the enforcement result cache. The user’s initial “deny” result (from before the g rule was created during onboarding) was cached and never expired.
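
The embedding behavior can be reproduced without Casbin. In the sketch below the type names are stand-ins for SyncedEnforcer and SyncedCachedEnforcer, and autoLoadOnce stands in for one tick of StartAutoLoadPolicy. None of this is Casbin's actual code.

```go
package main

// syncedEnforcer stands in for the embedded base type.
type syncedEnforcer struct{ policyLoads int }

// LoadPolicy reloads policies from storage (counted here for illustration).
func (e *syncedEnforcer) LoadPolicy() { e.policyLoads++ }

// autoLoadOnce is defined on the base type, so its receiver is
// *syncedEnforcer and e.LoadPolicy() always resolves to the method above.
func (e *syncedEnforcer) autoLoadOnce() { e.LoadPolicy() }

// cachedEnforcer embeds the base type and defines its own LoadPolicy that
// also clears the result cache.
type cachedEnforcer struct {
	*syncedEnforcer
	cache map[string]bool
}

// LoadPolicy clears cached decisions, then reloads policies.
func (c *cachedEnforcer) LoadPolicy() {
	c.cache = map[string]bool{}
	c.syncedEnforcer.LoadPolicy()
}
```

Calling autoLoadOnce on a cachedEnforcer reloads policies but leaves the cache untouched, because Go embedding promotes methods without virtual dispatch. Only a direct call to the outer LoadPolicy clears the cache.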

The fix: pkg/permissions/enforcer.go sets e.SetExpireTime(autoReloadInterval) so cached results auto-expire every 30 seconds. After the auto-reload picks up the new g rule and the stale cache entry expires, the user gets their permissions.

If the issue persists: Restart the API server to force a full cache clear.

Permission Changes Take Up to 30 Seconds

This is expected behavior. The enforcer caches enforcement results for 30 seconds (autoReloadInterval). After changing role assignments or policy documents:

The database is updated immediately
The auto-reload picks up changes within 30 seconds
Stale cached enforcement results expire within 30 seconds

Worst case: up to 60 seconds for a policy change to take full effect (30s reload + 30s cache expiry).

Understanding the Casbin Model

The permission model uses domain-based RBAC with deny support:

Request:  (subject, domain, object, action)    e.g., ("user:6", "org:2", "gather-jobs", "list")
Policy:   (role, domain, object, action, effect) e.g., ("role:admin", "*", "*", "*", "allow")
Role:     g rules (org-scoped) and g2 rules (global/superadmin)
Effect:   allow if ANY rule allows AND NO rule denies (deny always wins)

How deny-wins works: When evaluating Enforce("user:6", "org:2", "gather-jobs", "list"):

Casbin finds ALL matching policy rules across ALL of the user’s assigned roles
If any matching rule has effect=deny, the result is denied — regardless of how many allow rules match
This means a user with both role:admin (allow all) and role:restricted (deny gather-jobs) will be denied gather-jobs access

Wildcard matching: * in policy rules matches any value:

role:admin, *, *, *, allow → allows everything in every org
role:custom, org:2, gather-jobs, *, deny → denies all gather-jobs actions in org 2
role:custom, org:2, *, get, allow → allows get on all resources in org 2
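
The deny-wins rule and wildcard matching above can be expressed as a small evaluator. This is a simplified sketch for reasoning about outcomes, not Casbin's matcher: it takes the user's roles as an explicit list instead of resolving g rules, and all names are illustrative.

```go
package main

// rule is one policy row: (role, domain, object, action, effect).
type rule struct {
	role, domain, object, action, effect string
}

// match treats "*" in a policy field as matching any request value.
func match(pattern, value string) bool {
	return pattern == "*" || pattern == value
}

// hasRole reports whether role is among the user's assigned roles.
func hasRole(roles []string, role string) bool {
	for _, r := range roles {
		if r == role {
			return true
		}
	}
	return false
}

// enforce allows a request if any matching rule allows it and no matching
// rule denies it, considering every role the user holds.
func enforce(roles []string, domain, object, action string, rules []rule) bool {
	allowed := false
	for _, r := range rules {
		if !hasRole(roles, r.role) || !match(r.domain, domain) ||
			!match(r.object, object) || !match(r.action, action) {
			continue
		}
		if r.effect == "deny" {
			return false // deny always wins
		}
		if r.effect == "allow" {
			allowed = true
		}
	}
	return allowed
}
```

A deny rule only takes effect when its role is in the user's role list, which is why an unassigned deny role changes nothing.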

Key Files

File	Purpose
`pkg/permissions/enforcer.go`	Enforcer creation, cache config, `GetEffectiveActions()`
`pkg/permissions/adapter.go`	Loads/saves Casbin rules from PostgreSQL
`pkg/permissions/middleware.go`	API route → permission check mapping
`pkg/permissions/actions.go`	Route-to-permission map, all registered actions
`pkg/api/roles_handlers.go`	Policy document ↔ Casbin rule conversion

Known Issues & Potential Conflicts

This section documents known gaps, partially implemented features, and items that may require attention.

Unimplemented Notification Channel Types

The database enum notification_channel_type defines four values: webhook, pagerduty, telegram, smtp. However, only webhook has an actual sender implementation in pkg/notifications/webhook/. The other three are schema placeholders. The UI restricts channel creation to webhook only, but channels with unsupported types can still be created via the API directly.

EarlyAccess Separate Schema

The earlyaccess binary uses its own PostgreSQL schema (earlyaccess.waitlist_requests) rather than the main application schema. This means:

The earlyaccess service manages its own migrations independently
The main server migration system does not touch the earlyaccess schema
Backup strategies need to account for both schemas

Version Transform Column

Migration 011 adds a version_transform column to scrape_jobs, but the feature is only partially documented. The column allows extracted version strings to be transformed before comparison; the available transform options still need to be documented for end users.

Alert Soft Delete

Migration 016 adds a resolved_at column to alerts for soft delete. Resolved alerts are not physically deleted; they remain in the database with a resolved_at timestamp. There is no automatic cleanup or archival of old resolved alerts, so the table keeps growing and may eventually need a retention or archival strategy.

Multi-Org Membership

Migrations 022-027 add multi-org support (organization_members, organization_invites, roles), which lets users belong to multiple organizations. The org_role enum supports the owner, admin, and member roles.

Channel Auth Requirements

Migration 031 adds a require_ack_auth column to notification channels. When enabled, acknowledgment callbacks require the user to be authenticated. This is a security enhancement but requires the webhook consumer to handle authentication redirects.

Alert Ack Metadata

Migration 030 adds metadata support to alert acknowledgments. This allows storing additional context (e.g., who acknowledged, reason) but the UI may not yet expose all metadata fields.

Manual Scrape Job Versions Not Triggering Alerts

If setting a version on a manual scrape job (parse_type: manual) via POST /scrape-jobs/{id}/set-version does not trigger alerts, check:

Alert config exists: An alert_config must link the manual scrape job to a gather job and a monitoring rule. Without this link, rule evaluation has nothing to evaluate.
Event bus is running: Setting a manual version publishes EventJobScrapeCompleted on the event bus. If the event bus subscriber is not running (e.g., server startup issue), rule evaluation will not trigger. Check server logs for event bus initialization.
Version transform is not destructive: The version_transform setting applies to manually entered versions before comparison. Verify the transform is not stripping the version string to empty (e.g., strip_v_lower on a version that does not start with v is safe, but check edge cases).
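
The last check can be done offline by applying the transform to the version you intend to set. The sketch below guesses at strip_v_lower from its name (lowercase, then drop one leading "v"). The real transform may behave differently, and both function names are hypothetical.

```go
package main

import "strings"

// stripVLower is an assumed reading of the strip_v_lower transform:
// trim whitespace, lowercase the input, and remove one leading "v".
func stripVLower(version string) string {
	return strings.TrimPrefix(strings.ToLower(strings.TrimSpace(version)), "v")
}

// isDestructive reports whether applying transform to version leaves an
// empty string, which would give rule evaluation nothing to compare.
func isDestructive(transform func(string) string, version string) bool {
	return strings.TrimSpace(transform(version)) == ""
}
```

Under this assumption a version without a leading v passes through unchanged apart from lowercasing, while an input consisting only of "v" is reduced to an empty string.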