Internal documentation — not for external distribution.

Troubleshooting Guide

Agent Can’t Connect to Server

  • Check server_url includes scheme (http:// or https://)
  • Verify firewall allows outbound on server port
  • Validate API key: curl -H "X-API-Key: $KEY" "$URL/health"
  • For remote agents, ensure the Agent API path is reachable through Traefik
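
The scheme check in the first bullet can be automated before an agent starts. The following is a minimal sketch, assuming the configured value is available as a plain string; checkServerURL and the example hostnames are hypothetical and not part of the agent code.

```go
package main

import (
	"fmt"
	"net/url"
)

// checkServerURL is a hypothetical helper: it rejects values that lack an
// http/https scheme or a host, the most common server_url mistakes.
func checkServerURL(serverURL string) error {
	u, err := url.Parse(serverURL)
	if err != nil {
		return fmt.Errorf("parse server_url: %w", err)
	}
	if u.Scheme != "http" && u.Scheme != "https" {
		return fmt.Errorf("server_url %q must start with http:// or https://", serverURL)
	}
	if u.Host == "" {
		return fmt.Errorf("server_url %q has no host", serverURL)
	}
	return nil
}

func main() {
	// Example values only; substitute the agent's configured server_url.
	for _, s := range []string{"https://server.example.com", "server.example.com:8080"} {
		fmt.Printf("%s -> %v\n", s, checkServerURL(s))
	}
}
```

A value such as server.example.com:8080 parses without error in Go (the host is read as the scheme), which is why the sketch checks the scheme explicitly instead of relying on the parse error alone.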

Jobs Stuck in “in_progress” or “queued”

  • TaskEngine handles timeouts automatically (30 seconds default)
  • Jobs auto-reset on agent disconnect via orphan detection
  • Jobs stuck in queued indicate no agent is claiming work — check agent heartbeats
  • Jobs stuck in queued for a specific organization: verify the organization has a healthy agent deployed. After the org-scoping fix, agents only claim jobs for their own organization. The Client UI dashboard and job pages show a warning when no active agents are detected.
  • Manual reset:
UPDATE gather_jobs SET status = 'queued' WHERE status = 'in_progress' AND updated_at < NOW() - INTERVAL '5 minutes';
UPDATE scrape_jobs SET status = 'queued' WHERE status = 'in_progress' AND updated_at < NOW() - INTERVAL '5 minutes';

GitHub Rate Limiting

Scenario | Rate Limit
Without token | 60 requests/hour
With token | 5,000 requests/hour

Set the GITHUB_TOKEN environment variable to increase limits.

Migration Errors

  • Check Postgres is running: docker ps | grep postgres
  • Verify database credentials in .env file
  • Check migration files for syntax errors
  • Review server logs: docker compose logs api

UI Not Updating

  • Run bazel run //toolchains/templ -- generate after .templ file changes
  • After CSS changes, rebuild (CSS is handled by the Hugo build)
  • Clear browser cache
  • Check for templ compilation errors in the build output

Notification Delivery Failures

  • Check channel configuration (URL, headers, auth)
  • Review delivery logs: GET /notification-deliveries?status=failed
  • Check dead letter queue: GET /notification-deliveries/dead
  • Retry failed deliveries: POST /notification-deliveries/{id}/retry
  • Verify NOTIFICATION_ALLOW_PRIVATE_URLS if targeting internal endpoints

Database Connection Issues

  • Verify PG_* environment variables are set correctly
  • Check PostgreSQL health: pg_isready -U $PG_USER -d $PG_DBNAME
  • Ensure PG_SSLMODE matches your PostgreSQL configuration
  • Check Docker network connectivity between services

Traefik Routing Issues

  • Public routes: Check docker/traefik/dynamic-public.yml
  • Internal routes: Check docker/traefik/dynamic-internal.yml
  • Verify service names match between compose and traefik configs
  • Check Traefik dashboard at localhost:8082 for routing errors

Auth Redirects After Successful Login

Symptom: User logs in successfully (middleware passes) but gets redirected back to /login when accessing protected pages.

This is almost never a cookie/browser/Traefik issue. If auth middleware (RequireAuth, RequireGlobalAdmin) passes, the redirect comes from the handler itself.

Debugging approach — read code, don’t add logging:

  1. Identify which handler serves the page (check internal/handlers/ui.go)
  2. Read the handler — most call getAPIClient(c, cfg) as their first action
  3. Read getAPIClient() in internal/handlers/form_helpers.go — it returns nil, nil when no auth credentials match
  4. If client == nil && err == nil, the handler redirects to login — this is the redirect source
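
The steps above can be sketched as follows. This is an illustrative model only: Session, APIClient, and handlePage are hypothetical stand-ins, and the real getAPIClient(c, cfg) takes a request context and config, not a session struct. The point is the (nil, nil) return, which the handler treats as "not logged in" even though middleware already passed.

```go
package main

import "fmt"

// APIClient and Session are hypothetical stand-ins for the real types.
type APIClient struct{ AuthHeader string }

type Session struct {
	JWT    string
	OrgID  int
	APIKey string
}

// getAPIClient models the contract described above: (nil, nil) means
// "no auth path matched", which is different from an error.
// Error paths are omitted in this sketch.
func getAPIClient(s Session) (*APIClient, error) {
	switch {
	case s.JWT != "" && s.OrgID > 0: // JWT path requires an org
		return &APIClient{AuthHeader: "Bearer " + s.JWT}, nil
	case s.APIKey != "": // API key fallback
		return &APIClient{AuthHeader: "X-API-Key: " + s.APIKey}, nil
	}
	return nil, nil
}

// handlePage mirrors the handler pattern: a nil client with a nil error
// becomes a redirect to the login page.
func handlePage(s Session) string {
	client, err := getAPIClient(s)
	if err != nil {
		return "error page"
	}
	if client == nil {
		return "redirect /login"
	}
	return "render page"
}

func main() {
	// Valid JWT session but orgID = 0, as in the Internal UI + Supabase case.
	fmt.Println(handlePage(Session{JWT: "token", OrgID: 0}))
	fmt.Println(handlePage(Session{JWT: "token", OrgID: 2}))
}
```

In this model the first call redirects despite a valid session, because no branch covers a JWT without an org.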

Common causes:

Auth Mode | Why getAPIClient fails | Fix
Internal UI + Supabase | Has JWT session but orgID = 0 (internal UI skips org). JWT path requires orgID > 0. | Use NewAPIClientWithJWTInternal when cfg.IsInternal and session exists
Client UI + API key | No API key cookie set (user used Supabase login) | Ensure JWT path runs before API key fallback
Client UI + Supabase, no org | User has no org memberships yet | RequireOnboarded should redirect to /onboarding before handler runs

Key files to check in order:

File | What to verify
internal/handlers/form_helpers.go:getAPIClient() | Which auth path runs? All paths covered?
internal/services/api_client.go | What headers does the API client send?
pkg/api/middleware_auth.go | Does the API server accept the request? Is org required?
internal/config/config.go | Is IsInternal set correctly? Are paths correct?

Rule of thumb: If auth middleware logs show success but the page redirects, the problem is in the handler-level API client creation, not in authentication itself. Always read the handler code before adding debug logging.


Maintenance Mode & Outage Page

Enabling Planned Maintenance

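Set MAINTENANCE_MODE=true on the clientui service and recreate the container. A plain docker compose restart reuses the existing container and does not pick up changed environment variables:

# In go/planekeeper/docker/docker-compose.yml or .env
MAINTENANCE_MODE=true

docker compose -f docker/go/planekeeper/docker/docker-compose.yml up -d clientui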

All protected Client UI routes will show a “Scheduled Maintenance” page (HTTP 503). The Internal UI is unaffected.

Disabling Maintenance Mode

Set MAINTENANCE_MODE=false (or remove it) and recreate the container so the change takes effect:

MAINTENANCE_MODE=false
docker compose -f docker/go/planekeeper/docker/docker-compose.yml up -d clientui

Advance Warning Banner

Use the Internal UI Maintenance page to set an announcement banner that appears as a modal on all Client UI pages. The banner is cached and appears within 60 seconds. Clear it from the same page after maintenance completes.

If the Internal UI is unavailable, set the banner directly via SQL:

INSERT INTO settings (organization_id, key, value, description)
VALUES (0, 'maintenance_banner', 'warn:Database upgrade in progress — estimated duration: 60 minutes', 'Public maintenance announcement')
ON CONFLICT (organization_id, key) DO UPDATE SET value = EXCLUDED.value;

To clear:

UPDATE settings SET value = '' WHERE organization_id = 0 AND key = 'maintenance_banner';

Automatic Outage Detection

The Client UI background health checker pings the API’s /health endpoint every 10 seconds. After 2 consecutive failures (20 seconds), protected routes automatically show a “Service Temporarily Unavailable” page. Recovery is immediate on the first successful health check.
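
The threshold behavior can be modeled with a consecutive-failure counter. This is a minimal sketch under the numbers stated above (10-second interval, 2 failures); healthTracker and record are hypothetical names, not the checker's actual implementation.

```go
package main

import "fmt"

// failureThreshold is the number of consecutive failed probes after which
// the outage page is shown (2 probes at a 10-second interval = 20 seconds).
const failureThreshold = 2

// healthTracker is a hypothetical model of the background checker's state.
type healthTracker struct {
	consecutiveFailures int
}

// record stores one probe result and reports whether the outage page
// should currently be shown. A single success resets the counter.
func (h *healthTracker) record(ok bool) bool {
	if ok {
		h.consecutiveFailures = 0
	} else {
		h.consecutiveFailures++
	}
	return h.consecutiveFailures >= failureThreshold
}

func main() {
	h := &healthTracker{}
	probes := []bool{true, false, false, true} // one probe every 10 seconds
	for i, ok := range probes {
		fmt.Printf("t=%2ds ok=%-5v outage=%v\n", i*10, ok, h.record(ok))
	}
}
```

A single failed probe never triggers the page, which avoids flapping on one dropped request, while the reset on success gives the immediate recovery described above.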


Permissions Not Working As Expected

The permission system uses Casbin v3 with a SyncedCachedEnforcer that caches enforcement results in memory. Most permission issues trace to either stale cache entries or missing role assignments.

Diagnostic Queries

-- What roles does a user have in an org?
SELECT * FROM casbin_rule WHERE ptype = 'g' AND v0 = 'user:<USER_ID>' AND v2 = 'org:<ORG_ID>';

-- What policy rules exist for a role?
SELECT * FROM casbin_rule WHERE ptype = 'p' AND v0 = 'role:<ROLE_KEY>';

-- What deny rules exist?
SELECT * FROM casbin_rule WHERE ptype = 'p' AND v4 = 'deny';

-- Is a user a superadmin?
SELECT * FROM casbin_rule WHERE ptype = 'g2' AND v0 = 'user:<USER_ID>';

Deny Rules Not Being Enforced

Symptom: A custom role has Deny statements in its policy document, but the user can still perform the denied actions.

Cause 1 — Role not assigned to user (most common): Creating a role with deny rules does not automatically apply those rules to anyone. The user must be explicitly assigned to that role via the Members page. Casbin evaluates rules per-role — a deny rule on role:deny-custom has no effect on a user who only has role:admin.

The check: Verify the user has a g rule for the deny role:

SELECT * FROM casbin_rule WHERE ptype = 'g' AND v0 = 'user:<USER_ID>' AND v2 = 'org:<ORG_ID>';

If only role:admin appears, the deny role is not assigned. Assign it via the Members page — users can have multiple roles, and the effective permissions are the union of all roles with deny-wins semantics.

Cause 2 — Stale enforcement cache: The SyncedCachedEnforcer caches Enforce() results keyed by request parameters (sub, dom, obj, act). When new policy rules are added via AddPolicies(), Casbin attempts to invalidate related cache entries but uses policy-rule-format keys instead of request-parameter-format keys — so stale “allow” results persist until they expire.

The fix: Cache entries expire after 30 seconds (SetExpireTime in pkg/permissions/enforcer.go). Wait 30 seconds and retry. If running a version without the SetExpireTime fix, restart the API server to clear the in-memory cache.
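
The key mismatch in Cause 2 can be illustrated with a toy cache. This sketch is not Casbin code: the "$$" key layout, the resultCache type, and the simulated clock in whole seconds are assumptions made for illustration. It shows only why an invalidation keyed by the policy rule leaves the request-keyed entry in place until its expiry time passes.

```go
package main

import (
	"fmt"
	"strings"
)

type entry struct {
	allowed   bool
	expiresAt int // seconds on a simulated clock
}

// resultCache is a toy model; key formats are assumed, not Casbin's own.
type resultCache struct {
	ttl     int
	entries map[string]entry
}

func key(parts ...string) string { return strings.Join(parts, "$$") }

func (c *resultCache) put(k string, allowed bool, now int) {
	c.entries[k] = entry{allowed: allowed, expiresAt: now + c.ttl}
}

// get returns the cached decision and whether a live entry exists.
func (c *resultCache) get(k string, now int) (allowed, hit bool) {
	e, ok := c.entries[k]
	if !ok || now >= e.expiresAt {
		return false, false
	}
	return e.allowed, true
}

func main() {
	c := &resultCache{ttl: 30, entries: map[string]entry{}}

	// "allow" cached at t=0, before the deny rule existed.
	reqKey := key("user:6", "org:2", "gather-jobs", "list")
	c.put(reqKey, true, 0)

	// Invalidation with a policy-format key matches nothing in the cache.
	delete(c.entries, key("role:restricted", "org:2", "gather-jobs", "*", "deny"))

	_, hit := c.get(reqKey, 10)
	fmt.Println("stale entry served at t=10s:", hit)
	_, hit = c.get(reqKey, 30)
	fmt.Println("stale entry served at t=30s:", hit)
}
```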

New User Has No Permissions After Creating an Org

Symptom: A user creates a new organization during onboarding, but all UI elements are grayed out (permission-gated). Logging out and back in does not help.

Root cause: The Casbin SyncedCachedEnforcer has a Go embedding bug — StartAutoLoadPolicy (defined on SyncedEnforcer) dispatches LoadPolicy() to SyncedEnforcer.LoadPolicy() instead of SyncedCachedEnforcer.LoadPolicy(). This means auto-reload fetches new policies from the database but does NOT clear the enforcement result cache. The user’s initial “deny” result (from before the g rule was created during onboarding) was cached and never expired.
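
The dispatch behavior is a general property of Go embedding and can be reproduced without Casbin. In the sketch below, base and cached are hypothetical stand-ins for SyncedEnforcer and SyncedCachedEnforcer, and ReloadOnce stands in for one tick of the auto-load loop. A method promoted from the embedded type keeps the embedded value as its receiver, so it never reaches the outer type's override.

```go
package main

import "fmt"

// base stands in for SyncedEnforcer (hypothetical, simplified).
type base struct{ loads int }

func (b *base) LoadPolicy() { b.loads++ }

// ReloadOnce is defined on base, so the call below always resolves to
// base.LoadPolicy, even when invoked through a type that embeds base.
func (b *base) ReloadOnce() { b.LoadPolicy() }

// cached stands in for SyncedCachedEnforcer (hypothetical, simplified).
type cached struct {
	*base
	cache map[string]bool
}

// LoadPolicy shadows base.LoadPolicy and additionally clears the cache.
func (c *cached) LoadPolicy() {
	c.base.LoadPolicy()
	c.cache = map[string]bool{}
}

func main() {
	c := &cached{
		base:  &base{},
		cache: map[string]bool{"user:6|org:2|gather-jobs|list": false},
	}

	c.ReloadOnce() // promoted method: the receiver is the embedded *base
	fmt.Println("entries after promoted reload:", len(c.cache))

	c.LoadPolicy() // direct call reaches the override and clears the cache
	fmt.Println("entries after direct LoadPolicy:", len(c.cache))
}
```

The first call reloads policies but leaves the cached "deny" in place; only the direct call clears it. Go has no virtual dispatch through embedding, so the outer type cannot fix this by overriding alone.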

The fix: pkg/permissions/enforcer.go sets e.SetExpireTime(autoReloadInterval) so cached results auto-expire every 30 seconds. After the auto-reload picks up the new g rule and the stale cache entry expires, the user gets their permissions.

If the issue persists: Restart the API server to force a full cache clear.

Permission Changes Take Up to 30 Seconds

This is expected behavior. The enforcer caches enforcement results for 30 seconds (autoReloadInterval). After changing role assignments or policy documents:

  1. The database is updated immediately
  2. The auto-reload picks up changes within 30 seconds
  3. Stale cached enforcement results expire within 30 seconds

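Worst case: up to 60 seconds for a policy change to take full effect (30s reload + 30s cache expiry), because a result cached just before the reload picks up the change can be served for another 30 seconds.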

Understanding the Casbin Model

The permission model uses domain-based RBAC with deny support:

Request:  (subject, domain, object, action)    e.g., ("user:6", "org:2", "gather-jobs", "list")
Policy:   (role, domain, object, action, effect) e.g., ("role:admin", "*", "*", "*", "allow")
Role:     g rules (org-scoped) and g2 rules (global/superadmin)
Effect:   allow if ANY rule allows AND NO rule denies (deny always wins)

How deny-wins works: When evaluating Enforce("user:6", "org:2", "gather-jobs", "list"):

  1. Casbin finds ALL matching policy rules across ALL of the user’s assigned roles
  2. If any matching rule has effect=deny, the result is denied — regardless of how many allow rules match
  3. This means a user with both role:admin (allow all) and role:restricted (deny gather-jobs) will be denied gather-jobs access

Wildcard matching: * in policy rules matches any value:

  • role:admin, *, *, *, allow → allows everything in every org
  • role:custom, org:2, gather-jobs, *, deny → denies all gather-jobs actions in org 2
  • role:custom, org:2, *, get, allow → allows get on all resources in org 2
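
The deny-wins and wildcard rules can be combined into a small evaluator. This is a simplified sketch, not the Casbin matcher: it assumes the user's roles for the org have already been resolved from the g rules, and the rule type, enforce function, and the scrape-jobs object name are illustrative assumptions.

```go
package main

import "fmt"

// rule mirrors the policy tuple (role, domain, object, action, effect).
type rule struct{ role, dom, obj, act, eft string }

// match treats "*" in a policy field as matching any request value.
func match(pattern, value string) bool { return pattern == "*" || pattern == value }

// enforce applies deny-wins: the request is allowed only if at least one
// matching rule allows it and no matching rule denies it.
func enforce(roles []string, rules []rule, dom, obj, act string) bool {
	held := map[string]bool{}
	for _, r := range roles {
		held[r] = true
	}
	allowed := false
	for _, p := range rules {
		if !held[p.role] || !match(p.dom, dom) || !match(p.obj, obj) || !match(p.act, act) {
			continue
		}
		if p.eft == "deny" {
			return false // a single matching deny overrides every allow
		}
		allowed = true
	}
	return allowed
}

func main() {
	rules := []rule{
		{"role:admin", "*", "*", "*", "allow"},
		{"role:restricted", "org:2", "gather-jobs", "*", "deny"},
	}
	both := []string{"role:admin", "role:restricted"}

	fmt.Println(enforce(both, rules, "org:2", "gather-jobs", "list")) // false
	fmt.Println(enforce(both, rules, "org:2", "scrape-jobs", "list")) // true
	fmt.Println(enforce([]string{"role:admin"}, rules, "org:2", "gather-jobs", "list")) // true
}
```

The third call shows the situation from "Deny Rules Not Being Enforced": the deny rule exists, but it has no effect on a user who does not hold the role that carries it.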

Key Files

File | Purpose
pkg/permissions/enforcer.go | Enforcer creation, cache config, GetEffectiveActions()
pkg/permissions/adapter.go | Loads/saves Casbin rules from PostgreSQL
pkg/permissions/middleware.go | API route → permission check mapping
pkg/permissions/actions.go | Route-to-permission map, all registered actions
pkg/api/roles_handlers.go | Policy document ↔ Casbin rule conversion

Known Issues & Potential Conflicts

This section documents known gaps, partially implemented features, and items that may require attention.

Unimplemented Notification Channel Types

The database enum notification_channel_type defines four values: webhook, pagerduty, telegram, smtp. However, only webhook has an actual sender implementation in pkg/notifications/webhook/. The other three are schema placeholders. The UI restricts channel creation to webhook only, but channels with unsupported types can still be created via the API directly.

EarlyAccess Separate Schema

The earlyaccess binary uses its own PostgreSQL schema (earlyaccess.waitlist_requests) rather than the main application schema. This means:

  • The earlyaccess service manages its own migrations independently
  • The main server migration system does not touch the earlyaccess schema
  • Backup strategies need to account for both schemas

Version Transform Column

Migration 011 adds a version_transform column to scrape_jobs, but the feature is only partially documented. The column allows transforming extracted version strings before comparison; the available transform options still need to be documented for end users.

Alert Soft Delete

Migration 016 adds a resolved_at column to alerts for soft delete. Resolved alerts are not physically deleted — they remain in the database with a resolved_at timestamp. There is no automatic cleanup/archival of old resolved alerts. Over time this may require a maintenance strategy.

Multi-Org Membership

Migrations 022-027 add full multi-org support (organization_members, organization_invites, roles). This is a significant feature that enables users to belong to multiple organizations. The org_role enum supports owner, admin, and member roles.

Channel Auth Requirements

Migration 031 adds a require_ack_auth column to notification channels. When enabled, acknowledgment callbacks require the user to be authenticated. This is a security enhancement but requires the webhook consumer to handle authentication redirects.

Alert Ack Metadata

Migration 030 adds metadata support to alert acknowledgments. This allows storing additional context (e.g., who acknowledged, reason) but the UI may not yet expose all metadata fields.

Manual Scrape Job Versions Not Triggering Alerts

If setting a version on a manual scrape job (parse_type: manual) via POST /scrape-jobs/{id}/set-version does not trigger alerts, check:

  1. Alert config exists: An alert_config must link the manual scrape job to a gather job and a monitoring rule. Without this link, rule evaluation has nothing to evaluate.
  2. Event bus is running: Setting a manual version publishes EventJobScrapeCompleted on the event bus. If the event bus subscriber is not running (e.g., server startup issue), rule evaluation will not trigger. Check server logs for event bus initialization.
  3. Version transform is not destructive: The version_transform setting applies to manually entered versions before comparison. Verify the transform does not reduce the version string to an empty value. For example, strip_v_lower on a version that does not start with v is safe, but check edge cases.