Troubleshooting Guide
Agent Can’t Connect to Server
- Check server_url includes scheme (http:// or https://)
- Verify firewall allows outbound on server port
- Validate API key: curl -H "X-API-Key: $KEY" $URL/health
- For remote agents, ensure the Agent API path is reachable through Traefik
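The first and third checks above can also be scripted. The following is a minimal sketch, not the agent's actual code: validateServerURL and newHealthRequest are hypothetical helper names, and planekeeper.example.com is a placeholder host. It rejects a server_url without an http:// or https:// scheme and builds the same request as the curl command.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"strings"
)

// validateServerURL (hypothetical helper) reports an error when raw lacks an
// http:// or https:// scheme or has no host.
func validateServerURL(raw string) error {
	u, err := url.Parse(raw)
	if err != nil {
		return fmt.Errorf("parse server_url: %w", err)
	}
	if u.Scheme != "http" && u.Scheme != "https" {
		return fmt.Errorf("server_url %q must start with http:// or https://", raw)
	}
	if u.Host == "" {
		return fmt.Errorf("server_url %q has no host", raw)
	}
	return nil
}

// newHealthRequest (hypothetical helper) builds the equivalent of
// curl -H "X-API-Key: $KEY" $URL/health without sending it.
func newHealthRequest(serverURL, apiKey string) (*http.Request, error) {
	if err := validateServerURL(serverURL); err != nil {
		return nil, err
	}
	target := strings.TrimRight(serverURL, "/") + "/health"
	req, err := http.NewRequest(http.MethodGet, target, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("X-API-Key", apiKey)
	return req, nil
}

func main() {
	// Placeholder values; the second one is missing its scheme.
	for _, raw := range []string{"https://planekeeper.example.com", "planekeeper.example.com:8080"} {
		if err := validateServerURL(raw); err != nil {
			fmt.Println("invalid:", err)
			continue
		}
		fmt.Println("ok:", raw)
	}
}
```

Passing the request to http.DefaultClient.Do would perform the actual check; a non-200 response or a connection error then points at the API key or the firewall rather than the URL format.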
Jobs Stuck in “in_progress” or “queued”
- TaskEngine handles timeouts automatically (30 seconds default)
- Jobs auto-reset on agent disconnect via orphan detection
- Jobs stuck in queued indicate no agent is claiming work; check agent heartbeats
- Jobs stuck in queued for a specific organization: verify the organization has a healthy agent deployed. After the org-scoping fix, agents only claim jobs for their own organization. The Client UI dashboard and job pages show a warning when no active agents are detected.
- Manual reset:
UPDATE gather_jobs SET status = 'queued' WHERE status = 'in_progress' AND updated_at < NOW() - INTERVAL '5 minutes';
UPDATE scrape_jobs SET status = 'queued' WHERE status = 'in_progress' AND updated_at < NOW() - INTERVAL '5 minutes';
GitHub Rate Limiting
| Scenario | Rate Limit |
|---|---|
| Without token | 60 requests/hour |
| With token | 5,000 requests/hour |
Set the GITHUB_TOKEN environment variable to increase limits.
Migration Errors
- Check Postgres is running: docker ps | grep postgres
- Verify database credentials in .env file
- Check migration files for syntax errors
- Review server logs: docker compose logs api
UI Not Updating
- Run bazel run //toolchains/templ -- generate after .templ file changes
- CSS changes are handled by the Hugo build; no separate command is needed
- Clear browser cache
- Check for templ compilation errors in the build output
Notification Delivery Failures
- Check channel configuration (URL, headers, auth)
- Review delivery logs: GET /notification-deliveries?status=failed
- Check dead letter queue: GET /notification-deliveries/dead
- Retry failed deliveries: POST /notification-deliveries/{id}/retry
- Verify NOTIFICATION_ALLOW_PRIVATE_URLS if targeting internal endpoints
Database Connection Issues
- Verify PG_* environment variables are set correctly
- Check PostgreSQL health: pg_isready -U $PG_USER -d $PG_DBNAME
- Ensure PG_SSLMODE matches your PostgreSQL configuration
- Check Docker network connectivity between services
Traefik Routing Issues
- Public routes: Check docker/traefik/dynamic-public.yml
- Internal routes: Check docker/traefik/dynamic-internal.yml
- Verify service names match between compose and traefik configs
- Check Traefik dashboard at localhost:8082 for routing errors
Auth Redirects After Successful Login
Symptom: User logs in successfully (middleware passes) but gets redirected back to /login when accessing protected pages.
This is almost never a cookie/browser/Traefik issue. If auth middleware (RequireAuth, RequireGlobalAdmin) passes, the redirect comes from the handler itself.
Debugging approach — read code, don’t add logging:
- Identify which handler serves the page (check internal/handlers/ui.go)
- Read the handler: most call getAPIClient(c, cfg) as their first action
- Read getAPIClient() in internal/handlers/form_helpers.go: it returns nil, nil when no auth credentials match
- If client == nil && err == nil, the handler redirects to login; this is the redirect source
Common causes:
| Auth Mode | Why getAPIClient fails | Fix |
|---|---|---|
| Internal UI + Supabase | Has JWT session but orgID = 0 (internal UI skips org). JWT path requires orgID > 0. | Use NewAPIClientWithJWTInternal when cfg.IsInternal and session exists |
| Client UI + API key | No API key cookie set (user used Supabase login) | Ensure JWT path runs before API key fallback |
| Client UI + Supabase, no org | User has no org memberships yet | RequireOnboarded should redirect to /onboarding before handler runs |
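The table above is largely about the order in which auth paths are tried. The sketch below models that ordering as a pure function so the failure cases are easy to see. It is an illustration under assumptions, not the contents of getAPIClient(): authInputs, chooseAuthPath, and the returned labels are hypothetical, and only cfg.IsInternal, the orgID > 0 requirement, and NewAPIClientWithJWTInternal come from the table.

```go
package main

import "fmt"

// authInputs is a hypothetical summary of what a handler knows about the request.
type authInputs struct {
	IsInternal bool // mirrors cfg.IsInternal
	HasJWT     bool // a Supabase session exists
	OrgID      int  // 0 when no organization is selected
	HasAPIKey  bool // an API key cookie is present
}

// chooseAuthPath returns a label for the client that would be created, or ""
// for the "nil, nil" case that makes the handler redirect to /login.
func chooseAuthPath(in authInputs) string {
	switch {
	case in.HasJWT && in.IsInternal:
		// Internal UI skips org selection, so no orgID is required here
		// (the NewAPIClientWithJWTInternal fix from the table).
		return "jwt-internal"
	case in.HasJWT && in.OrgID > 0:
		// Regular JWT path: requires an organization.
		return "jwt"
	case in.HasAPIKey:
		// API key fallback runs only after the JWT paths.
		return "api-key"
	default:
		return ""
	}
}

func main() {
	cases := []authInputs{
		{IsInternal: true, HasJWT: true, OrgID: 0},
		{IsInternal: false, HasJWT: true, OrgID: 0},
		{IsInternal: false, HasJWT: true, OrgID: 2},
		{IsInternal: false, HasAPIKey: true},
	}
	for _, in := range cases {
		path := chooseAuthPath(in)
		if path == "" {
			path = "none (handler redirects to /login)"
		}
		fmt.Printf("%+v -> %s\n", in, path)
	}
}
```

The second case (Client UI, JWT session, no organization) is the one that falls through to the redirect, which is why RequireOnboarded needs to catch it before the handler runs.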
Key files to check in order:
| File | What to verify |
|---|---|
| internal/handlers/form_helpers.go:getAPIClient() | Which auth path runs? All paths covered? |
| internal/services/api_client.go | What headers does the API client send? |
| pkg/api/middleware_auth.go | Does the API server accept the request? Is org required? |
| internal/config/config.go | Is IsInternal set correctly? Are paths correct? |
Rule of thumb: If auth middleware logs show success but the page redirects, the problem is in the handler-level API client creation, not in authentication itself. Always read the handler code before adding debug logging.
Maintenance Mode & Outage Page
Enabling Planned Maintenance
Set MAINTENANCE_MODE=true on the clientui service and recreate the container so the new value is applied (docker compose restart does not pick up environment changes):
# In go/planekeeper/docker/docker-compose.yml or .env
MAINTENANCE_MODE=true
docker compose -f docker/go/planekeeper/docker/docker-compose.yml up -d clientui
All protected Client UI routes will show a “Scheduled Maintenance” page (HTTP 503). The Internal UI is unaffected.
Disabling Maintenance Mode
Set MAINTENANCE_MODE=false (or remove it) and recreate the container so the change is applied:
MAINTENANCE_MODE=false
docker compose -f docker/go/planekeeper/docker/docker-compose.yml up -d clientui
Advance Warning Banner
Use the Internal UI Maintenance page to set an announcement banner that appears as a modal on all Client UI pages. The banner is cached and appears within 60 seconds. Clear it from the same page after maintenance completes.
If the Internal UI is unavailable, set the banner directly via SQL:
INSERT INTO settings (organization_id, key, value, description)
VALUES (0, 'maintenance_banner', 'warn:Database upgrade in progress — estimated duration: 60 minutes', 'Public maintenance announcement')
ON CONFLICT (organization_id, key) DO UPDATE SET value = EXCLUDED.value;
To clear:
UPDATE settings SET value = '' WHERE organization_id = 0 AND key = 'maintenance_banner';
Automatic Outage Detection
The Client UI background health checker pings the API’s /health endpoint every 10 seconds. After 2 consecutive failures (20 seconds), protected routes automatically show a “Service Temporarily Unavailable” page. Recovery is immediate on the first successful health check.
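The detection rule described above (two consecutive failures trip the outage page, one success clears it) can be sketched as a small counter. This is an illustrative model only: outageTracker and Record are hypothetical names, and the real health checker may be structured differently.

```go
package main

import "fmt"

// outageTracker (hypothetical) counts consecutive failed health checks.
type outageTracker struct {
	threshold int // consecutive failures needed to show the outage page
	failures  int // current run of consecutive failures
}

// Record registers one health check result and reports whether protected
// routes should currently show the outage page.
func (t *outageTracker) Record(healthy bool) bool {
	if healthy {
		// Recovery is immediate: a single success resets the counter.
		t.failures = 0
		return false
	}
	t.failures++
	return t.failures >= t.threshold
}

func main() {
	t := &outageTracker{threshold: 2}
	// One entry per 10-second check interval.
	for _, healthy := range []bool{true, false, false, true} {
		fmt.Printf("healthy=%v unavailable=%v\n", healthy, t.Record(healthy))
	}
}
```

With a 10-second interval and a threshold of 2, a single failed check is tolerated, which keeps one dropped request from flipping every protected route to the outage page.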
Permissions Not Working As Expected
The permission system uses Casbin v3 with a SyncedCachedEnforcer that caches enforcement results in memory. Most permission issues trace to either stale cache entries or missing role assignments.
Diagnostic Queries
-- What roles does a user have in an org?
SELECT * FROM casbin_rule WHERE ptype = 'g' AND v0 = 'user:<USER_ID>' AND v2 = 'org:<ORG_ID>';
-- What policy rules exist for a role?
SELECT * FROM casbin_rule WHERE ptype = 'p' AND v0 = 'role:<ROLE_KEY>';
-- What deny rules exist?
SELECT * FROM casbin_rule WHERE ptype = 'p' AND v4 = 'deny';
-- Is a user a superadmin?
SELECT * FROM casbin_rule WHERE ptype = 'g2' AND v0 = 'user:<USER_ID>';
Deny Rules Not Being Enforced
Symptom: A custom role has Deny statements in its policy document, but the user can still perform the denied actions.
Cause 1 — Role not assigned to user (most common): Creating a role with deny rules does not automatically apply those rules to anyone. The user must be explicitly assigned to that role via the Members page. Casbin evaluates rules per-role — a deny rule on role:deny-custom has no effect on a user who only has role:admin.
The check: Verify the user has a g rule for the deny role:
SELECT * FROM casbin_rule WHERE ptype = 'g' AND v0 = 'user:<USER_ID>' AND v2 = 'org:<ORG_ID>';
If only role:admin appears, the deny role is not assigned. Assign it via the Members page — users can have multiple roles, and the effective permissions are the union of all roles with deny-wins semantics.
Cause 2 — Stale enforcement cache: The SyncedCachedEnforcer caches Enforce() results keyed by request parameters (sub, dom, obj, act). When new policy rules are added via AddPolicies(), Casbin attempts to invalidate related cache entries but uses policy-rule-format keys instead of request-parameter-format keys — so stale “allow” results persist until they expire.
The fix: Cache entries expire after 30 seconds (SetExpireTime in pkg/permissions/enforcer.go). Wait 30 seconds and retry. If running a version without the SetExpireTime fix, restart the API server to clear the in-memory cache.
New User Has No Permissions After Creating an Org
Symptom: A user creates a new organization during onboarding, but all UI elements are grayed out (permission-gated). Logging out and back in does not help.
Root cause: The Casbin SyncedCachedEnforcer has a Go embedding bug — StartAutoLoadPolicy (defined on SyncedEnforcer) dispatches LoadPolicy() to SyncedEnforcer.LoadPolicy() instead of SyncedCachedEnforcer.LoadPolicy(). This means auto-reload fetches new policies from the database but does NOT clear the enforcement result cache. The user’s initial “deny” result (from before the g rule was created during onboarding) was cached and never expired.
The fix: pkg/permissions/enforcer.go sets e.SetExpireTime(autoReloadInterval) so cached results auto-expire every 30 seconds. After the auto-reload picks up the new g rule and the stale cache entry expires, the user gets their permissions.
If the issue persists: Restart the API server to force a full cache clear.
Permission Changes Take Up to 30 Seconds
This is expected behavior. The enforcer caches enforcement results for 30 seconds (autoReloadInterval). After changing role assignments or policy documents:
- The database is updated immediately
- The auto-reload picks up changes within 30 seconds
- Stale cached enforcement results expire within 30 seconds
Worst case: up to 60 seconds for a policy change to take full effect (30s reload + 30s cache expiry).
Understanding the Casbin Model
The permission model uses domain-based RBAC with deny support:
Request: (subject, domain, object, action) e.g., ("user:6", "org:2", "gather-jobs", "list")
Policy: (role, domain, object, action, effect) e.g., ("role:admin", "*", "*", "*", "allow")
Role: g rules (org-scoped) and g2 rules (global/superadmin)
Effect: allow if ANY rule allows AND NO rule denies (deny always wins)
How deny-wins works: When evaluating Enforce("user:6", "org:2", "gather-jobs", "list"):
- Casbin finds ALL matching policy rules across ALL of the user’s assigned roles
- If any matching rule has effect=deny, the result is denied, regardless of how many allow rules match
- This means a user with both role:admin (allow all) and role:restricted (deny gather-jobs) will be denied gather-jobs access
Wildcard matching: * in policy rules matches any value:
- role:admin, *, *, *, allow → allows everything in every org
- role:custom, org:2, gather-jobs, *, deny → denies all gather-jobs actions in org 2
- role:custom, org:2, *, get, allow → allows get on all resources in org 2
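The deny-wins and wildcard rules can be modeled in a few lines. The sketch below is a simplified stand-in for the Casbin matcher, not the project's model file: the policy struct, matches, and enforce are hypothetical, role resolution through g rules is replaced by passing the user's roles directly, and the scrape-jobs object name is an assumption used only for contrast.

```go
package main

import "fmt"

// policy mirrors the (role, domain, object, action, effect) tuple.
type policy struct {
	Role, Domain, Object, Action, Effect string
}

// matches treats "*" in a policy field as matching any request value.
func matches(pattern, value string) bool {
	return pattern == "*" || pattern == value
}

// enforce applies deny-wins semantics: the request is allowed only if at least
// one matching rule allows it and no matching rule denies it.
func enforce(roles []string, policies []policy, domain, object, action string) bool {
	held := make(map[string]bool, len(roles))
	for _, r := range roles {
		held[r] = true
	}
	allowed := false
	for _, p := range policies {
		if !held[p.Role] || !matches(p.Domain, domain) ||
			!matches(p.Object, object) || !matches(p.Action, action) {
			continue
		}
		if p.Effect == "deny" {
			return false // a single deny overrides any number of allows
		}
		if p.Effect == "allow" {
			allowed = true
		}
	}
	return allowed
}

func main() {
	policies := []policy{
		{"role:admin", "*", "*", "*", "allow"},
		{"role:restricted", "org:2", "gather-jobs", "*", "deny"},
	}
	adminOnly := []string{"role:admin"}
	both := []string{"role:admin", "role:restricted"}

	fmt.Println(enforce(adminOnly, policies, "org:2", "gather-jobs", "list")) // true
	fmt.Println(enforce(both, policies, "org:2", "gather-jobs", "list"))      // false
	fmt.Println(enforce(both, policies, "org:2", "scrape-jobs", "list"))      // true
}
```

The first call also shows why an unassigned deny role has no effect: the deny rule is skipped entirely when the user does not hold role:restricted.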
Key Files
| File | Purpose |
|---|---|
| pkg/permissions/enforcer.go | Enforcer creation, cache config, GetEffectiveActions() |
| pkg/permissions/adapter.go | Loads/saves Casbin rules from PostgreSQL |
| pkg/permissions/middleware.go | API route → permission check mapping |
| pkg/permissions/actions.go | Route-to-permission map, all registered actions |
| pkg/api/roles_handlers.go | Policy document ↔ Casbin rule conversion |
Known Issues & Potential Conflicts
This section documents known gaps, partially implemented features, and items that may require attention.
Unimplemented Notification Channel Types
The database enum notification_channel_type defines four values: webhook, pagerduty, telegram, smtp. However, only webhook has an actual sender implementation in pkg/notifications/webhook/. The other three are schema placeholders. The UI restricts channel creation to webhook only, but channels with unsupported types can still be created via the API directly.
EarlyAccess Separate Schema
The earlyaccess binary uses its own PostgreSQL schema (earlyaccess.waitlist_requests) rather than the main application schema. This means:
- The earlyaccess service manages its own migrations independently
- The main server migration system does not touch the earlyaccess schema
- Backup strategies need to account for both schemas
Version Transform Column
Migration 011 adds a version_transform column to scrape_jobs, but the feature is only partially documented. The column allows transforming extracted version strings before comparison; the available transform options still need to be documented for end users.
Alert Soft Delete
Migration 016 adds a resolved_at column to alerts for soft delete. Resolved alerts are not physically deleted — they remain in the database with a resolved_at timestamp. There is no automatic cleanup/archival of old resolved alerts. Over time this may require a maintenance strategy.
Multi-Org Membership
Migrations 022-027 add full multi-org support (organization_members, organization_invites, roles). This is a significant feature that enables users to belong to multiple organizations. The org_role enum supports owner, admin, and member roles.
Channel Auth Requirements
Migration 031 adds a require_ack_auth column to notification channels. When enabled, acknowledgment callbacks require the user to be authenticated. This is a security enhancement but requires the webhook consumer to handle authentication redirects.
Alert Ack Metadata
Migration 030 adds metadata support to alert acknowledgments. This allows storing additional context (e.g., who acknowledged, reason) but the UI may not yet expose all metadata fields.
Manual Scrape Job Versions Not Triggering Alerts
If setting a version on a manual scrape job (parse_type: manual) via POST /scrape-jobs/{id}/set-version does not trigger alerts, check:
- Alert config exists: An alert_config must link the manual scrape job to a gather job and a monitoring rule. Without this link, rule evaluation has nothing to evaluate.
- Event bus is running: Setting a manual version publishes EventJobScrapeCompleted on the event bus. If the event bus subscriber is not running (e.g., server startup issue), rule evaluation will not trigger. Check server logs for event bus initialization.
- Version transform is not destructive: The version_transform setting applies to manually entered versions before comparison. Verify the transform is not stripping the version string to empty (e.g., strip_v_lower on a version that does not start with v is safe, but check edge cases).