Internal documentation — not for external distribution.

Planekeeper Business Logic Documentation

A comprehensive guide to the business logic, workflows, and domain rules that power Planekeeper’s automated software version staleness detection system.


Table of Contents

  1. Executive Summary
  2. Gather Jobs Domain
  3. Scrape Jobs Domain
  4. Helm Sync Jobs Domain
  5. Task Execution System
  6. Background Scheduler
  7. Rules Engine
  8. Alert System
  9. Notification System
  10. Multi-Tenancy Model
  11. Agent Communication
  12. Metrics API
  13. Developer Tools
  14. Admin UI
  15. Open Questions & Ambiguities
  16. Regression Test Recommendations

1. Executive Summary

System Purpose

Planekeeper is an automated software version staleness detection system. It monitors deployed software versions against upstream releases to identify when software falls behind, applying configurable rules to generate severity-graded alerts.

Architecture Triad

┌─────────────────────────────────────────────────────────────────┐
│                           SERVER                                 │
│  ┌──────────┐  ┌──────────┐  ┌───────────┐  ┌────────────────┐  │
│  │ REST API │  │ Admin UI │  │ Heartbeat │  │ Orphan Cleanup │  │
│  │ (Fiber)  │  │ (templ)  │  │  Service  │  │    Service     │  │
│  └────┬─────┘  └────┬─────┘  └─────┬─────┘  └───────┬────────┘  │
│       └──────────────┴─────────────┴────────────────┘            │
│                            │                                     │
│                   ┌────────┴────────┐                            │
│                   │   PostgreSQL    │                            │
│                   │   (sqlc/goose)  │                            │
│                   └────────┬────────┘                            │
└────────────────────────────┼────────────────────────────────────┘
                             │
        ┌────────────────────┼────────────────────┐
        │                    │                    │
        ▼                    ▼                    ▼
┌───────────────┐    ┌───────────────┐    ┌───────────────┐
│    AGENT 1    │    │  TASKENGINE   │    │    AGENT N    │
│ ┌───────────┐ │    │ ┌───────────┐ │    │ ┌───────────┐ │
│ │ Gatherer  │ │    │ │ Scheduler │ │    │ │  Scraper  │ │
│ │ (GitHub)  │ │    │ │  (cron)   │ │    │ │ (git+YQ)  │ │
│ └───────────┘ │    │ ├───────────┤ │    │ └───────────┘ │
│ ┌───────────┐ │    │ │ Processor │ │    └───────────────┘
│ │  Scraper  │ │    │ │ (results) │ │
│ └───────────┘ │    │ └───────────┘ │
└───────────────┘    └───────────────┘

Server: Hosts REST API, Admin UI, runs database migrations, manages heartbeat detection and orphan cleanup.

Agent: Polls for tasks, executes gather (fetch upstream releases) and scrape (extract deployed versions) jobs.

TaskEngine: Handles job scheduling, timeout detection, cron-based rescheduling, and result processing.

Core Workflow

┌──────────────┐     ┌───────────────┐     ┌──────────────┐     ┌─────────────┐
│ GATHER JOBS  │────▶│  SCRAPE JOBS  │────▶│ RULES ENGINE │────▶│   ALERTS    │
│              │     │               │     │              │     │             │
│ Fetch latest │     │ Extract       │     │ Calculate    │     │ Create/     │
│ releases     │     │ deployed      │     │ behind-by    │     │ update      │
│ from GitHub/ │     │ versions      │     │ values       │     │ alerts      │
│ Helm repos   │     │ from repos    │     │              │     │             │
└──────────────┘     └───────────────┘     └──────────────┘     └─────────────┘
       │                    │                    │                    │
       ▼                    ▼                    ▼                    ▼
 upstream_releases   version_snapshots     alert_configs          alerts
    (table)              (table)             (table)             (table)

2. Gather Jobs Domain

Files: pkg/gatherer/github.go, pkg/gatherer/helm.go, pkg/gatherer/oci.go, pkg/gatherer/endoflife.go, pkg/api/gather_jobs_handler.go, pkg/storage/queries/gather_jobs.sql

Purpose

Fetch upstream releases from external sources to establish the “latest available version” baseline for staleness detection. ~173 global gather jobs are pre-seeded (migration 034) to provide immediate coverage of common infrastructure software.

Source Types

Source Type        Example Artifact Name                   Description
github_releases    github.com/kubernetes/kubernetes        GitHub releases via REST API
helm_repository    argoproj.github.io/argo-helm/argo-cd    Helm chart versions via index.yaml
oci_registry       docker.io/library/nginx                 OCI container image tags via registry API
endoflife_date     endoflife.date/python                   Product lifecycle data from endoflife.date

Input Parameters

Field            Required  Description
artifact_name    Yes       Source identifier (owner/repo for GitHub, repo-url/chart-name for Helm)
source_type      Yes       One of github_releases, helm_repository, oci_registry, or endoflife_date
name             No        Human-readable job name
schedule         No        Cron expression for recurring execution
tag_filter       No        Regex to filter which tags to include
version_regex    No        Regex with capture group to extract version from tag
source_config    No        JSONB for source-specific configuration
labels           No        JSONB key-value pairs for categorization (e.g., {"category": "kubernetes"})
is_global        No        Create as global resource (system keys only)

Business Rules

Tag Filtering:

  • Optional regex pattern applied to release tags
  • Non-matching tags are excluded from results
  • Invalid regex patterns are logged as warnings and fall back to no filtering

Version Extraction:

  1. If version_regex has capture group → use first captured group
  2. If version_regex matches but no capture group → use full match
  3. Fallback: Strip leading v or V prefix (GitHub) or return as-is (Helm)
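
The extraction precedence above can be sketched as follows. This is a minimal illustration; extractVersion is a hypothetical helper name, not the actual gatherer function:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// extractVersion applies the documented precedence: capture group first,
// then full match, then a v/V-prefix strip as the fallback.
func extractVersion(tag, versionRegex string) string {
	if versionRegex != "" {
		if re, err := regexp.Compile(versionRegex); err == nil {
			if m := re.FindStringSubmatch(tag); m != nil {
				if len(m) > 1 && m[1] != "" {
					return m[1] // rule 1: first capture group
				}
				return m[0] // rule 2: full match, no capture group
			}
		}
	}
	// rule 3: fallback — strip a leading v or V prefix
	return strings.TrimPrefix(strings.TrimPrefix(tag, "v"), "V")
}

func main() {
	fmt.Println(extractVersion("release-1.28.4", `release-(\d+\.\d+\.\d+)`)) // 1.28.4
	fmt.Println(extractVersion("v1.2.3", ""))                               // 1.2.3
}
```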

Pagination Limits:

  • GitHub: 100 releases per page, max 10 pages (1,000 releases total)
  • Helm: Max 1,000 versions per chart, 50 MB index file limit

Prerelease Detection (Helm): Versions containing these patterns (case-insensitive) are marked as prereleases:

  • -alpha, -beta, -rc, -dev, -preview, -snapshot
  • .alpha, .beta, .rc, .dev, .preview, .snapshot
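
A minimal sketch of this check (isHelmPrerelease is an illustrative name; the real gatherer may structure this differently):

```go
package main

import (
	"fmt"
	"strings"
)

// isHelmPrerelease mirrors the documented pattern list: a version is a
// prerelease if it contains any marker preceded by '-' or '.' (case-insensitive).
func isHelmPrerelease(version string) bool {
	v := strings.ToLower(version)
	for _, marker := range []string{"alpha", "beta", "rc", "dev", "preview", "snapshot"} {
		if strings.Contains(v, "-"+marker) || strings.Contains(v, "."+marker) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(isHelmPrerelease("1.2.3-RC1")) // true
	fmt.Println(isHelmPrerelease("1.2.3"))     // false
}
```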

State Machine

                    ┌──────────────────────────────────────────────────────────┐
                    │                                                          │
                    ▼                                                          │
┌─────────┐    ┌─────────┐    ┌────────┐    ┌───────────┐    ┌────────┐       │
│ CREATED │───▶│ PENDING │───▶│ QUEUED │───▶│IN_PROGRESS│───▶│COMPLETED│──────┘
└─────────┘    └─────────┘    └────────┘    └───────────┘    └────────┘  (reschedule
                    ▲              ▲              │                       if cron)
                    │              │              │
                    │              │   ┌──────────┴──────────────┐
                    │              │   │                         │
                    │              │   ▼                         ▼
                    │              │ attempts < max         attempts >= max
                    │              │   │                         │
                    │              │   │                         ▼
                    │              └───┘                    ┌────────┐
                    │        (retry with backoff)           │ FAILED │
                    │                                       └────────┘
                    │
                    └── (reschedule / stale reset / orphan reset)
  • pending: Job is scheduled for the future (next_run_at > now) or waiting for retry_after
  • queued: Job is eligible for agent pickup (next_run_at <= now, retry_after passed)
  • in_progress: Agent has claimed and is executing the job

Key Transitions:

Transition                Trigger                                  SQL Function
→ pending                 Job created (future schedule)            CreateGatherJob
→ queued                  Job created (immediate run)              TriggerGatherJobNow
pending → queued          Schedule time reached                    TransitionPendingToQueuedGatherJobs
queued → in_progress      Agent claims job                         ClaimNextQueuedGatherJob (SKIP LOCKED)
in_progress → completed   Agent reports success                    CompleteGatherJob
in_progress → queued      Agent reports failure (retries remain)   FailGatherJob
in_progress → failed      Agent reports failure (max attempts)     FailGatherJob
completed → pending       Cron schedule triggers                   RescheduleGatherJob
in_progress → pending     Job stale >1 hour                        ResetStaleGatherJobs
* → pending               Claimed agent disconnected               ResetOrphanedGatherJobs

Error Handling

Exponential Backoff:

retry_delay = 2^(attempts+1) seconds

Attempt 1: 4 seconds
Attempt 2: 8 seconds
Attempt 3: 16 seconds
...

GitHub-Specific Errors:

HTTP Status        Error Message
401                GitHub authentication failed: invalid or missing token
403 (rate limit)   GitHub rate limit exceeded: resets at <timestamp>
403 (other)        GitHub access forbidden: repository may be private
404                GitHub repository not found: <owner/repo>
429                GitHub secondary rate limit hit: retry after N seconds

Side Effects

  • On Success: Upsert releases into upstream_releases table (conflict on artifact_name + version)
  • On Reschedule: Calculate next run time using cron expression
  • On Config Update: Delete all existing releases (orphaned by config change)

Incremental Sync (GitHub)

GitHub gather jobs support incremental sync to reduce API calls and improve performance for repositories with many releases.

How it works:

  1. First run (full sync): The gatherer fetches all releases by paginating through the GitHub Releases API (up to 100 pages × 100 per page = 10,000 releases max). After completion, the processor writes sync state to gather_jobs.sync_state:

    • full_sync_complete: true if all releases were fetched without hitting the page limit
    • releases_fetched: total count from this run
    • last_full_sync_at: current timestamp (only set for full syncs)
  2. Subsequent runs (incremental): If full_sync_complete is true and last_full_sync_at is within the full sync interval (default: 2 weeks), the dispatcher injects an _incremental_since hint into the agent’s source config. The GitHub gatherer uses this date to stop pagination early — once it encounters a release older than or equal to the hint date, it stops fetching additional pages.

  3. Periodic full sync: When last_full_sync_at exceeds the configured GATHER_FULL_SYNC_INTERVAL (default: 336h / 2 weeks), the dispatcher omits the incremental hint, forcing a full re-fetch. This catches edited releases, backdated tags, or metadata changes that incremental mode would miss.

  4. Sync state reset: Sync state is reset to {} when:

    • A gather job’s configuration is updated (via PUT /gather-jobs/{id})
    • A gather job’s releases are cleared (via POST /gather-jobs/{id}/clear-releases)

Scope: Only github_releases source type supports incremental sync. Helm, OCI, and endoflife.date gatherers always do full fetches (they don’t paginate the same way).


3. Scrape Jobs Domain

Files: pkg/scraper/scraper.go, pkg/parser/*.go, pkg/api/scrape_jobs_handler.go, pkg/storage/queries/scrape_jobs.sql

Purpose

Extract deployed software versions from Git repositories by parsing configuration files (YAML, JSON, or text) to establish the “currently deployed version” for staleness comparison.

Input Parameters

Field              Required     Description
repository_url     Conditional  Git repository URL (HTTPS or SSH; not required for manual)
target_file        Conditional  Path to file containing version (e.g., Chart.yaml; not required for manual)
parse_type         Yes          Parser type: yq, jq, regex, or manual
parse_expression   Conditional  Parser-specific expression (not required for manual)
ref                Conditional  Git ref to checkout (default: main; not required for manual)
credential_name    No           Named credential for authentication
schedule           No           Cron expression for recurring execution
version_transform  No           Post-parse transformation
history_limit      No           Max version snapshots to retain (1-20)

Parser Implementations

YQ Parser (YAML)

Expression Format: Dot-notation with array indexing

.version                    → Simple field
.metadata.version           → Nested path
.dependencies[0].version    → Array access
.items[2].name              → Nested array access

Features:

  • Supports both map[string]any and map[any]any YAML structures
  • Uses gopkg.in/yaml.v3 for parsing
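
The dot-notation traversal can be sketched as below, assuming the YAML has already been decoded (e.g., by gopkg.in/yaml.v3) into map[string]any. lookupPath is an illustrative name, and this simplified version skips the map[any]any handling the real parser supports:

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// segment matches "name" or "name[index]" path components.
var segment = regexp.MustCompile(`^([^\[\]]+)(?:\[(\d+)\])?$`)

// lookupPath walks dot-notation (with array indexing) over a decoded document.
func lookupPath(doc any, expr string) (any, bool) {
	cur := doc
	for _, part := range strings.Split(strings.TrimPrefix(expr, "."), ".") {
		m := segment.FindStringSubmatch(part)
		if m == nil {
			return nil, false
		}
		obj, ok := cur.(map[string]any)
		if !ok {
			return nil, false
		}
		if cur, ok = obj[m[1]]; !ok {
			return nil, false
		}
		if m[2] != "" { // array index suffix
			idx, _ := strconv.Atoi(m[2])
			arr, ok := cur.([]any)
			if !ok || idx >= len(arr) {
				return nil, false
			}
			cur = arr[idx]
		}
	}
	return cur, true
}

func main() {
	doc := map[string]any{
		"dependencies": []any{map[string]any{"version": "1.4.0"}},
	}
	v, _ := lookupPath(doc, ".dependencies[0].version")
	fmt.Println(v) // 1.4.0
}
```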

JQ Parser (JSON)

Expression Format: Dot-notation (no array indexing)

.version                    → Simple field
.dependencies.react         → Nested path

Limitations: Does not support array indexing (use YQ for JSON with arrays)

Regex Parser (Text)

Expression Format: Go regex pattern

version:\s*([\d.]+)         → Captures version after "version:"
^v(\d+\.\d+\.\d+)$          → Captures semver from full line

Behavior: Returns first capture group if present, otherwise full match
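
The capture-group-else-full-match behavior can be sketched as (parseRegex is an illustrative name):

```go
package main

import (
	"fmt"
	"regexp"
)

// parseRegex returns the first capture group when present, otherwise the
// full match — the documented behavior of the regex parser.
func parseRegex(content, expr string) (string, error) {
	re, err := regexp.Compile(expr)
	if err != nil {
		return "", err
	}
	m := re.FindStringSubmatch(content)
	if m == nil {
		return "", fmt.Errorf("no match for %q", expr)
	}
	if len(m) > 1 {
		return m[1], nil
	}
	return m[0], nil
}

func main() {
	v, _ := parseRegex("version: 1.2.3", `version:\s*([\d.]+)`)
	fmt.Println(v) // 1.2.3
}
```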

Manual Entry (No Agent)

Manual parse type allows users to enter version strings directly via the API or UI, bypassing the agent-based scraping pipeline entirely.

Behavior:

  • No repository is cloned; no file is parsed
  • Placeholder values are stored: repository_url="manual://", target_file="-", parse_expression="-"
  • Jobs are created with status='completed' (never pending — agents never pick them up)
  • Version is set via POST /scrape-jobs/{id}/set-version with a version string in the request body
  • Version transforms are applied to the manually entered version before storage
  • A version snapshot is created and rule evaluation is triggered, identical to agent-completed jobs

Use Cases:

  • Demo and testing environments (no agent infrastructure needed)
  • Manual version tracking for systems that can’t be scraped
  • Quick setup to exercise the full alert pipeline (rules, alerts, notifications)

Key Differences from Agent-Based Jobs:

Aspect            Agent-Based                              Manual
Task claiming     Polled by agents via SKIP LOCKED         Never enters task queue
Initial status    pending                                  completed
"Run Now" button  Triggers agent execution                 Not available
Version entry     Automatic (parser output)                Via set-version API/UI
Required fields   repo URL, ref, target file, expression   Parse type only

Version Transforms

Transform       Example Input   Output
none            1.2.3           1.2.3
add_v_lower     1.2.3           v1.2.3
add_v_upper     1.2.3           V1.2.3
strip_v_lower   v1.2.3          1.2.3
strip_v_upper   V1.2.3          1.2.3
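
The transform table maps to a simple switch (applyTransform is an illustrative name; in this sketch "none" and unknown transforms both leave the version untouched):

```go
package main

import (
	"fmt"
	"strings"
)

// applyTransform implements the version transform table.
func applyTransform(version, transform string) string {
	switch transform {
	case "add_v_lower":
		return "v" + version
	case "add_v_upper":
		return "V" + version
	case "strip_v_lower":
		return strings.TrimPrefix(version, "v")
	case "strip_v_upper":
		return strings.TrimPrefix(version, "V")
	default: // "none" or unrecognized
		return version
	}
}

func main() {
	fmt.Println(applyTransform("1.2.3", "add_v_lower")) // v1.2.3
}
```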

Credential-Aware Assignment

Jobs requiring credentials are only assigned to agents that have those credentials:

-- ClaimNextPendingScrapeJobWithCredentials
WHERE (
    credential_name IS NULL
    OR credential_name = ANY(available_credentials::VARCHAR[])
)

Flow:

  1. Agent declares available_credentials during heartbeat
  2. Job poll includes credential list
  3. Server only assigns jobs where agent has required credential or job needs no credential
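
The per-job eligibility check behind that SQL predicate reduces to the following sketch (eligible is an illustrative name; the real filtering happens in SQL, not in Go):

```go
package main

import "fmt"

// eligible mirrors the SQL predicate above: a job with no credential is
// claimable by any agent; otherwise the agent must have declared the
// credential during its heartbeat.
func eligible(jobCredential string, agentCredentials []string) bool {
	if jobCredential == "" {
		return true
	}
	for _, c := range agentCredentials {
		if c == jobCredential {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(eligible("github_token", []string{"github_token", "ssh_key"})) // true
	fmt.Println(eligible("gitlab_token", []string{"github_token"}))            // false
}
```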

Git Clone Optimization

All git clone operations use shallow clones for performance and reduced disk usage:

Setting        Value   Purpose
Depth          1       Only fetch the latest commit (no history)
SingleBranch   true    Only fetch the requested branch/tag

Benefits:

  • Significantly faster clone times, especially for large repositories
  • Reduced disk usage on agents (no commit history stored)
  • Lower bandwidth consumption from git servers
  • Minimal data transfer for version extraction (only need file content, not history)

Implementation (pkg/git/cloner.go):

cloneOpts := &git.CloneOptions{
    URL:           url,
    Depth:         1,           // Shallow clone
    SingleBranch:  true,        // Only requested ref
    ReferenceName: ref,
}

Temporary Directory Lifecycle:

  1. Create temp directory: planekeeper-clone-*
  2. Shallow clone repository
  3. Read target file for version extraction
  4. Delete temp directory (cleanup)

State Machine

Identical to Gather Jobs (see Section 2), with these additional behaviors:

  • On Success: Create version_snapshot record, trigger async rule evaluation
  • History Limit: Older snapshots beyond limit are automatically deleted
  • Manual Jobs: Start and remain in completed status. Version updates via set-version create snapshots and trigger rule evaluation without changing job status

Side Effects

  • On Success:
    • Insert version_snapshot with version, raw_content, commit_sha, metadata
    • Trigger ruleEvaluator.EvaluateForOrg() asynchronously
  • On History Limit Exceeded: Delete oldest snapshots via orphan cleanup service

Version Snapshot Storage

Files: pkg/storage/queries/version_snapshots.sql, pkg/api/tasks_handler.go:661-706

Full History Tracking

Every scrape creates a new version snapshot record, regardless of whether the version has been seen before. This enables proper tracking of version changes including rollbacks.

Example Scenario:

Time T1: Scrape finds version 6.11.1  → Snapshot #1 created
Time T2: Scrape finds version 9.3.7  → Snapshot #2 created
Time T3: Rollback to 6.11.1          → Snapshot #3 created (NOT a duplicate)

Key Design Decisions:

  1. No Unique Constraint on Version: The table allows multiple snapshots with the same version for the same job. This is intentional—each scrape represents a point-in-time observation.

  2. ID-Based Ordering: The “latest” version snapshot is determined by ORDER BY id DESC, not by timestamp. Since IDs are auto-incrementing, this guarantees the most recently inserted row is always returned, regardless of timestamp precision issues.

  3. History Retention: The history_limit setting on scrape jobs controls how many snapshots to retain. Older snapshots beyond this limit are automatically purged.

Database Schema

-- No unique constraint on version - allows duplicate versions for rollback tracking
CREATE TABLE version_snapshots (
    id BIGINT PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
    organization_id BIGINT NOT NULL,
    scrape_job_id BIGINT,
    repository_url VARCHAR(2048) NOT NULL,
    ref VARCHAR(256),
    target_file VARCHAR(1024) NOT NULL,
    version VARCHAR(256) NOT NULL,
    raw_content TEXT,
    metadata JSONB DEFAULT '{}',
    discovered_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Index for efficient latest-version lookups
CREATE INDEX idx_version_snapshots_latest
ON version_snapshots(scrape_job_id, id DESC);

Query Pattern

-- GetLatestVersionSnapshot: Returns most recently inserted snapshot
SELECT version, discovered_at
FROM version_snapshots
WHERE scrape_job_id = $1
ORDER BY id DESC  -- NOT discovered_at - ID is more reliable
LIMIT 1;

Why ID instead of timestamp?

  • Timestamps can have precision issues (same second, microsecond truncation)
  • Multiple inserts in quick succession may get identical timestamps
  • Auto-incrementing IDs guarantee insertion order
  • Avoids race conditions during concurrent operations

Rule Evaluation Timing

When a scrape job completes (agent-based):

1. Agent sends task completion to API
2. API creates version snapshot (INSERT)
3. API marks job as completed
4. API triggers rule evaluation asynchronously
   └── EvaluateForOrg() spawns goroutine internally
5. Rule engine queries for latest version snapshot
   └── Uses id DESC to get most recently inserted row

When a manual version is set:

1. User calls POST /scrape-jobs/{id}/set-version with version string
2. API applies version transform (if configured)
3. API creates version snapshot using shared recordVersionSnapshot helper
4. API triggers rule evaluation asynchronously (same as agent flow)

The rule evaluation runs asynchronously (via goroutine) but uses a fresh database connection that sees the committed snapshot data.


4. Helm Sync Jobs Domain

Files: pkg/api/helm_sync_handlers.go, pkg/taskengine/processor.go:293-427

Purpose

Automatically discover Helm charts from a repository and create/manage child gather jobs for each chart, enabling bulk monitoring of Helm-based deployments.

Input Parameters

Field                  Required  Description
repository_url         Yes       Helm repository URL
chart_filter           No        Regex to filter charts by name
default_schedule       No        Cron schedule inherited by child jobs
default_tag_filter     No        Tag filter inherited by child jobs
default_version_regex  No        Version regex inherited by child jobs
auto_delete_orphans    No        Delete child jobs for removed charts

Business Rules

Chart Discovery:

  1. Fetch index.yaml from repository
  2. Apply optional chart_filter regex
  3. Extract chart name, description, latest version

Child Job Management:

For each discovered chart:
├── Build artifact_name: "repo-url/chart-name"
├── Check if gather job exists for this artifact
│   ├── YES: Skip (job already exists)
│   └── NO: Create gather job with:
│       ├── source_type = helm_repository
│       ├── schedule = default_schedule
│       ├── tag_filter = default_tag_filter
│       ├── version_regex = default_version_regex
│       └── parent_sync_job_id = this job's ID
└── Track artifact_name in discovered list
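
The orphan set computed from the discovered list can be sketched as a map-based diff (orphanedArtifacts is an illustrative name; the actual deletion happens in the SQL below):

```go
package main

import (
	"fmt"
	"sort"
)

// orphanedArtifacts returns child-job artifact names that no longer
// correspond to a discovered chart — the set deleted when
// auto_delete_orphans is enabled.
func orphanedArtifacts(existing, discovered []string) []string {
	seen := make(map[string]bool, len(discovered))
	for _, a := range discovered {
		seen[a] = true
	}
	var orphans []string
	for _, a := range existing {
		if !seen[a] {
			orphans = append(orphans, a)
		}
	}
	sort.Strings(orphans) // deterministic order for logging
	return orphans
}

func main() {
	existing := []string{"repo/argo-cd", "repo/removed-chart"}
	discovered := []string{"repo/argo-cd", "repo/new-chart"}
	fmt.Println(orphanedArtifacts(existing, discovered)) // [repo/removed-chart]
}
```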

Orphan Deletion (when auto_delete_orphans = true):

DELETE FROM gather_jobs
WHERE parent_sync_job_id = @sync_job_id
  AND artifact_name NOT IN (@discovered_artifacts)

State Machine

Same as Gather Jobs, with additional post-completion processing to create/delete child jobs.

Side Effects

  • On Success:
    • Create gather jobs for newly discovered charts
    • Delete gather jobs for removed charts (if auto_delete_orphans)
    • Reschedule if cron schedule exists

5. Task Execution System

Files: pkg/taskengine/dispatcher.go, pkg/taskengine/types.go, pkg/shared/types.go, pkg/taskengine/errors.go, pkg/api/tasks_handler.go

Purpose

Coordinate distributed job execution between agents and server through a token-based system with idempotency guarantees.

Token Lifecycle

┌──────────────────────────────────────────────────────────────────┐
│                      TOKEN LIFECYCLE                              │
├──────────────────────────────────────────────────────────────────┤
│                                                                   │
│  ┌──────────┐    ┌────────────┐    ┌────────────┐    ┌────────┐  │
│  │GENERATION│───▶│ VALIDATION │───▶│ COMPLETION │───▶│CONSUMED│  │
│  └──────────┘    └────────────┘    └────────────┘    └────────┘  │
│       │                │                 │               │        │
│       ▼                ▼                 ▼               ▼        │
│  UUID created    Check expiry     Mark complete    Token          │
│  Expiry set      Check status     Process result   invalidated    │
│  Stored in DB    Return task      Update job                      │
│                                                                   │
└──────────────────────────────────────────────────────────────────┘

Generation:

  • New UUID via uuid.New()
  • Expiry: time.Now().UTC().Add(timeout) (default 5 minutes, configurable via task_execution_timeout_seconds)
  • Stored in task_executions table

Validation:

  • Lookup by token
  • Check status is in_progress
  • Check not expired

Completion:

  • Update status to completed or failed
  • Process job-specific results
  • Clear execution link from job

Assignment Flow

Agent polls: POST /tasks/{AgentUUID}/poll
    │
    ├── Capabilities: ["gather", "scrape", "helm_sync"]
    ├── Available credentials: ["github_token", "ssh_key"]
    ├── Organization ID: from agent's API key
    │
    ▼
Dispatcher.AssignTaskWithCredentials()
    │
    ├── Try ClaimNextPendingGatherJob (SKIP LOCKED, org-scoped)
    ├── Try ClaimNextPendingScrapeJob (SKIP LOCKED, org-scoped)
    └── Try ClaimNextPendingHelmSyncJob (SKIP LOCKED, org-scoped)
    │
    ▼
CreateTaskExecution() + Return TaskAssignment
    │
    ├── execution_token: UUID
    ├── expires_at: timestamp
    ├── job_type: "gather"|"scrape"|"helm_sync"
    ├── job_id: int64
    └── task_data: job-specific details

Completion Flow

Agent completes: POST /tasks/{AgentUUID}/complete
    │
    ├── execution_token: UUID
    ├── success: bool
    ├── error: optional string
    └── result_data: JSON
    │
    ▼
CompleteExecution()
    │
    ├── Lookup execution by token
    │   └── Not found → ErrTokenInvalid (409)
    │
    ├── Check status == 'in_progress'
    │   └── Already completed → ErrTokenAlreadyCompleted (202)
    │
    ├── Check not expired
    │   └── Expired → ErrTokenExpired (409)
    │
    └── Update execution status
    │
    ▼
ProcessTaskResult() (job-type specific)
    │
    └── Return 202 Accepted

Error States

Error                     HTTP Status      Scenario
ErrNoTasksAvailable       204 No Content   No pending tasks match agent capabilities
ErrTokenInvalid           409 Conflict     Token not found in database
ErrTokenExpired           409 Conflict     Token past expiry timestamp
ErrTokenAlreadyCompleted  202 Accepted     Idempotent retry (result already recorded)
ErrJobNotFound            500              Job missing during result processing
ErrUnsupportedJobType     500              Unknown job type in execution

Idempotency Guarantees

Database-Level (SKIP LOCKED):

SELECT id FROM jobs
WHERE status = 'pending'
  AND next_run_at <= CURRENT_TIMESTAMP
ORDER BY next_run_at ASC
LIMIT 1
FOR UPDATE SKIP LOCKED

Prevents multiple agents from claiming the same job.

Organization-Level (Org-Scoped Claims):

All ClaimNextQueued* queries filter by organization_id, ensuring agents only claim jobs belonging to their own organization. The serveragent (org_id=0) only claims global jobs; clientagents only claim their respective organization’s jobs. This enforces multi-tenant isolation at the task claim layer.

Token-Level (State Check):

if execution.Status != repository.JobStatusInProgress {
    return nil, ErrTokenAlreadyCompleted
}

Allows safe agent retries without duplicate processing.


6. Background Scheduler

Files: pkg/taskengine/engine.go, pkg/taskengine/scheduler.go, pkg/taskengine/processor.go

Processing Loops

The TaskEngine runs two concurrent background loops:

Loop       Interval    Purpose                                    Batch Size
Processor  5 seconds   Process pending results                    100
Scheduler  30 seconds  Timeout detection + scheduled activation   100

Timeout Handling

Stale Job Detection: Jobs in in_progress status for more than 1 hour are reset to pending:

-- ResetStaleGatherJobs / ResetStaleScrapeJobs
UPDATE jobs
SET status = 'pending',
    claimed_by = NULL,
    claimed_at = NULL
WHERE status = 'in_progress'
  AND claimed_at < NOW() - INTERVAL '1 hour'

Purpose: Recover from agent crashes where no completion/failure was reported.

Scheduled Activation

Jobs with cron schedules follow this lifecycle:

┌──────────┐   ┌─────────────┐   ┌───────────┐   ┌───────────┐
│ COMPLETED│──▶│  RESCHEDULE │──▶│  WAITING  │──▶│ ACTIVATED │
│          │   │             │   │           │   │           │
│ Job done │   │ Calculate   │   │next_run_at│   │ Set status│
│          │   │ next_run_at │   │ in future │   │ = pending │
└──────────┘   └─────────────┘   └───────────┘   └───────────┘
                                       │               │
                                       │  time passes  │
                                       └───────────────┘

Rescheduling (Processor, on completion):

nextRun := cron.NextRun(job.Schedule, time.Now().UTC())
repo.RescheduleJob(jobID, nextRun)

Activation (Scheduler, every 30 seconds):

-- GetScheduledJobsReadyToRun
SELECT id FROM jobs
WHERE status = 'completed'
  AND schedule IS NOT NULL
  AND next_run_at <= CURRENT_TIMESTAMP

-- ActivateScheduledJob
UPDATE jobs
SET status = 'pending',
    next_run_at = NULL
WHERE id = @id

Orphan Detection

Service-Based Orphan Recovery: Jobs claimed by agents no longer in service_instances table are reset:

-- ResetOrphanedGatherJobs
UPDATE gather_jobs
SET status = 'pending',
    claimed_by = NULL
WHERE status IN ('pending', 'in_progress')
  AND claimed_by IS NOT NULL
  AND claimed_by NOT IN (
    SELECT instance_uuid FROM service_instances WHERE service_id = 2
  )

Orphan Cleanup Service (Server):

  • Runs every 2 minutes
  • 30-second startup delay (allows agents to register)
  • Also enforces version snapshot history limits

7. Rules Engine

Files: pkg/rules/evaluator.go, pkg/rules/engine.go, pkg/rules/types.go

Rule Types

Rule Type      Measures                    Calculation
days_behind    Age of discovered version   time.Since(releaseDate).Hours() / 24
majors_behind  Major version difference    latestMajor - discoveredMajor
minors_behind  Minor version difference    Release-list counting or formula-based

Threshold Tiers

Each rule defines three thresholds (must be ordered: moderate ≤ high ≤ critical):

type Rule struct {
    ModerateThreshold  int    // First tier
    HighThreshold      int    // Second tier
    CriticalThreshold  int    // Third tier
}

Severity Determination (highest wins):

if behindBy >= CriticalThreshold → CRITICAL
else if behindBy >= HighThreshold → HIGH
else if behindBy >= ModerateThreshold → MODERATE
else → No violation
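
As a Go sketch (severity is an illustrative function name; an empty string stands for "no violation"):

```go
package main

import "fmt"

// severity applies the "highest wins" tier check against the rule's
// ordered thresholds (moderate <= high <= critical).
func severity(behindBy, moderate, high, critical int) string {
	switch {
	case behindBy >= critical:
		return "critical"
	case behindBy >= high:
		return "high"
	case behindBy >= moderate:
		return "moderate"
	default:
		return "" // no violation
	}
}

func main() {
	fmt.Println(severity(45, 30, 60, 90)) // moderate
}
```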

Evaluation Algorithm

1. Validate inputs (config, rule, discovered version required)
       │
       ▼
2. Get latest release from gather job
   └── If none available → No violation (can't compare)
       │
       ▼
3. Apply stable_only filter (if enabled)
   └── If latest is prerelease → Find next stable release
       │
       ▼
4. Calculate behindBy based on rule type:
   │
   ├── days_behind:
   │   └── Get discovered version's release date
   │       └── If not found → ErrVersionNotFound → CRITICAL
   │   └── behindBy = days since release
   │
   ├── majors_behind:
   │   └── Parse both versions as semver
   │       └── If parse fails → ErrVersionParseFailed → CRITICAL
   │   └── behindBy = latestMajor - discoveredMajor
   │
   └── minors_behind:
       ├── Preferred: Count unique major.minor from releases list
       └── Fallback: Formula calculation
       │
       ▼
5. Determine severity from thresholds
       │
       ▼
6. Return EvaluationResult with severity, behindBy, message

Minors Behind Calculation

Release-List Mode (preferred): Counts unique major.minor combinations between discovered and latest versions.

Formula Mode (fallback when releases unavailable):

Same major:      latestMinor - discoveredMinor
Different major: (latestMajor - discoveredMajor) + latestMinor

Example: 7.9 → 8.1 = (8-7) + 1 = 2 minors behind
Example: 6.11 → 8.1 = (8-6) + 1 = 3 minors behind
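
The formula mode translates directly (illustrative helper; the real engine parses the semver components first):

```go
package main

import "fmt"

// minorsBehindFormula is the fallback calculation used when the releases
// list is unavailable: same-major is a simple minor delta; across majors
// the latest version's minors are added on top of the major gap.
func minorsBehindFormula(discoveredMajor, discoveredMinor, latestMajor, latestMinor int) int {
	if discoveredMajor == latestMajor {
		return latestMinor - discoveredMinor
	}
	return (latestMajor - discoveredMajor) + latestMinor
}

func main() {
	fmt.Println(minorsBehindFormula(7, 9, 8, 1))  // 2
	fmt.Println(minorsBehindFormula(6, 11, 8, 1)) // 3
}
```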

Stable-Only Filtering

When stable_only = true, prereleases are excluded from latest version lookup.

Prerelease Detection (IsStableVersion()): Returns false if version contains (case-insensitive):

  • alpha, beta, rc, dev, snapshot, canary, nightly, pre
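
A sketch of the substring check (isStable is an illustrative name; note that broad substring matching means e.g. "pre" also catches "preview"):

```go
package main

import (
	"fmt"
	"strings"
)

// isStable mirrors the documented IsStableVersion() marker list: any
// case-insensitive occurrence marks the version as a prerelease.
func isStable(version string) bool {
	v := strings.ToLower(version)
	for _, marker := range []string{"alpha", "beta", "rc", "dev", "snapshot", "canary", "nightly", "pre"} {
		if strings.Contains(v, marker) {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(isStable("1.2.3"), isStable("1.2.3-rc.1")) // true false
}
```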

Special Cases

Version Not Found:

  • Trigger: Discovered version has no release record (for days_behind)
  • Result: Marked as CRITICAL with behindBy = -1
  • Message: "Version X not found in upstream releases - cannot determine age (marked as critical)"

Version Parse Failure:

  • Trigger: Semver parsing fails for either version
  • Result: Marked as CRITICAL with behindBy = -1
  • Message: "Cannot parse version for comparison: <error> (marked as critical)"

8. Alert System

Files: pkg/alerts/service.go, pkg/api/alerts_handlers.go, pkg/api/alert_configs_handler.go, pkg/rules/engine.go, pkg/storage/queries/alerts.sql

Core Concepts

One Alert Per Config: Each alert configuration has at most ONE active alert at any time. When a scrape job discovers a new version, the existing alert is updated in place rather than creating a new alert. This ensures alerts always reflect the current state of the monitored system.

Soft Delete: Resolved alerts are soft-deleted (marked with resolved_at timestamp) rather than permanently deleted. This preserves alert history for auditing and analysis.

Alert Lifecycle

┌─────────────────────────────────────────────────────────────────┐
│                     ALERT LIFECYCLE                              │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  ┌──────────┐    ┌────────────────┐    ┌─────────────────────┐  │
│  │ CREATED  │───▶│ VERSION UPDATE │───▶│    ACKNOWLEDGED     │  │
│  │          │    │                │    │                     │  │
│  │ Violation│    │ Alert updated  │    │ User marks as       │  │
│  │ detected │    │ with new       │    │ reviewed            │  │
│  └──────────┘    │ version data   │    └─────────────────────┘  │
│       │          └────────────────┘             │                │
│       │                 │                       │                │
│       │                 │     ┌─────────────────┘                │
│       │                 │     │                                  │
│       │                 ▼     ▼                                  │
│       │          ┌──────────────────┐                            │
│       │          │  UNACKNOWLEDGED  │                            │
│       │          │                  │                            │
│       │          │  Ack reset when  │                            │
│       │          │  version changes │                            │
│       │          └──────────────────┘                            │
│       │                 │                                        │
│       │                 │                                        │
│       ▼                 ▼                                        │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │                       RESOLVED                              │  │
│  │  Version updated, no longer violates                        │  │
│  │  → Alert soft-deleted (resolved_at = now)                   │  │
│  │  → Preserved in history via /alerts/resolved endpoint       │  │
│  └────────────────────────────────────────────────────────────┘  │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

Database Schema

Key Fields:

| Field | Type | Description |
|---|---|---|
| alert_config_id | BIGINT | Link to alert configuration (unique per active alert) |
| discovered_version | VARCHAR | Version found by scrape job |
| latest_version | VARCHAR | Latest upstream version |
| behind_by | INT | How far behind (days, majors, or minors) |
| severity | ENUM | moderate, high, or critical |
| is_acknowledged | BOOLEAN | Whether user has acknowledged |
| resolved_at | TIMESTAMP | When resolved (NULL = active) |

Unique Constraint: Partial unique index on alert_config_id WHERE resolved_at IS NULL

  • Ensures only ONE active alert per config
  • Allows multiple resolved alerts in history

Upsert Behavior

Alerts use a partial unique index for upsert operations:

INSERT INTO alerts (...)
ON CONFLICT (alert_config_id) WHERE resolved_at IS NULL
DO UPDATE SET
    discovered_version = EXCLUDED.discovered_version,
    latest_version = EXCLUDED.latest_version,
    behind_by = EXCLUDED.behind_by,
    severity = EXCLUDED.severity,
    details = EXCLUDED.details,
    updated_at = CURRENT_TIMESTAMP,
    -- Only reset acknowledgement when version actually changes
    is_acknowledged = CASE
        WHEN alerts.discovered_version != EXCLUDED.discovered_version THEN FALSE
        ELSE alerts.is_acknowledged
    END,
    acknowledged_by = CASE
        WHEN alerts.discovered_version != EXCLUDED.discovered_version THEN NULL
        ELSE alerts.acknowledged_by
    END,
    acknowledged_at = CASE
        WHEN alerts.discovered_version != EXCLUDED.discovered_version THEN NULL
        ELSE alerts.acknowledged_at
    END

Key Behaviors:

  1. Single Alert: Each config has at most one active alert
  2. In-Place Updates: When the scrape job discovers a new version, the existing alert updates with the new version data
  3. Smart Acknowledgment Reset: Acknowledgment is only cleared when the discovered version changes, not on every evaluation
  4. Preserved History: Resolved alerts remain in the database with resolved_at set
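The acknowledgment-reset rule in the upsert above can be expressed as a plain function. A minimal sketch (the type and function names are illustrative, not from the codebase):

```go
package main

import "fmt"

// Alert holds the subset of alert fields relevant to the upsert decision.
type Alert struct {
	DiscoveredVersion string
	IsAcknowledged    bool
	AcknowledgedBy    string
}

// applyUpsert mirrors the SQL CASE expressions: acknowledgment state is
// preserved unless the discovered version actually changed.
func applyUpsert(existing Alert, incomingVersion string) Alert {
	if existing.DiscoveredVersion != incomingVersion {
		// Version changed: clear acknowledgment so the alert demands review again.
		return Alert{DiscoveredVersion: incomingVersion}
	}
	// Same version re-evaluated: keep acknowledgment untouched.
	return existing
}

func main() {
	acked := Alert{DiscoveredVersion: "1.2.0", IsAcknowledged: true, AcknowledgedBy: "ops@example.com"}
	fmt.Println(applyUpsert(acked, "1.2.0").IsAcknowledged) // same version: ack survives
	fmt.Println(applyUpsert(acked, "1.3.0").IsAcknowledged) // new version: ack cleared
}
```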

Alert Resolution (Soft Delete)

When an alert config no longer violates (e.g., version was updated):

-- ResolveAlert: Soft delete by setting resolved_at
UPDATE alerts
SET resolved_at = CURRENT_TIMESTAMP,
    updated_at = CURRENT_TIMESTAMP
WHERE id = @id AND resolved_at IS NULL
RETURNING *;

Resolution triggers notification: When an alert is resolved, an alert.resolved notification is dispatched to configured channels.

Audit Trail (Alert Actions)

Every alert state change is recorded in the audit_trail table using a decorator pattern. The AuditedAlertService wraps the AlertService interface and transparently records entries after each successful operation.

Tracked Actions:

| Action | Trigger | Source |
|---|---|---|
| created | Rule evaluation creates a new alert | system |
| escalated | Rule evaluation increases severity | system |
| acknowledge | User acknowledges via UI/API/webhook | ui, api, or webhook |
| unacknowledge | User removes acknowledgment, or version changes reset it | ui, api, or system |
| resolve | User manually resolves, or rule evaluation resolves | ui, api, or system |

Source Determination: The source is derived from the authentication method used:

  • JWT authentication → ui (browser-based user action)
  • API key authentication → api (programmatic access)
  • Webhook callback token → webhook (external callback)
  • No auth context (system operation) → system (rules engine, auto-reset)
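The mapping above reduces to a small switch; a sketch with illustrative constant names (the real request-context plumbing is omitted):

```go
package main

import "fmt"

// AuthMethod identifies how a request was authenticated (illustrative names).
type AuthMethod int

const (
	AuthNone         AuthMethod = iota // no auth context: internal/system operation
	AuthJWT                            // browser session
	AuthAPIKey                         // programmatic access
	AuthWebhookToken                   // external callback token
)

// auditSource maps the authentication method to the audit-trail source value.
func auditSource(m AuthMethod) string {
	switch m {
	case AuthJWT:
		return "ui"
	case AuthAPIKey:
		return "api"
	case AuthWebhookToken:
		return "webhook"
	default:
		return "system"
	}
}

func main() {
	fmt.Println(auditSource(AuthJWT), auditSource(AuthNone)) // ui system
}
```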

Data Model: The audit_trail table uses a polymorphic design (entity_type + entity_id) that can be extended to other entity types in the future. Each entry stores the action type, source, optional actor email (for user-initiated actions), and optional metadata as JSONB. Audit records persist independently of entity lifecycle (no cascade delete on the alert FK).

API Endpoint: GET /alerts/{id}/actions returns paginated audit trail entries for a specific alert. The alert detail UI merges these entries with notification deliveries into a unified activity timeline sorted by timestamp.

Prometheus Metrics: The audit writer exposes planekeeper_audit_writer_events_written_total, planekeeper_audit_writer_persist_errors_total, and planekeeper_audit_writer_insert_duration_seconds_total via the /metrics endpoint.

Auto-Triggers (Event-Driven)

Rule evaluation is triggered asynchronously via the event bus system (pkg/events/). When triggering events occur, they are published to the event bus, and the RuleEvaluationSubscriber handles the evaluation in a goroutine.

Event Flow:

┌─────────────────────┐    ┌──────────────────┐    ┌─────────────────────────┐
│  Job Completion /   │───▶│    Event Bus     │───▶│ RuleEvaluationSubscriber│
│  Config Change      │    │  (pkg/events/)   │    │ (async goroutine)       │
└─────────────────────┘    └──────────────────┘    └─────────────────────────┘

Triggering Events:

| Event Type | Trigger | Published By |
|---|---|---|
| job.scrape.completed | Scrape job completes successfully | tasks_handler.go |
| job.gather.completed | Gather job completes successfully | tasks_handler.go |
| alert_config.created | Alert config created and active | alert_configs_handler.go |
| alert_config.updated | Alert config updated and active | alert_configs_handler.go |
| alert_config.toggled | Alert config toggled to active | alert_configs_handler.go |

Benefits of Event-Driven Triggers:

  1. Decoupling: Handlers don’t need direct references to the rule evaluation logic
  2. Non-blocking: HTTP handlers return immediately; evaluation runs async
  3. Extensibility: Other subscribers can react to the same events (metrics, logging, etc.)
  4. Scalability: Event bus can be swapped for external queue (Redis, NATS) for distributed evaluation
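The non-blocking publish half of this flow can be sketched with a buffered channel; the event and field names below are illustrative, not the pkg/events types:

```go
package main

import (
	"fmt"
	"time"
)

// JobCompletedEvent is a sketch of the payload published when a job finishes.
type JobCompletedEvent struct {
	OrgID       int64
	JobID       int64
	JobType     string // "scrape", "gather", "helm_sync"
	CompletedAt time.Time
}

// Event pairs a type string with an arbitrary payload.
type Event struct {
	Type    string
	Payload any
}

// publish is a stand-in for bus.Publish: the HTTP handler fires and forgets,
// returning to the client while rule evaluation happens elsewhere.
func publish(ch chan<- Event, e Event) bool {
	select {
	case ch <- e: // buffered: normally succeeds immediately
		return true
	default: // buffer full: drop with a warning rather than block the handler
		return false
	}
}

func main() {
	bus := make(chan Event, 100)
	ok := publish(bus, Event{
		Type:    "job.scrape.completed",
		Payload: JobCompletedEvent{OrgID: 1, JobID: 7, JobType: "scrape", CompletedAt: time.Now()},
	})
	fmt.Println(ok) // true
}
```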

Alert Config Composition

An alert config links three entities:

┌─────────────────────────────────────────────────────────────┐
│                    ALERT CONFIG                              │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  scrape_job_id  ───▶  "What version did we deploy?"         │
│                       (discovered_version)                   │
│                                                              │
│  gather_job_id  ───▶  "What's the latest available?"        │
│                       (latest_version)                       │
│                                                              │
│  rule_id        ───▶  "How do we measure staleness?"        │
│                       (days_behind, majors_behind, etc.)     │
│                                                              │
│  UNIQUE (org_id, scrape_job_id, gather_job_id, rule_id)     │
│                                                              │
└─────────────────────────────────────────────────────────────┘

API Endpoints

Active Alerts:

| Method | Path | Description |
|---|---|---|
| GET | /alerts | List active (non-resolved) alerts |
| GET | /alerts/summary | Count active alerts by severity |
| POST | /alerts/{id}/acknowledge | Acknowledge an active alert |
| POST | /alerts/{id}/unacknowledge | Remove acknowledgement |
| POST | /alerts/acknowledge | Bulk acknowledge multiple alerts |

Resolved Alerts (History):

| Method | Path | Description |
|---|---|---|
| GET | /alerts/resolved | List resolved alerts with pagination |

Query Parameters (for both endpoints):

  • limit - Max results (default 50, max 100)
  • offset - Pagination offset
  • severity - Filter by severity level
  • unacknowledged_only - Only show unacknowledged (active only)

Notification Events

The alert system dispatches notifications for lifecycle events:

| Event | Trigger |
|---|---|
| alert.created | New alert created (first violation) |
| alert.escalated | Severity increased (e.g., high → critical) |
| alert.acknowledged | User acknowledges via API |
| alert.unacknowledged | User removes acknowledgement |
| alert.resolved | Version updated, no longer violates |

Note: Non-escalating updates (same severity, just refreshed data) do not trigger notifications.

Event-Driven Alert Service

Files: pkg/alerts/service.go

All alert state changes flow through a centralized alert service (pkg/alerts/Service) that automatically dispatches notifications. This ensures consistent notification behavior regardless of how alerts are modified (API, webhook, rules engine).

Architecture:

┌─────────────────────────────────────────────────────────────────────┐
│                      ALERT STATE CHANGES                             │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  ┌─────────────────┐                                                 │
│  │ API Handlers    │──┐                                              │
│  │ (ack/unack)     │  │                                              │
│  └─────────────────┘  │                                              │
│                       │      ┌─────────────────────────────────────┐ │
│  ┌─────────────────┐  │      │        ALERT SERVICE               │ │
│  │ Webhook Ack     │──┼─────▶│  pkg/alerts/service.go             │ │
│  │ (external)      │  │      │                                     │ │
│  └─────────────────┘  │      │  • Acknowledge()                    │ │
│                       │      │  • Unacknowledge()                  │ │
│  ┌─────────────────┐  │      │  • Upsert() (create/update)         │ │
│  │ Rules Engine    │──┘      │  • Resolve()                        │ │
│  │ (evaluation)    │         │                                     │ │
│  └─────────────────┘         │  ─────────────────────────────────  │ │
│                              │  Auto-dispatches notifications:     │ │
│                              │  • alert.created                    │ │
│                              │  • alert.escalated                  │ │
│                              │  • alert.acknowledged               │ │
│                              │  • alert.unacknowledged             │ │
│                              │  • alert.resolved                   │ │
│                              └──────────────┬──────────────────────┘ │
│                                             │                        │
│                                             ▼                        │
│                              ┌─────────────────────────────────────┐ │
│                              │   Notification Dispatcher           │ │
│                              │   pkg/notifications/dispatcher.go   │ │
│                              └─────────────────────────────────────┘ │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

Service Methods:

| Method | Description | Event Dispatched |
|---|---|---|
| Acknowledge() | Mark alert acknowledged | alert.acknowledged |
| Unacknowledge() | Remove acknowledgment | alert.unacknowledged |
| Upsert() | Create or update alert | alert.created or alert.escalated |
| Resolve() | Mark alert resolved | alert.resolved |
| ResolveByConfigID() | Resolve by config | alert.resolved |
| BulkAcknowledge() | Bulk operation | None (avoids spam) |

Benefits:

  1. Consistent Notifications: Every state change automatically dispatches events
  2. Single Source of Truth: All alert logic centralized in one service
  3. Thin Handlers: API handlers become simple pass-throughs to the service
  4. Testability: Service can be mocked for unit testing

Usage Pattern:

// Before (scattered notification dispatch):
alert, err := repo.AcknowledgeAlert(ctx, params)
dispatcher.DispatchForAlert(ctx, alert, EventAlertAcknowledged)

// After (single service call does everything):
alert, err := alertService.Acknowledge(ctx, params)
// Notification automatically dispatched

Event Bus System

Files: pkg/events/bus.go, pkg/events/types.go, pkg/events/subscribers.go

The system uses an in-process event bus for decoupled asynchronous communication between components. This enables loose coupling while maintaining reliability within a single process.

Architecture:

┌─────────────────────────────────────────────────────────────────────┐
│                        EVENT BUS SYSTEM                              │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  Publishers                     Event Bus                Subscribers │
│  ───────────                    ─────────                ─────────── │
│                                                                      │
│  ┌─────────────────┐      ┌─────────────────────┐                    │
│  │ TasksHandler    │──┐   │                     │   ┌──────────────┐ │
│  │ (job completed) │  │   │  In-process Bus     │   │ RuleEval     │ │
│  └─────────────────┘  │   │                     │   │ Subscriber   │ │
│                       ├──▶│  • Buffered channel │──▶│              │ │
│  ┌─────────────────┐  │   │  • Async delivery   │   │ Evaluates    │ │
│  │ AlertConfigs    │──┘   │  • Panic recovery   │   │ rules for    │ │
│  │ (config change) │      │                     │   │ organization │ │
│  └─────────────────┘      └─────────────────────┘   └──────────────┘ │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

Event Types:

| Event Type | Description | Payload |
|---|---|---|
| job.scrape.completed | Scrape job finished | JobCompletedEvent |
| job.gather.completed | Gather job finished | JobCompletedEvent |
| job.helm_sync.completed | Helm sync job finished | JobCompletedEvent |
| alert_config.created | Alert config created | AlertConfigChangedEvent |
| alert_config.updated | Alert config updated | AlertConfigChangedEvent |
| alert_config.toggled | Alert config toggled | AlertConfigChangedEvent |

Event Bus Features:

  1. Buffered Channel: Default buffer of 100 events; non-blocking publish with overflow warning
  2. Async Delivery: Single dispatcher goroutine processes events sequentially
  3. Panic Recovery: Handler panics are caught and logged without crashing the bus
  4. Graceful Shutdown: Stop() method drains pending events before closing
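The four features above fit in a small sketch: a buffered channel drained by one dispatcher goroutine, with panic recovery around each handler. This is an illustrative reduction, not the pkg/events implementation (events are simplified to strings):

```go
package main

import (
	"fmt"
	"sync"
)

// Bus is a minimal in-process event bus.
type Bus struct {
	ch       chan string
	handlers map[string][]func(string)
	wg       sync.WaitGroup
}

func NewBus(buffer int) *Bus {
	b := &Bus{ch: make(chan string, buffer), handlers: map[string][]func(string){}}
	b.wg.Add(1)
	go b.dispatch() // single dispatcher goroutine processes events sequentially
	return b
}

func (b *Bus) Subscribe(eventType string, h func(string)) {
	b.handlers[eventType] = append(b.handlers[eventType], h)
}

func (b *Bus) Publish(eventType string) { b.ch <- eventType }

// Stop closes the channel; the dispatcher drains pending events first.
func (b *Bus) Stop() { close(b.ch); b.wg.Wait() }

func (b *Bus) dispatch() {
	defer b.wg.Done()
	for evt := range b.ch {
		for _, h := range b.handlers[evt] {
			func() {
				defer func() {
					if r := recover(); r != nil {
						fmt.Println("handler panicked:", r) // logged, bus keeps running
					}
				}()
				h(evt)
			}()
		}
	}
}

func main() {
	bus := NewBus(100)
	bus.Subscribe("job.scrape.completed", func(e string) { panic("boom") })
	bus.Subscribe("job.scrape.completed", func(e string) { fmt.Println("evaluated rules for", e) })
	bus.Publish("job.scrape.completed") // first handler panics, second still runs
	bus.Stop()
}
```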

Subscriber Pattern:

// Subscribers implement their own logic
type RuleEvaluationSubscriber struct {
    db           *postgres.Database
    alertService *alerts.Service
}

// Subscribe registers handlers for relevant event types
func (s *RuleEvaluationSubscriber) Subscribe(bus *events.Bus) {
    bus.SubscribeMultiple([]events.EventType{
        events.EventJobScrapeCompleted,
        events.EventJobGatherCompleted,
    }, s.handleJobCompleted)
}

Future Extensibility:

The event bus interface can be swapped for an external message queue (Redis Streams, NATS, RabbitMQ) if horizontal scaling requires distributed event processing. The subscriber pattern remains the same; only the transport layer changes.


9. Notification System

Files: pkg/notifications/, pkg/api/notification_*_handlers.go, app/notifier/, pkg/storage/queries/notification_*.sql

Purpose

Deliver notifications about alert events to external systems via webhooks and other channel types. The system supports configurable routing rules, retry logic with exponential backoff, and acknowledgment callbacks.

Architecture

┌──────────────────────────────────────────────────────────────────┐
│                         SERVER                                    │
│  ┌─────────────────────────────────────────────────────────────┐ │
│  │                    Dispatcher                                │ │
│  │  - Evaluates notification rules on alert events              │ │
│  │  - Creates delivery records in notification_deliveries       │ │
│  │  - Does NOT send webhooks directly                           │ │
│  └──────────────────────────┬──────────────────────────────────┘ │
└─────────────────────────────┼────────────────────────────────────┘
                              │ INSERT into notification_deliveries
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│                    PostgreSQL                                    │
│  notification_deliveries (status: pending)                       │
└─────────────────────────────┬───────────────────────────────────┘
                              │ FOR UPDATE SKIP LOCKED
                              ▼
┌─────────────┐  ┌─────────────┐  ┌─────────────┐
│  notifier   │  │  notifier   │  │  notifier   │
│  replica 1  │  │  replica 2  │  │  replica N  │
│  (worker)   │  │  (worker)   │  │  (worker)   │
└─────────────┘  └─────────────┘  └─────────────┘
      │                │                │
      └────────────────┼────────────────┘
                       ▼
              External Webhooks

9.1 Channel Types

| Channel Type | Status | Description |
|---|---|---|
| webhook | Implemented | HTTP POST to external URL with JSON payload |
| pagerduty | Planned | Native PagerDuty Events API integration |
| telegram | Planned | Telegram Bot API integration |
| smtp | Planned | Email notifications |

9.2 Notification Channels

Table: notification_channels

Channels define delivery endpoints (webhooks) with organization-scoped configuration.

| Field | Description |
|---|---|
| id | Unique identifier |
| organization_id | Owning organization |
| name | Human-readable name |
| channel_type | Type: webhook, pagerduty, telegram, smtp |
| config | JSONB with channel-specific configuration |
| is_active | Whether channel is enabled |
| max_retries | Per-channel retry override (NULL = global default) |
| last_test_at | Last test timestamp |
| last_test_success | Last test result |

Webhook Config Structure (stored in config JSONB):

{
  "url": "https://example.com/webhook",
  "method": "POST",
  "headers": {"Authorization": "Bearer xxx"},
  "timeout_seconds": 30,
  "ack_enabled": true,
  "secret": "hmac-signing-secret",
  "event_templates": {
    "new_alert": "",
    "acknowledged": "",
    "resolved": ""
  }
}

9.2.1 Event-Specific Templates

The notification system supports event-specific templates that allow customizing webhook payloads for different alert lifecycle events. This enables integration with services like Discord and Slack that require specific payload formats.

Template Categories:

| Category | Events | Description |
|---|---|---|
| new_alert | alert.created, alert.escalated, test | New or escalated alerts |
| acknowledged | alert.acknowledged, alert.unacknowledged | Acknowledgment state changes |
| resolved | alert.resolved | Alert resolution |

Template Resolution Priority:

Templates are resolved in order of specificity (most specific wins):

1. Channel-specific template (config.event_templates.X)
       │
       ▼ (if empty)
2. Organization-level template (settings table)
       │
       ▼ (if empty)
3. Global default template (settings table, org_id = NULL)
       │
       ▼ (if empty)
4. Standard JSON payload (no template)

This allows:

  • Global defaults for all organizations (generic JSON for standard webhooks)
  • Organization overrides for org-wide customization
  • Channel-specific templates for platforms like Discord/Slack
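The fallback chain above amounts to "first non-empty wins"; a minimal sketch (function name is illustrative):

```go
package main

import "fmt"

// resolveTemplate walks the priority chain: channel-specific, then
// organization-level, then global default. An empty string at every level
// means the standard JSON payload is used.
func resolveTemplate(channelTpl, orgTpl, globalTpl string) (tpl string, usesTemplate bool) {
	for _, t := range []string{channelTpl, orgTpl, globalTpl} {
		if t != "" {
			return t, true
		}
	}
	return "", false // fall through to the standard JSON payload
}

func main() {
	tpl, ok := resolveTemplate("", `{"content": "org default"}`, `{"text": "global"}`)
	fmt.Println(ok, tpl) // the org-level template wins over the global one
}
```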

Global Template Settings Keys:

| Setting Key | Category |
|---|---|
| notification.template.alert.new | New/escalated alerts |
| notification.template.alert.acknowledged | Acknowledgment events |
| notification.template.alert.resolved | Resolution events |

9.2.2 Template Variables

Common Variables (available in all templates):

{{.IdempotencyKey}}           - UUID, stable across retries
{{.Event}}                    - Event type (alert.created, etc.)
{{.Timestamp}}                - ISO8601 timestamp
{{.Alert.ID}}                 - Alert ID
{{.Alert.ConfigName}}         - Alert config name
{{.Alert.RuleName}}           - Rule name
{{.Alert.RuleType}}           - days_behind, majors_behind, etc.
{{.Alert.Severity}}           - critical, high, moderate
{{.Alert.DiscoveredVersion}}  - Current deployed version
{{.Alert.LatestVersion}}      - Latest available version
{{.Alert.BehindBy}}           - Number (days, versions, etc.)
{{.Alert.ArtifactName}}       - Upstream artifact name
{{.Alert.RepositoryURL}}      - Scrape job repository URL
{{.Alert.TargetFile}}         - Scrape job target file

Event-Specific Variables:

| Variable | Available In | Description |
|---|---|---|
| {{.AcknowledgeURL}} | new_alert only | Callback URL for one-click acknowledgment |
| {{.AcknowledgedBy}} | acknowledged only | Email/identifier of acknowledging user |
| {{.AcknowledgedAt}} | acknowledged only | ISO8601 timestamp of acknowledgment |
| {{.IsAcknowledged}} | acknowledged only | true for acknowledged, false for unacknowledged |
| {{.ResolvedAt}} | resolved only | ISO8601 timestamp of resolution |
| {{.PreviousSeverity}} | new_alert (escalated) | Previous severity before escalation |

Template Functions:

| Function | Description | Example |
|---|---|---|
| upper | Uppercase string | `{{.Alert.Severity \| upper}}` → CRITICAL |
| lower | Lowercase string | `{{.Event \| lower}}` → alert.created |
| json | JSON encode value | `{{.Alert \| json}}` → {"id":1,...} |
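These helpers map directly onto Go's text/template FuncMap. A sketch of the wiring (the registration in Planekeeper may differ, but text/template itself behaves as shown):

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
	"text/template"
)

// renderTemplate executes a notification template with the three documented
// helper functions wired in.
func renderTemplate(tplText string, data any) (string, error) {
	funcs := template.FuncMap{
		"upper": strings.ToUpper,
		"lower": strings.ToLower,
		"json": func(v any) (string, error) {
			b, err := json.Marshal(v)
			return string(b), err
		},
	}
	tpl, err := template.New("notification").Funcs(funcs).Parse(tplText)
	if err != nil {
		return "", err
	}
	var out strings.Builder
	if err := tpl.Execute(&out, data); err != nil {
		return "", err
	}
	return out.String(), nil
}

func main() {
	data := map[string]any{"Alert": map[string]any{"Severity": "critical"}}
	out, err := renderTemplate(`{{.Alert.Severity | upper}} alert`, data)
	fmt.Println(out, err) // CRITICAL alert <nil>
}
```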

9.2.3 Platform-Specific Examples

Generic JSON Webhook (default format when no template configured):

The system sends a structured JSON payload by default:

{
  "idempotency_key": "uuid-here",
  "event": "alert.created",
  "timestamp": "2024-01-15T10:30:00Z",
  "alert": {
    "id": 123,
    "config_name": "My Alert Config",
    "rule_name": "Critical Updates",
    "severity": "critical",
    "discovered_version": "1.0.0",
    "latest_version": "2.0.0",
    "behind_by": 30
  },
  "acknowledge_url": "https://app.example.com/api/v1/webhooks/acknowledge/token"
}

Discord Webhook:

Discord requires a content field. Configure these channel-specific templates:

New Alert Template:

{"content": "🚨 **{{.Alert.Severity | upper}} Alert**: {{.Alert.ConfigName}}\n\n**Artifact:** {{.Alert.ArtifactName}}\n**Current:** {{.Alert.DiscoveredVersion}} → **Latest:** {{.Alert.LatestVersion}}\n**Behind by:** {{.Alert.BehindBy}} {{.Alert.RuleType}}\n\n[Acknowledge]({{.AcknowledgeURL}})"}

Acknowledged Template:

{"content": "{{if .IsAcknowledged}}✅ **Acknowledged**{{else}}🔄 **Unacknowledged**{{end}}: {{.Alert.ConfigName}} - {{.Alert.ArtifactName}}{{if .IsAcknowledged}}\n\nAcknowledged by {{.AcknowledgedBy}}{{end}}"}

Resolved Template:

{"content": "🎉 **Resolved**: {{.Alert.ConfigName}} - {{.Alert.ArtifactName}}\n\nThe version has been updated and no longer triggers this alert."}

Slack Webhook:

Slack uses a text field for simple messages:

New Alert Template:

{"text": ":rotating_light: *{{.Alert.Severity | upper}}*: {{.Alert.ConfigName}}\nArtifact: {{.Alert.ArtifactName}}\nVersion {{.Alert.DiscoveredVersion}} is {{.Alert.BehindBy}} behind latest ({{.Alert.LatestVersion}})\n<{{.AcknowledgeURL}}|Click to Acknowledge>"}

Acknowledged Template:

{"text": "{{if .IsAcknowledged}}:white_check_mark: *Acknowledged*{{else}}:arrows_counterclockwise: *Unacknowledged*{{end}}: {{.Alert.ConfigName}} - {{.Alert.ArtifactName}}{{if .IsAcknowledged}}\nAcknowledged by {{.AcknowledgedBy}}{{end}}"}

Resolved Template:

{"text": ":tada: *Resolved*: {{.Alert.ConfigName}} - {{.Alert.ArtifactName}}\nThe alert has been automatically resolved."}

Slack Block Kit (rich formatting):

For more sophisticated Slack messages, use Block Kit:

New Alert Template:

{
  "blocks": [
    {
      "type": "header",
      "text": {"type": "plain_text", "text": "🚨 {{.Alert.Severity | upper}} Alert"}
    },
    {
      "type": "section",
      "fields": [
        {"type": "mrkdwn", "text": "*Config:*\n{{.Alert.ConfigName}}"},
        {"type": "mrkdwn", "text": "*Artifact:*\n{{.Alert.ArtifactName}}"},
        {"type": "mrkdwn", "text": "*Current:*\n{{.Alert.DiscoveredVersion}}"},
        {"type": "mrkdwn", "text": "*Latest:*\n{{.Alert.LatestVersion}}"}
      ]
    },
    {
      "type": "actions",
      "elements": [
        {
          "type": "button",
          "text": {"type": "plain_text", "text": "Acknowledge"},
          "url": "{{.AcknowledgeURL}}",
          "style": "primary"
        }
      ]
    }
  ]
}

Microsoft Teams (legacy MessageCard connector format):

New Alert Template:

{
  "@type": "MessageCard",
  "themeColor": "FF0000",
  "title": "{{.Alert.Severity | upper}} Alert: {{.Alert.ConfigName}}",
  "sections": [{
    "facts": [
      {"name": "Artifact", "value": "{{.Alert.ArtifactName}}"},
      {"name": "Current Version", "value": "{{.Alert.DiscoveredVersion}}"},
      {"name": "Latest Version", "value": "{{.Alert.LatestVersion}}"},
      {"name": "Behind By", "value": "{{.Alert.BehindBy}}"}
    ]
  }],
  "potentialAction": [{
    "@type": "OpenUri",
    "name": "Acknowledge",
    "targets": [{"os": "default", "uri": "{{.AcknowledgeURL}}"}]
  }]
}

PagerDuty Events API v2:

New Alert Template:

{
  "routing_key": "YOUR_INTEGRATION_KEY",
  "event_action": "trigger",
  "dedup_key": "{{.IdempotencyKey}}",
  "payload": {
    "summary": "{{.Alert.Severity | upper}}: {{.Alert.ConfigName}} - {{.Alert.ArtifactName}} is {{.Alert.BehindBy}} behind",
    "source": "planekeeper",
    "severity": "{{.Alert.Severity}}",
    "custom_details": {
      "discovered_version": "{{.Alert.DiscoveredVersion}}",
      "latest_version": "{{.Alert.LatestVersion}}",
      "repository": "{{.Alert.RepositoryURL}}",
      "target_file": "{{.Alert.TargetFile}}"
    }
  },
  "links": [{"href": "{{.AcknowledgeURL}}", "text": "Acknowledge in Planekeeper"}]
}

Resolved Template:

{
  "routing_key": "YOUR_INTEGRATION_KEY",
  "event_action": "resolve",
  "dedup_key": "{{.IdempotencyKey}}"
}

9.2.4 Configuring Templates

Via UI:

  1. Navigate to Notification Channels → Edit channel
  2. Check “Use Event-Specific Templates”
  3. Enter templates for each event category (New Alert, Acknowledged, Resolved)
  4. Leave empty to inherit from organization/global defaults

Via API:

# Create channel with event-specific templates
curl -X POST /api/v1/client/notification-channels \
  -H "X-API-Key: pk_..." \
  -d '{
    "name": "Discord Alerts",
    "channel_type": "webhook",
    "config": {
      "url": "https://discord.com/api/webhooks/...",
      "event_templates": {
        "new_alert": "{\"content\": \"🚨 **{{.Alert.Severity}}**: {{.Alert.ConfigName}}\"}",
        "acknowledged": "{\"content\": \"✅ Acknowledged: {{.Alert.ConfigName}}\"}",
        "resolved": "{\"content\": \"🎉 Resolved: {{.Alert.ConfigName}}\"}"
      }
    }
  }'

Organization-Level Defaults:

Set organization-wide templates via the settings API:

# Set org default for new alerts
curl -X PUT /api/v1/client/settings/notification.template.alert.new \
  -H "X-API-Key: pk_..." \
  -d '{"value": "{\"content\": \"🚨 {{.Alert.Severity}}: {{.Alert.ConfigName}}\"}"}'

9.2.5 Template Best Practices

  1. Use channel-specific templates for non-standard webhooks: Discord, Slack, Teams, and PagerDuty all have specific payload formats. Configure these at the channel level.

  2. Keep global defaults generic: The default templates produce standard JSON suitable for custom integrations. Don’t change these unless you want all organizations to use a specific format.

  3. Test templates before enabling: Use the channel test endpoint to verify your template produces valid output for the target platform.

  4. Include relevant context per event type:

    • New alerts: Include acknowledge URL, version details, severity
    • Acknowledged: Include who acknowledged and when
    • Resolved: Keep it simple - the alert is no longer actionable
  5. Escape special characters: JSON strings require escaping. Use \n for newlines, \" for quotes within strings.

  6. Use template functions: upper, lower, and json help format output appropriately for different platforms

9.3 Notification Rules

Table: notification_rules

Rules define routing logic: which events go to which channels based on severity and event type.

| Field | Description |
|---|---|
| id | Unique identifier |
| organization_id | Owning organization |
| name | Human-readable name |
| severity_filter | Array of severities to match (empty = all) |
| event_filter | Array of event types to match (empty = all) |
| channel_id | Override channel (NULL = use org default) |
| group_interval | Group alerts within this window |
| repeat_interval | Don't repeat for same alert within this window |
| is_active | Whether rule is enabled |
| priority | Higher priority rules evaluated first |

Event Types:

| Event | Description |
|---|---|
| alert.created | New violation detected |
| alert.escalated | Severity increased |
| alert.acknowledged | Alert marked as acknowledged |
| alert.unacknowledged | Acknowledgment reset (re-violation) |
| alert.resolved | Version updated, no longer violates |

Rule Evaluation:

1. Get active rules for org, ordered by priority DESC
       │
       ▼
2. For each rule:
   ├── Check severity_filter (empty = match all)
   ├── Check event_filter (empty = match all)
   │
   └── If match:
       ├── Check group/repeat intervals (prevent spam)
       ├── Get channel (rule override or org default)
       └── Create delivery record
       │
       ▼
3. Deduplicate channels (same alert → same channel only once)
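The two filter checks in step 2 follow the "empty = match all" convention; a sketch (the struct is an illustrative subset of the notification_rules row):

```go
package main

import "fmt"

// NotificationRule carries the two filter arrays; an empty slice matches
// everything, per the schema above.
type NotificationRule struct {
	SeverityFilter []string
	EventFilter    []string
}

func contains(list []string, v string) bool {
	for _, item := range list {
		if item == v {
			return true
		}
	}
	return false
}

// Matches reports whether the rule should fire for this event/severity pair.
func (r NotificationRule) Matches(event, severity string) bool {
	if len(r.SeverityFilter) > 0 && !contains(r.SeverityFilter, severity) {
		return false
	}
	if len(r.EventFilter) > 0 && !contains(r.EventFilter, event) {
		return false
	}
	return true
}

func main() {
	criticalOnly := NotificationRule{SeverityFilter: []string{"critical"}}
	fmt.Println(criticalOnly.Matches("alert.created", "critical")) // true
	fmt.Println(criticalOnly.Matches("alert.created", "moderate")) // false
	catchAll := NotificationRule{}
	fmt.Println(catchAll.Matches("alert.resolved", "high")) // empty filters match all
}
```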

9.4 Delivery Lifecycle

Table: notification_deliveries

Tracks the state and history of each notification delivery attempt.

┌─────────────────────────────────────────────────────────────────┐
│                   DELIVERY LIFECYCLE                             │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  ┌─────────┐    ┌─────────────┐    ┌───────────┐                │
│  │ PENDING │───▶│ IN_PROGRESS │───▶│ SUCCEEDED │                │
│  └─────────┘    └─────────────┘    └───────────┘                │
│       │               │                                          │
│       │               │  Error/timeout                           │
│       │               ▼                                          │
│       │         ┌──────────┐                                     │
│       │         │  FAILED  │◀─────────────────┐                  │
│       │         └──────────┘                  │                  │
│       │               │                       │                  │
│       │               │ Retry (attempts < max)│                  │
│       │               ▼                       │                  │
│       │         ┌─────────────┐               │                  │
│       │         │ IN_PROGRESS │───────────────┘                  │
│       │         └─────────────┘                                  │
│       │               │                                          │
│       │               │ Max attempts exceeded                    │
│       │               ▼                                          │
│       │         ┌─────────────┐                                  │
│       └────────▶│ DEAD_LETTER │                                  │
│                 └─────────────┘                                  │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

Status Transitions:

| From | To | Trigger |
|---|---|---|
| - | pending | Dispatcher creates delivery |
| pending | in_progress | Notifier claims delivery (SKIP LOCKED) |
| in_progress | succeeded | 2xx HTTP response |
| in_progress | failed | Error or non-2xx response (retries remain) |
| failed | in_progress | Retry timer expires |
| failed | dead_letter | Max attempts exceeded or 24h TTL |

9.5 Retry Logic

Multi-tier Exponential Backoff:

| Tier | Attempts | Delays | Use Case |
|---|---|---|---|
| Short-term | 1-4 | 10s, 30s, 1m, 5m | Transient errors |
| Mid-term | 5-8 | 15m, 30m, 1h, 2h | Service degradation |
| Long-term | 9-12 | 4h, 4h, 4h, 4h | Extended outage |

Total TTL: ~24 hours from first attempt → dead_letter

Jitter: Full jitter applied (delay = random(0, computedDelay)) to prevent thundering herd.

Retry-After Header: Honored from 429/503 responses.

Non-retryable Errors:

  • 4xx responses (except 429) → immediate dead_letter
  • Invalid configuration → immediate dead_letter

9.6 Webhook Payload

Default JSON Payload:

{
  "idempotency_key": "550e8400-e29b-41d4-a716-446655440000",
  "event": "alert.created",
  "timestamp": "2026-02-04T12:00:00Z",
  "alert": {
    "id": 123,
    "config_name": "K8s Dashboard Version Check",
    "rule_name": "Days Behind",
    "severity": "critical",
    "discovered_version": "1.25.0",
    "latest_version": "1.30.0",
    "behind_by": 45,
    "artifact_name": "kubernetes/kubernetes",
    "repository_url": "https://github.com/org/repo",
    "target_file": "chart/Chart.yaml"
  },
  "acknowledge_url": "https://planekeeper.example.com/api/v1/webhooks/acknowledge/{token}"
}

Headers:

  • Content-Type: application/json
  • X-Planekeeper-Signature: sha256={hmac} (if secret configured)
  • X-Planekeeper-Timestamp: {unix_seconds}
  • X-Planekeeper-Event: alert.created
  • X-Planekeeper-Idempotency-Key: {uuid}
  • Custom headers from channel config

HMAC Signature: HMAC-SHA256(secret, timestamp + "." + body)
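A receiver can verify the signature by recomputing the HMAC over timestamp + "." + body and comparing in constant time. A minimal Go sketch (function names are illustrative, not part of the Planekeeper API):

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// signPayload computes the value carried in X-Planekeeper-Signature:
// "sha256=" + hex(HMAC-SHA256(secret, timestamp + "." + body)).
func signPayload(secret, timestamp, body string) string {
	mac := hmac.New(sha256.New, []byte(secret))
	mac.Write([]byte(timestamp + "." + body))
	return "sha256=" + hex.EncodeToString(mac.Sum(nil))
}

// verifySignature recomputes the expected signature and compares it to
// the received header using a constant-time comparison.
func verifySignature(secret, timestamp, body, header string) bool {
	expected := signPayload(secret, timestamp, body)
	return hmac.Equal([]byte(expected), []byte(header))
}

func main() {
	body := `{"event":"alert.created"}`
	sig := signPayload("topsecret", "1738670400", body)
	fmt.Println(sig)
	fmt.Println(verifySignature("topsecret", "1738670400", body, sig))
}
```

Receivers should also reject requests whose X-Planekeeper-Timestamp is too far from the current time, to limit replay windows.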

9.7 Inbound Acknowledgment

Endpoint: POST /api/v1/webhooks/acknowledge/{token}

External systems can acknowledge alerts by calling the acknowledge_url included in the webhook payload.

Token Properties:

  • Generated per delivery
  • Stored in notification_ack_tokens table
  • Expires after 24 hours (configurable)
  • Single-use (marked as used after acknowledgment)

Flow:

External System                    Planekeeper
      │                                 │
      │  POST /webhooks/acknowledge/xyz │
      │────────────────────────────────▶│
      │                                 │ Lookup token
      │                                 │ Validate not expired
      │                                 │ Validate not already used
      │                                 │ Mark alert as acknowledged
      │                                 │ Mark token as used
      │      200 OK                     │
      │◀────────────────────────────────│

9.8 SSRF Protection

Webhook URLs are validated to prevent Server-Side Request Forgery:

Blocked by default:

  • Private IP ranges (RFC1918): 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16
  • Localhost: 127.0.0.0/8, ::1
  • Link-local: 169.254.0.0/16, fe80::/10

Allowed schemes: https:// (default), optionally http://

Environment variable: NOTIFICATION_ALLOW_PRIVATE_URLS=false (default)
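A minimal Go sketch of this validation for IP-literal URLs (illustrative only; the real service also resolves hostnames, checks every returned address, and honors NOTIFICATION_ALLOW_PRIVATE_URLS):

```go
package main

import (
	"fmt"
	"net"
	"net/url"
)

// blockedIP reports whether ip falls in the ranges listed above:
// RFC1918 private, loopback, or link-local.
func blockedIP(ip net.IP) bool {
	return ip.IsPrivate() || ip.IsLoopback() ||
		ip.IsLinkLocalUnicast() || ip.IsLinkLocalMulticast()
}

// validateWebhookURL enforces the scheme policy and, for IP-literal
// hosts, the blocked-range policy. Hostname resolution is omitted here.
func validateWebhookURL(raw string, allowHTTP bool) error {
	u, err := url.Parse(raw)
	if err != nil {
		return err
	}
	if u.Scheme != "https" && !(allowHTTP && u.Scheme == "http") {
		return fmt.Errorf("scheme %q not allowed", u.Scheme)
	}
	if ip := net.ParseIP(u.Hostname()); ip != nil && blockedIP(ip) {
		return fmt.Errorf("address %s is in a blocked range", ip)
	}
	return nil
}

func main() {
	fmt.Println(validateWebhookURL("https://192.168.1.10/hook", false))
	fmt.Println(validateWebhookURL("https://203.0.113.7/hook", false))
}
```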

9.9 Housekeeping

The notifier service runs periodic cleanup tasks:

| Task                    | Interval | Description                                        |
|-------------------------|----------|----------------------------------------------------|
| Expire ack tokens       | 1h       | Delete expired tokens from notification_ack_tokens |
| Cleanup expired pending | 1h       | Move stuck deliveries to dead_letter after 24h     |
| Purge old succeeded     | daily    | Delete succeeded deliveries older than 30 days     |

9.10 API Endpoints

Notification Channels:

| Method | Path                               | Description               |
|--------|------------------------------------|---------------------------|
| GET    | /notification-channels             | List org’s channels       |
| POST   | /notification-channels             | Create channel            |
| GET    | /notification-channels/{id}        | Get channel               |
| PUT    | /notification-channels/{id}        | Update channel            |
| DELETE | /notification-channels/{id}        | Delete channel            |
| POST   | /notification-channels/{id}/test   | Test channel connectivity |
| POST   | /notification-channels/{id}/toggle | Toggle active state       |
| GET    | /notification-channels/{id}/stats  | Get delivery statistics   |

9.11 Channel Test Endpoint

The test endpoint (POST /notification-channels/{id}/test) performs comprehensive validation and sends a test notification.

Test Sequence:

  1. Config Validation: Validate webhook URL and template syntax
  2. Connectivity Check: Verify URL is reachable (optional HEAD request)
  3. Sample Delivery: Send actual test payload to webhook
  4. Record Results: Store test timestamp and success/failure

Response Structure (NotificationChannelTestResult):

{
  "success": true,
  "tested_at": "2026-02-05T12:00:00Z",
  "idempotency_key": "test-550e8400-e29b-41d4-a716-446655440000",
  "error": null,
  "validation_errors": [],
  "connectivity_check": {
    "status": 200,
    "latency_ms": 150,
    "error": null
  },
  "sample_delivery": {
    "status": 200,
    "latency_ms": 450,
    "response_preview": "OK",
    "error": null
  }
}

Error Response Examples:

Validation failure:

{
  "success": false,
  "tested_at": "2026-02-05T12:00:00Z",
  "error": "Configuration validation failed",
  "validation_errors": ["Invalid webhook URL: private IP addresses not allowed"]
}

Delivery failure:

{
  "success": false,
  "tested_at": "2026-02-05T12:00:00Z",
  "error": "Test delivery failed",
  "sample_delivery": {
    "status": 400,
    "latency_ms": 250,
    "response_preview": "{\"message\":\"Invalid content type\"}",
    "error": "webhook returned 400"
  }
}

9.12 UI Error Handling Pattern

The ClientUI uses a standardized pattern for surfacing detailed API errors to users.

Error Flow:

┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   API       │───▶│  Services   │───▶│  Handler    │───▶│    UI       │
│  Response   │    │   Layer     │    │  Formatter  │    │  Redirect   │
└─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘
      │                   │                  │                  │
      │ Detailed JSON     │ Extract fields   │ Build message    │ URL-encoded
      │ with nested       │ (status, error,  │ with context     │ query param
      │ error objects     │ preview, etc.)   │ (HTTP codes,     │ (?error=...)
      │                   │                  │ response text)   │

Services Layer (internal/services/api_client.go):

The services layer extracts all error details from API responses:

type NotificationChannelTestResult struct {
    Success          bool
    ErrorMessage     *string
    TestedAt         *time.Time
    ValidationErrors []string

    // Connectivity check results
    ConnectivityStatus  *int
    ConnectivityError   *string
    ConnectivityLatency *int64

    // Sample delivery results
    DeliveryStatus   *int
    DeliveryError    *string
    DeliveryLatency  *int64
    ResponsePreview  *string
}

Handler Formatter (internal/handlers/notification_channels.go):

Handlers format user-friendly error messages from detailed results:

func formatTestErrorMessage(result *NotificationChannelTestResult) string {
    var parts []string

    // Check validation errors first
    if len(result.ValidationErrors) > 0 {
        parts = append(parts, "Validation errors: "+result.ValidationErrors[0])
    }

    // Check delivery issues (most common)
    if result.DeliveryStatus != nil && *result.DeliveryStatus >= 400 {
        msg := "Delivery failed with HTTP " + strconv.Itoa(*result.DeliveryStatus)
        if result.ResponsePreview != nil {
            msg += " - " + truncate(*result.ResponsePreview, 100)
        }
        parts = append(parts, msg)
    }

    // URL-encode for redirect
    return urlEncode(strings.Join(parts, "; "))
}
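The truncate and urlEncode helpers referenced above are not shown in this document; a plausible minimal implementation using the standard library (hypothetical, the actual helpers may differ):

```go
package main

import (
	"fmt"
	"net/url"
)

// truncate limits s to max runes, appending an ellipsis marker when the
// string is cut (prevents oversized redirect URLs).
func truncate(s string, max int) string {
	r := []rune(s)
	if len(r) <= max {
		return s
	}
	return string(r[:max]) + "..."
}

// urlEncode makes the message safe for use in an ?error=... query
// parameter.
func urlEncode(s string) string {
	return url.QueryEscape(s)
}

func main() {
	fmt.Println(urlEncode(truncate("Delivery failed with HTTP 400", 100)))
}
```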

UI Display:

Errors are passed via URL query parameters and displayed in the page template:

/notification-channels/1?error=Delivery+failed+with+HTTP+400+-+%7B%22message%22%3A%22Invalid+payload%22%7D

The template renders this as a styled error banner showing:

Delivery failed with HTTP 400 - {"message":"Invalid payload"}

Key Principles:

  1. Preserve Context: Pass HTTP status codes, response bodies, and specific error types through all layers
  2. Prioritize Actionable Info: Show validation errors first, then HTTP status, then generic errors
  3. Truncate for Safety: Limit response previews to prevent URL length issues
  4. URL-Safe Encoding: Properly encode error messages for query parameter use

Notification Rules:

| Method | Path                            | Description         |
|--------|---------------------------------|---------------------|
| GET    | /notification-rules             | List org’s rules    |
| POST   | /notification-rules             | Create rule         |
| GET    | /notification-rules/{id}        | Get rule            |
| PUT    | /notification-rules/{id}        | Update rule         |
| DELETE | /notification-rules/{id}        | Delete rule         |
| POST   | /notification-rules/{id}/toggle | Toggle active state |

Delivery History:

| Method | Path                                | Description                   |
|--------|-------------------------------------|-------------------------------|
| GET    | /notification-deliveries            | List deliveries (filterable)  |
| GET    | /notification-deliveries/dead       | List dead letters             |
| POST   | /notification-deliveries/{id}/retry | Retry a dead letter           |
| GET    | /alerts/{id}/deliveries             | Deliveries for specific alert |

9.13 Notifier Service

The notifier binary (service_id=7) is a standalone worker that:

  1. Polls for pending/failed deliveries ready for retry
  2. Claims deliveries using FOR UPDATE SKIP LOCKED (distributed locking)
  3. Sends webhooks to configured URLs
  4. Updates delivery status based on response
  5. Runs housekeeping tasks periodically

Scaling: Run multiple replicas for horizontal scaling. Each replica claims different deliveries without coordination.
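The claim step can be illustrated with a batched SKIP LOCKED update. This is a sketch only: the table and column names (notification_deliveries, next_retry_at, claimed_at) are assumptions about the schema, not the actual queries.

-- Claim a batch of deliveries; SKIP LOCKED makes concurrent replicas
-- pass over rows another replica has already locked.
UPDATE notification_deliveries d
SET status = 'in_progress', claimed_at = now()
WHERE d.id IN (
    SELECT id FROM notification_deliveries
    WHERE status = 'pending'
       OR (status = 'failed' AND next_retry_at <= now())
    ORDER BY created_at
    LIMIT 100
    FOR UPDATE SKIP LOCKED
)
RETURNING d.id;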

Configuration:

| Variable                      | Description                     | Default |
|-------------------------------|---------------------------------|---------|
| NOTIFICATION_BATCH_SIZE       | Deliveries to claim per poll    | 100     |
| NOTIFICATION_POLL_INTERVAL    | How often to check for work     | 5s      |
| NOTIFICATION_MAX_RETRIES      | Max attempts before dead letter | 12      |
| NOTIFICATION_ACK_TOKEN_EXPIRY | Token expiry duration           | 24h     |
| NOTIFICATION_BASE_URL         | Base URL for ack callbacks      | -       |

10. Multi-Tenancy Model

Files: pkg/api/middleware.go, pkg/api/middleware_auth.go, pkg/storage/queries/gather_jobs.sql, pkg/storage/queries/organization_members.sql

10.1 Authentication Methods

The system supports two authentication methods:

| Method       | Used By                      | How It Works                                            |
|--------------|------------------------------|---------------------------------------------------------|
| API Key      | Agents, machines, InternalUI | X-API-Key header or planekeeper_api_key cookie          |
| Supabase JWT | Human users (ClientUI)       | Authorization: Bearer header + X-Organization-Id header |

Dual Auth Middleware (pkg/api/middleware_auth.go):

The API server tries JWT auth first, then falls back to API key auth. When Supabase is not configured (no SUPABASE_JWT_SECRET), only API key auth is available.

Incoming Request
    │
    ├── Has "Authorization: Bearer" header?
    │   ├── YES → Validate JWT → Lookup user by supabase_id
    │   │         → Read X-Organization-Id → Verify membership → Allow
    │   └── NO  → Fall through
    │
    ├── Has "X-API-Key" header or cookie?
    │   ├── YES → Validate API key → Extract org_id → Allow
    │   └── NO  → 401 Unauthorized

10.2 Organization Scoping

All API requests are scoped to an organization:

  • API Key auth: Organization derived from api_keys.organization_id in the database
  • JWT auth: Organization specified by the X-Organization-Id header, validated against the user’s membership

organization_id stored in request context (Locals)
    │
    ▼
All queries filter by organization_id

10.3 Multi-Organization Membership

Users can belong to multiple organizations via the organization_members join table.

Schema:

| Table                | Key Columns                         | Purpose             |
|----------------------|-------------------------------------|---------------------|
| organization_members | user_id, organization_id, role      | Membership records  |
| organization_invites | email, organization_id, role, token | Pending invitations |

Roles (org_role enum):

| Role   | Capabilities                     |
|--------|----------------------------------|
| owner  | Full control, manage members     |
| admin  | Manage resources, invite members |
| member | Read/write org resources         |

Membership Flow:

User signs up (Supabase) → is_approved = FALSE → Pending approval page
    │
    ▼
Admin approves (SQL: UPDATE users SET is_approved = TRUE WHERE email = '...')
    │
    ▼
User logs out and back in → No memberships → Onboarding page
    │
    ├── Create new organization → Owner membership created
    │
    └── Accept pending invite → Member/admin membership created

10.8 Beta User Approval Gating

Files: pkg/storage/migration/sql/023_user_approval.sql, pkg/auth/middleware.go, internal/handlers/auth.go

New user signups are blocked until an administrator manually approves them. This is enforced via the is_approved column on the users table.

How it works:

| Scenario                       | is_approved       | Behavior                            |
|--------------------------------|-------------------|-------------------------------------|
| Existing users (pre-migration) | TRUE (default)    | Unaffected                          |
| System users                   | TRUE (default)    | Unaffected                          |
| New Supabase signups           | FALSE (explicit)  | Redirected to pending approval page |
| Admin-approved users           | TRUE (manual SQL) | Normal login flow                   |

Enforcement points (defense-in-depth):

  1. processLogin (primary): After finding/creating the user, checks is_approved. If false, redirects to /pending-approval instead of checking org membership.
  2. RequireOnboarded middleware (secondary): Checks session.IsApproved and redirects to /pending-approval. Prevents bypassing via direct URL access.

Session caching: The is_approved value is stored in the encrypted session cookie (SessionData.IsApproved). Users must log out and back in after being approved for the change to take effect.

Admin approval (SQL):

-- Approve a user
UPDATE users SET is_approved = TRUE WHERE email = 'user@example.com';

-- List unapproved users
SELECT id, email, created_at FROM users WHERE is_approved = FALSE AND is_system = FALSE;

10.4 Organization Switching (ClientUI)

Users with multiple org memberships can switch between them. The active organization is stored in an HTTP-only cookie (planekeeper_org).

POST /switch-org (form: org_id)
    │
    ▼
Validate user is a member of target org
    │
    ▼
Update planekeeper_org cookie
    │
    ▼
Redirect to dashboard (now showing new org's data)

The sidebar displays an organization dropdown when the user belongs to multiple orgs.

10.5 Scope Types

| Scope        | organization_id   | is_global | Use Case                  |
|--------------|-------------------|-----------|---------------------------|
| Organization | Valid (e.g., 123) | false     | Tenant-specific resources |
| Global       | NULL              | true      | Shared across all orgs    |

10.6 System API Keys

Identification: organization_id = 0 in auth context (NULL in database).

Capabilities:

  • Create global resources (is_global = true)
  • Access cross-organization data
  • Manage system settings

Creation Flow:

if isSystemKey {
    orgIDParam = pgtype.Int8{}  // NULL
    isGlobal = pgtype.Bool{Bool: true, Valid: true}
}

10.7 List Query Scopes

Most list endpoints support a scope query parameter:

| Scope         | Query Filter                                          |
|---------------|-------------------------------------------------------|
| org           | WHERE organization_id = @org_id AND is_global = FALSE |
| global        | WHERE is_global = TRUE                                |
| all (default) | WHERE organization_id = @org_id OR is_global = TRUE   |

11. Agent Communication

Files: pkg/agent/worker.go, pkg/api/agents_handler.go

Heartbeat Protocol

Endpoint: POST /heartbeat/{AgentUUID}

Request Body (optional):

{
  "capabilities": ["gather", "scrape", "helm_sync"],
  "available_credentials": ["github_token", "ssh_key"]
}

Response:

{
  "poll_interval_seconds": 30,
  "rate_limit_max_requests": 100,
  "rate_limit_window_seconds": 60
}

Server-Side:

  • Upserts agent into service_instances table
  • Stores metadata (build_date, capabilities, credentials) as JSON
  • Updates last_heartbeat timestamp

Capability Declaration

Agents declare supported job types during heartbeat:

| Capability | Job Type                  |
|------------|---------------------------|
| gather     | Fetch upstream releases   |
| scrape     | Extract deployed versions |
| helm_sync  | Discover Helm charts      |

Credential Declaration

Agents declare available credentials for credential-aware job assignment:

// Worker.GetAvailableCredentials()
availableCredentials := w.GetAvailableCredentials()
task, err := w.client.PollTaskWithCredentials(ctx, jobTypes, availableCredentials)

Orphan Cleanup

Service: OrphanCleanupService (runs every 2 minutes)

Logic:

  1. Find jobs claimed by agents not in service_instances
  2. Reset those jobs to pending status
  3. Additionally, reset jobs stuck in in_progress for more than 1 hour (stale detection)

-- ResetOrphanedGatherJobs
UPDATE gather_jobs
SET status = 'pending', claimed_by = NULL
WHERE claimed_by NOT IN (
    SELECT instance_uuid FROM service_instances WHERE service_id = 2
)

12. Metrics API

Files: pkg/api/metrics_handler.go, pkg/storage/queries/system_metrics.sql

Purpose

Expose system-wide operational metrics for monitoring and observability. The metrics endpoint provides a comprehensive view of system health across all organizations, including job status, alert state, service health, and task execution performance. Supports both JSON format for API consumers and Prometheus text format for integration with monitoring systems.

Endpoint

GET /api/v1/internal/metrics

Output Format

| Request              | Response Format              | Content-Type                             |
|----------------------|------------------------------|------------------------------------------|
| /metrics (default)   | Prometheus exposition format | text/plain; version=0.0.4; charset=utf-8 |
| /metrics?format=json | JSON                         | application/json                         |

Default: Prometheus text format for easy integration with Prometheus scrapers.

Use ?format=json query parameter for programmatic access via JSON.

Authentication

No authentication required. This endpoint is only exposed via the internal Traefik reverse proxy (port 8443), which is restricted to trusted IPs by the hosting provider’s firewall. Security is provided by network-level access control rather than API key authentication.

Metric Categories

Organization Metrics

| Metric               | Description                    |
|----------------------|--------------------------------|
| organizations.total  | Total number of organizations  |
| organizations.active | Number of active organizations |

Service Instance Metrics

| Metric                  | Description                                           |
|-------------------------|-------------------------------------------------------|
| services[].service_name | Name of the service (server, agent, taskengine, etc.) |
| services[].total        | Total instances of this service                       |
| services[].healthy      | Instances with heartbeat in last 5 minutes            |
| services[].unhealthy    | Instances without recent heartbeat                    |

Health Threshold: 5 minutes since last heartbeat.

Agent Metrics

| Metric           | Description                             |
|------------------|-----------------------------------------|
| agents.total     | Total registered agents (system-wide)   |
| agents.healthy   | Agents with heartbeat in last 5 minutes |
| agents.unhealthy | Agents without recent heartbeat         |

Job Status Metrics

| Metric         | Description                                                                 |
|----------------|-----------------------------------------------------------------------------|
| jobs.gather    | Count of gather jobs by status (pending, in_progress, completed, failed)    |
| jobs.scrape    | Count of scrape jobs by status (pending, in_progress, completed, failed)    |
| jobs.helm_sync | Count of helm sync jobs by status (pending, in_progress, completed, failed) |

Scope: Includes all jobs across all organizations.

Alert Metrics

| Metric                      | Description                               |
|-----------------------------|-------------------------------------------|
| alerts.total                | Total alert count (system-wide)           |
| alerts.unacknowledged       | Alerts requiring attention                |
| alerts.by_severity.critical | Critical severity alerts (unacknowledged) |
| alerts.by_severity.high     | High severity alerts (unacknowledged)     |
| alerts.by_severity.moderate | Moderate severity alerts (unacknowledged) |

Release Metrics

| Metric                    | Description                                 |
|---------------------------|---------------------------------------------|
| releases.total            | Total tracked upstream releases             |
| releases.stable           | Non-prerelease versions                     |
| releases.prerelease       | Prerelease versions (alpha, beta, rc, etc.) |
| releases.unique_artifacts | Distinct artifact names being tracked       |

Task Execution Metrics

| Metric                       | Description                            |
|------------------------------|----------------------------------------|
| task_executions.total        | Total task executions in last 24 hours |
| task_executions.completed    | Successful completions                 |
| task_executions.failed       | Failed executions                      |
| task_executions.in_progress  | Currently running                      |
| task_executions.success_rate | Completion rate (0-1)                  |

Time Window: Last 24 hours only.

API Key Metrics

| Metric          | Description               |
|-----------------|---------------------------|
| api_keys.total  | Total number of API keys  |
| api_keys.active | Number of active API keys |
| api_keys.system | Number of system API keys |

Prometheus Metric Names

Most metrics use the gauge type, following Prometheus naming conventions; the audit writer metrics noted below are counters. No org_id labels are used, since these are system-wide metrics:

| Prometheus Metric                                      | Labels          | Description                                               |
|--------------------------------------------------------|-----------------|-----------------------------------------------------------|
| planekeeper_organizations                              | -               | Total organizations                                       |
| planekeeper_organizations_active                       | -               | Active organizations                                      |
| planekeeper_service_instances                          | service, status | Service instances by type and health                      |
| planekeeper_agents                                     | -               | Total agents                                              |
| planekeeper_agents_healthy                             | -               | Healthy agents                                            |
| planekeeper_agents_unhealthy                           | -               | Unhealthy agents                                          |
| planekeeper_jobs                                       | type, status    | Jobs by type and status                                   |
| planekeeper_alerts                                     | -               | Total alerts                                              |
| planekeeper_alerts_unacknowledged                      | -               | Unacknowledged alerts                                     |
| planekeeper_alerts_by_severity                         | severity        | Alerts by severity                                        |
| planekeeper_releases                                   | -               | Total releases                                            |
| planekeeper_releases_stable                            | -               | Stable releases                                           |
| planekeeper_releases_prerelease                        | -               | Prerelease versions                                       |
| planekeeper_releases_unique_artifacts                  | -               | Unique artifacts                                          |
| planekeeper_task_executions_24h                        | -               | Task executions (24h)                                     |
| planekeeper_task_executions_24h_completed              | -               | Completed tasks (24h)                                     |
| planekeeper_task_executions_24h_failed                 | -               | Failed tasks (24h)                                        |
| planekeeper_task_executions_24h_in_progress            | -               | In-progress tasks                                         |
| planekeeper_task_executions_24h_success_rate           | -               | Success rate                                              |
| planekeeper_api_keys                                   | -               | Total API keys                                            |
| planekeeper_api_keys_active                            | -               | Active API keys                                           |
| planekeeper_api_keys_system                            | -               | System API keys                                           |
| planekeeper_audit_writer_events_written_total          | -               | Total audit trail entries written (counter)               |
| planekeeper_audit_writer_persist_errors_total          | -               | Total audit trail write failures (counter)                |
| planekeeper_audit_writer_insert_duration_seconds_total | -               | Cumulative audit trail insert duration in seconds (counter) |

Naming Convention: Metrics use gauge type without _total suffix (per Prometheus best practices - _total is reserved for counters). Exception: audit writer metrics use counter type with _total suffix since they track cumulative totals.

Prometheus Scrape Configuration

scrape_configs:
  - job_name: 'planekeeper'
    static_configs:
      - targets: ['localhost:8443']  # Internal Traefik only
    metrics_path: '/api/v1/internal/metrics'
    # No authentication required - endpoint is on internal network
    # Default output is Prometheus format, no params needed

Important: The metrics endpoint is only accessible via the internal Traefik (port 8443). Ensure this port is restricted to trusted IPs via your hosting provider’s firewall.

Testing the Endpoint

# Prometheus format (default)
curl https://localhost:8443/api/v1/internal/metrics

# JSON format
curl https://localhost:8443/api/v1/internal/metrics?format=json

Example Prometheus Output

# HELP planekeeper_organizations Total number of organizations
# TYPE planekeeper_organizations gauge
planekeeper_organizations 5

# HELP planekeeper_organizations_active Number of active organizations
# TYPE planekeeper_organizations_active gauge
planekeeper_organizations_active 4

# HELP planekeeper_service_instances Service instances by type and status
# TYPE planekeeper_service_instances gauge
planekeeper_service_instances{service="server",status="healthy"} 2
planekeeper_service_instances{service="agent",status="healthy"} 3
planekeeper_service_instances{service="agent",status="unhealthy"} 1
planekeeper_service_instances{service="taskengine",status="healthy"} 1

# HELP planekeeper_agents Total number of registered agents
# TYPE planekeeper_agents gauge
planekeeper_agents 4

# HELP planekeeper_agents_healthy Number of healthy agents with recent heartbeat
# TYPE planekeeper_agents_healthy gauge
planekeeper_agents_healthy 3

# HELP planekeeper_agents_unhealthy Number of unhealthy agents without recent heartbeat
# TYPE planekeeper_agents_unhealthy gauge
planekeeper_agents_unhealthy 1

# HELP planekeeper_jobs Jobs by type and status
# TYPE planekeeper_jobs gauge
planekeeper_jobs{type="gather",status="pending"} 5
planekeeper_jobs{type="gather",status="completed"} 150
planekeeper_jobs{type="scrape",status="pending"} 3
planekeeper_jobs{type="scrape",status="completed"} 200
planekeeper_jobs{type="helm_sync",status="completed"} 10

# HELP planekeeper_alerts Total number of alerts
# TYPE planekeeper_alerts gauge
planekeeper_alerts 50

# HELP planekeeper_alerts_unacknowledged Number of unacknowledged alerts
# TYPE planekeeper_alerts_unacknowledged gauge
planekeeper_alerts_unacknowledged 12

# HELP planekeeper_alerts_by_severity Unacknowledged alerts by severity level
# TYPE planekeeper_alerts_by_severity gauge
planekeeper_alerts_by_severity{severity="critical"} 2
planekeeper_alerts_by_severity{severity="high"} 5
planekeeper_alerts_by_severity{severity="moderate"} 5

# HELP planekeeper_releases Total number of tracked upstream releases
# TYPE planekeeper_releases gauge
planekeeper_releases 500

# HELP planekeeper_releases_stable Number of stable (non-prerelease) releases
# TYPE planekeeper_releases_stable gauge
planekeeper_releases_stable 450

# HELP planekeeper_releases_prerelease Number of prerelease versions
# TYPE planekeeper_releases_prerelease gauge
planekeeper_releases_prerelease 50

# HELP planekeeper_releases_unique_artifacts Number of unique artifact names being tracked
# TYPE planekeeper_releases_unique_artifacts gauge
planekeeper_releases_unique_artifacts 25

# HELP planekeeper_task_executions_24h Task executions in last 24 hours
# TYPE planekeeper_task_executions_24h gauge
planekeeper_task_executions_24h 500

# HELP planekeeper_task_executions_24h_completed Completed task executions in last 24 hours
# TYPE planekeeper_task_executions_24h_completed gauge
planekeeper_task_executions_24h_completed 485

# HELP planekeeper_task_executions_24h_failed Failed task executions in last 24 hours
# TYPE planekeeper_task_executions_24h_failed gauge
planekeeper_task_executions_24h_failed 10

# HELP planekeeper_task_executions_24h_in_progress Task executions currently in progress
# TYPE planekeeper_task_executions_24h_in_progress gauge
planekeeper_task_executions_24h_in_progress 5

# HELP planekeeper_task_executions_24h_success_rate Task execution success rate (0-1)
# TYPE planekeeper_task_executions_24h_success_rate gauge
planekeeper_task_executions_24h_success_rate 0.9700

# HELP planekeeper_api_keys Total number of API keys
# TYPE planekeeper_api_keys gauge
planekeeper_api_keys 10

# HELP planekeeper_api_keys_active Number of active API keys
# TYPE planekeeper_api_keys_active gauge
planekeeper_api_keys_active 8

# HELP planekeeper_api_keys_system Number of system API keys
# TYPE planekeeper_api_keys_system gauge
planekeeper_api_keys_system 2

Example JSON Response

{
  "collected_at": "2026-02-04T12:00:00Z",
  "organizations": {
    "total": 5,
    "active": 4
  },
  "services": [
    {
      "service_name": "server",
      "total": 2,
      "healthy": 2,
      "unhealthy": 0
    },
    {
      "service_name": "agent",
      "total": 4,
      "healthy": 3,
      "unhealthy": 1
    },
    {
      "service_name": "taskengine",
      "total": 1,
      "healthy": 1,
      "unhealthy": 0
    }
  ],
  "agents": {
    "total": 4,
    "healthy": 3,
    "unhealthy": 1
  },
  "jobs": {
    "gather": {
      "pending": 5,
      "in_progress": 2,
      "completed": 150,
      "failed": 3
    },
    "scrape": {
      "pending": 3,
      "in_progress": 1,
      "completed": 200,
      "failed": 2
    },
    "helm_sync": {
      "pending": 0,
      "in_progress": 0,
      "completed": 10,
      "failed": 0
    }
  },
  "alerts": {
    "total": 50,
    "unacknowledged": 12,
    "by_severity": {
      "critical": 2,
      "high": 5,
      "moderate": 5
    }
  },
  "releases": {
    "total": 500,
    "stable": 450,
    "prerelease": 50,
    "unique_artifacts": 25
  },
  "task_executions": {
    "total": 500,
    "completed": 485,
    "failed": 10,
    "in_progress": 5,
    "success_rate": 0.97
  },
  "api_keys": {
    "total": 10,
    "active": 8,
    "system": 2
  }
}

Error Handling

| Status | Condition              |
|--------|------------------------|
| 200    | Success                |
| 500    | Database query failure |

Security

The metrics endpoint is unauthenticated but secured through network isolation:

  1. Internal Traefik Only: The endpoint is registered on the internal API path (/api/v1/internal/metrics) and only exposed via the internal Traefik reverse proxy on port 8443, which is restricted to trusted IPs by the hosting provider’s firewall.

  2. Not Publicly Routed: The public Traefik (dynamic-public.yml) does not include routing rules for /api/v1/internal/metrics.

  3. Access Methods: To access the metrics endpoint:

    • Direct (with firewall rule): curl https://<server-ip>:8443/api/v1/internal/metrics
    • From the same host: curl https://localhost:8443/api/v1/internal/metrics
    • Via SSH tunnel: ssh -L 8443:localhost:8443 user@server then curl https://localhost:8443/api/v1/internal/metrics
    • Via VPN: Route traffic to the internal port

Side Effects

None - this is a read-only endpoint.

Additional URLs

  • https://o11y.tools/metricslint/
  • https://github.com/prometheus/OpenMetrics/blob/main/specification/OpenMetrics.md

13. Developer Tools

API Documentation (Swagger UI)

The server hosts interactive API documentation interfaces using Swagger UI, with separate documentation for client and internal APIs.

Endpoints:

| Endpoint                        | Description                                    |
|---------------------------------|------------------------------------------------|
| /api/v1/swagger                 | Client API Swagger UI (org-scoped endpoints)   |
| /api/v1/internal/swagger        | Internal API Swagger UI (system/agent endpoints) |
| /api/spec/openapi-client.yaml   | Client API OpenAPI specification               |
| /api/spec/openapi-internal.yaml | Internal API OpenAPI specification             |
| /api/spec/openapi-shared.yaml   | Shared paths referenced by both specs          |
| /api/spec/openapi.yaml          | Combined specification (for codegen)           |
| /api/spec/components/*          | Shared component schema files                  |

API Separation:

| API      | Base Path        | Purpose                       | Endpoints                                              |
|----------|------------------|-------------------------------|--------------------------------------------------------|
| Client   | /api/v1/client   | Organization-scoped operations | Jobs, releases, rules, alerts, dropdowns, org settings |
| Internal | /api/v1/internal | System/agent operations       | Heartbeat, tasks, metrics, global settings             |

Client API includes:

  • Job management (gather, scrape, helm-sync)
  • Releases and versions viewing
  • Rules and alert management
  • Alert configurations
  • Dropdown data for UI forms
  • Organization-specific settings overrides

Internal API includes:

  • Agent heartbeat registration
  • Task polling and completion (for agents)
  • Prometheus metrics endpoint
  • Global settings management

Shared Paths (available in both APIs):

  • /gather-jobs/*, /scrape-jobs/*, /helm-sync-jobs/*
  • /releases/*, /versions
  • /settings (GET only in client, GET+PUT in internal)
  • /agents, /validate/regex

Usage:

  1. Start the server: go run ./app/server
  2. Navigate to:
    • http://localhost:3000/api/v1/swagger for client API docs
    • http://localhost:3000/api/v1/internal/swagger for internal API docs
  3. Click “Authorize” and enter your API key (pk_<id>_<secret>)
  4. Use “Try it out” on any endpoint to execute requests

Implementation (app/server/main.go):

// Serve API specs for Swagger UI
app.Static("/api/spec", "./api")

// Client API Swagger UI at /api/v1/swagger
app.Static("/api/v1/swagger", "./internal/static/swagger-client")

// Internal API Swagger UI at /api/v1/internal/swagger
app.Static("/api/v1/internal/swagger", "./internal/static/swagger-internal")

OpenAPI Spec Structure:

api/
├── openapi.yaml              # Master spec for codegen (references all others)
├── openapi-shared.yaml       # Shared paths (no duplication)
├── openapi-client.yaml       # Client API (references shared + client-only)
├── openapi-internal.yaml     # Internal API (references shared + internal-only)
└── components/               # Shared schemas, parameters, responses
    ├── schemas.yaml
    ├── parameters.yaml
    ├── responses.yaml
    └── securitySchemes.yaml

Code Generation: The bazel run //toolchains/oapi-codegen target bundles openapi.yaml (which references all specs) to generate server handlers and client code. The split specs are used only for Swagger UI documentation.

Files:

  • internal/static/swagger-client/index.html - Client API Swagger UI
  • internal/static/swagger-internal/index.html - Internal API Swagger UI
  • api/openapi.yaml - Master specification for codegen
  • api/openapi-client.yaml - Client API specification
  • api/openapi-internal.yaml - Internal API specification
  • api/openapi-shared.yaml - Shared paths (single source of truth)
  • api/components/ - Shared schema, parameter, and response definitions

14. Admin UI

Files: internal/handlers/, internal/templates/, internal/services/api_client.go, internal/middleware/

Purpose

The Admin UI provides server-rendered HTML interfaces for managing all Planekeeper resources. Two UI binaries exist — clientui (organization-scoped, public-facing) and internalui (system-scoped, admin-only). Both consume the REST API through an HTTP client wrapper and render pages using templ templates with Tailwind CSS.

Architecture

┌──────────────┐     ┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│   Browser    │────▶│   Handler    │────▶│  API Client  │────▶│   REST API   │
│              │     │  (Fiber)     │     │  (services)  │     │   (Server)   │
└──────────────┘     └──────┬───────┘     └──────────────┘     └──────────────┘
                            │
                     ┌──────┴───────┐
                     │    templ     │
                     │  Templates   │
                     │  (pages +    │
                     │  components) │
                     └──────────────┘

The UI never accesses the database directly. All data flows through the REST API via internal/services/api_client.go.

14.1 Authentication

The two UI binaries use different authentication methods:

| UI         | Auth Method                                  | Cookie                                | Description                            |
|------------|----------------------------------------------|---------------------------------------|----------------------------------------|
| ClientUI   | Supabase Auth (preferred) or API Key (legacy) | planekeeper_session + planekeeper_org | Human users via email/password or OAuth |
| InternalUI | API Key only                                 | planekeeper_api_key                   | Admin users via system API key         |

When Supabase is not configured (SUPABASE_JWT_SECRET not set), ClientUI falls back to API key login — the same flow as InternalUI.

Supabase Auth (ClientUI)

Login/Signup Options:

  • Email/Password: Client-side Supabase JS SDK handles signInWithPassword (login) and signUp (registration), then exchanges tokens with the server via POST /auth/token-exchange
  • OAuth (auto-detected): OAuth buttons appear on both login and signup pages. On startup, ClientUI calls GET /auth/v1/settings to discover which providers are enabled in the Supabase project (e.g., GitHub, Google). Only enabled providers are shown. Supabase’s signInWithOAuth handles both sign-in and sign-up automatically — users who don’t have an account are created on first OAuth login.
  • Provider auto-detection: The OAuthProviders []string field on UIConfig is populated at startup and passed to both login and signup page templates. If the settings endpoint is unreachable, OAuth buttons are silently omitted.

Session Cookies:

  • planekeeper_session: AES-GCM encrypted blob containing access_token, refresh_token, expiry, user_id, email, supabase_id, is_approved. HTTP-only, 7-day expiry.
  • planekeeper_org: Active organization ID (plain int64). Validated against membership on every request.
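The session-cookie encryption described above can be sketched with Go's standard library. This is a minimal AES-GCM seal/open pair, not the actual implementation: the Session fields mirror the documented cookie contents, while the function names and the hard-coded key in main are illustrative.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"encoding/base64"
	"encoding/json"
	"fmt"
)

// Session mirrors the fields stored in the planekeeper_session cookie.
type Session struct {
	AccessToken  string `json:"access_token"`
	RefreshToken string `json:"refresh_token"`
	Expiry       int64  `json:"expiry"`
	UserID       int64  `json:"user_id"`
	Email        string `json:"email"`
	SupabaseID   string `json:"supabase_id"`
	IsApproved   bool   `json:"is_approved"`
}

// seal encrypts the session with AES-GCM and base64-encodes it as a cookie value.
func seal(key []byte, s Session) (string, error) {
	plain, err := json.Marshal(s)
	if err != nil {
		return "", err
	}
	block, err := aes.NewCipher(key)
	if err != nil {
		return "", err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return "", err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return "", err
	}
	// Prepend the nonce so open() can recover it.
	blob := gcm.Seal(nonce, nonce, plain, nil)
	return base64.RawURLEncoding.EncodeToString(blob), nil
}

// open decrypts a cookie value produced by seal.
func open(key []byte, value string) (Session, error) {
	var s Session
	blob, err := base64.RawURLEncoding.DecodeString(value)
	if err != nil {
		return s, err
	}
	block, err := aes.NewCipher(key)
	if err != nil {
		return s, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return s, err
	}
	if len(blob) < gcm.NonceSize() {
		return s, fmt.Errorf("cookie too short")
	}
	nonce, ciphertext := blob[:gcm.NonceSize()], blob[gcm.NonceSize():]
	plain, err := gcm.Open(nil, nonce, ciphertext, nil)
	if err != nil {
		return s, err
	}
	err = json.Unmarshal(plain, &s)
	return s, err
}

func main() {
	key := make([]byte, 32) // AES-256; the real key comes from server config
	cookie, _ := seal(key, Session{Email: "op@example.com", IsApproved: true})
	s, _ := open(key, cookie)
	fmt.Println(s.Email, s.IsApproved) // op@example.com true
}
```

Because the blob is authenticated (GCM), any tampering with the cookie value fails decryption rather than yielding a corrupted session.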

Login Flow (email/password):

POST /auth/login (email + password)
    │
    ▼
Server calls Supabase /auth/v1/token?grant_type=password
    │
    ▼
Find or create user in DB (by supabase_id or email)
    │  (new users created with is_approved = FALSE)
    │
    ▼
Set planekeeper_session cookie (encrypted, includes is_approved)
    │
    ├── User approved (is_approved = TRUE)?
    │   ├── NO  → Redirect to /pending-approval
    │   └── YES ↓
    │
    ├── User has org memberships?
    │   ├── YES → Set planekeeper_org cookie → Redirect to /dashboard
    │   └── NO  → Redirect to /onboarding

OAuth Flow (client-side, via Supabase JS SDK):

User clicks OAuth button on /login or /auth/signup
    │
    ▼
Supabase JS SDK calls signInWithOAuth({ provider, redirectTo })
    │
    ▼
Browser redirects to Supabase /auth/v1/authorize → provider (e.g., GitHub)
    │
    ▼
User authenticates with provider
    │
    ▼
Provider redirects back to Supabase → Supabase redirects to AUTH_CALLBACK_URL
    │
    ▼
GET /auth/callback (tokens in URL hash)
    │
    ▼
Callback page JS extracts tokens → POST /auth/token-exchange
    │
    ▼
Server validates JWT, finds/creates user → Set session → Dashboard or onboarding

Onboarding (first login, no org memberships):

GET /onboarding
    │
    ├── Check pending invites by email
    │
    ▼
Show "Create Organization" form + pending invite list
    │
    ├── POST /onboarding/create-org → Create org + owner membership → Dashboard
    └── POST /onboarding/accept-invite/:token → Create membership → Dashboard

Token Refresh: The auth middleware transparently refreshes expired JWTs using the stored refresh token. Updated tokens are re-encrypted and written back to the session cookie.

Middleware (pkg/auth/middleware.go):

  • RequireAuth: Validates session cookie, refreshes expired JWT, sets context locals (user_id, email, supabase_id)
  • RequireOnboarded: Checks user is approved (redirects to /pending-approval if not), then checks org membership and active org cookie (redirects to /onboarding if not)

API Key Auth (InternalUI + Legacy ClientUI)

Login Flow:

  1. User navigates to /login (unauthenticated)
  2. Enters API key in form
  3. Handler validates key, stores in HTTP-only cookie planekeeper_api_key (24-hour expiry)
  4. All subsequent requests include cookie automatically

Middleware (internal/middleware/api_key.go):

  • Checks X-API-Key header first, then planekeeper_api_key cookie
  • On failure: redirects browser requests to login page, returns 401 for API requests
  • ClientUI requires organization-scoped keys
  • InternalUI requires system-scoped keys (organization_id = 0)

API Client Construction

All handlers use a shared helper to construct API clients from the current auth context:

getAPIClient(c, cfg)
    │
    ├── Has Supabase session? → NewAPIClientWithJWT(accessToken, orgID)
    │   (uses Authorization: Bearer + X-Organization-Id headers)
    │
    └── Has API key? → NewAPIClient(apiKey)
        (uses X-API-Key header)

This dual-path helper allows all handlers to work identically regardless of auth method.

14.2 Navigation Structure

Client UI (organization-scoped resources):

| Section | Page | Route |
|---|---|---|
| Overview | Dashboard | /dashboard |
| Jobs | Gather Jobs | /jobs |
| Jobs | Scrape Jobs | /scrape-jobs |
| Data | Releases | /releases |
| Rules | Monitoring Rules | /rules |
| Rules | Alert Configs | /alert-configs |
| Monitoring | Alerts | /alerts |
| Notifications | Channels | /notification-channels |
| Notifications | Rules | /notification-rules |
| Notifications | Settings | /notification-settings |
| Notifications | Deliveries | /notification-deliveries |

Internal UI (global/system resources):

| Section | Page | Route |
|---|---|---|
| Jobs | Gather Jobs | /jobs |
| System | Agents | /agents |
| System | Settings | /settings |

14.3 Page Patterns

Every list page follows a consistent pattern:

┌─────────────────────────────────────────┐
│  Title                    [Create] btn  │
├─────────────────────────────────────────┤
│  Success/Error banner (from query param)│
├─────────────────────────────────────────┤
│  Inline create form (if ?new=true)      │
├─────────────────────────────────────────┤
│  ┌─────────────────────────────────┐    │
│  │  Table with headers             │    │
│  │  Row 1  |  Row 2  |  Actions   │    │
│  │  ...                            │    │
│  │  "No items found" (if empty)    │    │
│  ├─────────────────────────────────┤    │
│  │  Pagination (Prev | Next)       │    │
│  └─────────────────────────────────┘    │
└─────────────────────────────────────────┘

Every detail/edit page follows:

┌─────────────────────────────────────────┐
│  ← Back to List         Page Title      │
├─────────────────────────────────────────┤
│  Success/Error banner                   │
├─────────────────────────────────────────┤
│  Edit form with fields                  │
│  [Cancel]  [Save Changes]              │
├─────────────────────────────────────────┤
│  Danger Zone: [Delete]                  │
├─────────────────────────────────────────┤
│  Created: YYYY-MM-DD | Updated: ...     │
└─────────────────────────────────────────┘

14.4 Form Handling Flow

Standard CRUD Flow:

GET /resource?new=true          → Render list page with inline create form
POST /resource                  → Parse form, call API, redirect with ?success= or ?error=
GET /resource/{id}              → Render detail/edit page
POST /resource/{id}             → Parse form, call API, redirect with result
POST /resource/{id}/delete      → Call API, redirect to list with result
POST /resource/{id}/toggle      → Toggle active state (HTMX or redirect)

Error Message Flow:

API Response (JSON)  →  Services Layer (extract fields)  →  Handler (format message)  →  URL redirect (?error=encoded_msg)

Errors are passed as URL query parameters for stateless handling across redirects:

/notification-channels/1?error=Delivery+failed+with+HTTP+400
/rules?success=Rule+created+successfully

Error Sanitization (internal/handlers/form_helpers.go):

  • Internal errors (connection refused, context deadline, database constraint violations) are mapped to generic user-friendly messages
  • Message length capped at 100 characters
  • Prevents leaking sensitive database/system details to the UI

Graceful Degradation: When API calls fail, pages render with empty data rather than showing error pages. The dashboard, for example, renders empty metrics if GetDashboardStats fails.

14.5 HTMX Integration

Several pages use HTMX for partial page updates without full reloads:

| Feature | Pattern | Pages |
|---|---|---|
| Active toggle | hx-post to toggle endpoint, hx-target replaces row, hx-swap="outerHTML" | Rules, alert configs, notification channels, notification rules |
| Settings edit | hx-get loads inline edit form, hx-put saves, hx-target replaces cell | Settings |
| Alert acknowledge | hx-post to acknowledge endpoint, replaces alert row | Alerts |

Toggle Handler Logic:

if request has "HX-Request" header:
    → render and return updated table row only (outerHTML swap)
else:
    → redirect to list page with success message

14.6 Scope Filtering

The dashboard and jobs pages support multi-tenant scope filtering via a dropdown:

| Scope | Label | Behavior |
|---|---|---|
| org | Organization | Shows only the current organization's resources |
| global | Global | Shows only globally-shared resources |
| all | All | Shows both organization and global resources |

Default Scopes:

  • Dashboard defaults to org (most relevant view for operators)
  • Jobs page defaults to org (consistent with dashboard; operators can switch to “all” to see global jobs)

The scope parameter is passed through to the API’s list endpoints which apply the corresponding SQL filter (see Section 10: Multi-Tenancy Model).

14.7 Pagination

List pages support offset-based pagination via query parameters:

| Parameter | Default | Validation |
|---|---|---|
| limit | 50 | Must be 1-100 |
| offset | 0 | Must be ≥ 0 |

The pagination component renders Previous/Next links and a “Showing X-Y” counter. It determines whether a “Next” link is needed by checking if itemCount == limit (indicating more items may exist).
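The defaults, validation, and next-link check can be sketched together (the function name and return shape are illustrative):

```go
package main

import "fmt"

// pageWindow applies the documented defaults and validation (limit defaults to
// 50 and must be 1-100; offset must be >= 0) and reports whether a "Next" link
// should render: itemCount == limit means more items may exist.
func pageWindow(limit, offset, itemCount int) (lo, hi int, hasNext bool) {
	if limit < 1 || limit > 100 {
		limit = 50
	}
	if offset < 0 {
		offset = 0
	}
	// Bounds for the "Showing X-Y" counter.
	lo, hi = offset+1, offset+itemCount
	return lo, hi, itemCount == limit
}

func main() {
	lo, hi, next := pageWindow(50, 100, 50)
	fmt.Println(lo, hi, next) // 101 150 true
}
```

Note the heuristic can show a "Next" link for a page that happens to be exactly full; the link then leads to an empty page rather than requiring a count query up front.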

14.8 Shared Component Library

Reusable templ components in internal/templates/components/ enforce consistent UI patterns across all pages:

| Component | Purpose | Business Logic |
|---|---|---|
| ScopeFilter | Scope dropdown + optional search | Renders All/Organization/Global options, preserves current selection |
| ActiveToggle | HTMX toggle badge | Green "Active" / gray "Inactive", posts to toggle endpoint |
| ActionCell | Edit + Delete buttons | Edit link + delete form with confirmation dialog |
| EmptyRow | Empty state message | Spans configurable number of table columns |
| FormButtons | Cancel + Submit pair | Cancel links back to list, submit posts form |
| DetailPageHeader | Back link + title | Consistent navigation on detail pages |
| Timestamps | Created/Updated footer | Formats as YYYY-MM-DD HH:MM:SS |
| FormCard | Card wrapper with title | White card with close button, uses slot pattern |
| Pagination | Page navigation | Previous/Next with offset arithmetic |
| MetricCard | Dashboard stat card | Large number with label and icon |
| StatusBadge | Job status indicator | Color-coded: pending=amber, in_progress=blue, completed=green, failed=red |
| SeverityBadge | Alert severity indicator | critical=red, high=orange, moderate=yellow |
| RuleTypeBadge | Rule type indicator | days_behind=purple, majors_behind=indigo, minors_behind=cyan |
| HealthBadge | Agent health indicator | green "Healthy" / red "Unhealthy" |
| ChannelTypeBadge | Channel type indicator | webhook=purple, pagerduty=green, telegram=blue, smtp=yellow |
| DeliveryStatusBadge | Delivery status indicator | pending=yellow, succeeded=green, failed=red, dead_letter=dark |
| EventTypeBadge | Alert event indicator | created=blue, escalated=orange, acknowledged=green, resolved=purple |
| BoolBadge | Generic yes/no indicator | Configurable true/false labels with green/gray colors |
| ErrorMessage | Error banner | Red banner, renders only when message is non-empty |
| SuccessMessage | Success banner | Green banner, renders only when message is non-empty |

14.9 Page-Specific Business Logic

Dashboard: Aggregates job statistics (total, pending, completed) and system health into four metric cards. Displays recent gather jobs in a table sorted by creation date.

Gather Jobs: Form validation includes name, artifact name, source type, cron schedule, tag filter, and version regex. Supports “Run Now” to trigger immediate execution and “Clear Releases” to purge cached upstream data.

Scrape Jobs: Displays the latest version snapshot inline with each job row. Form includes regex validation endpoint. Version history page shows historical snapshots with configurable limit (1-20).

Releases: Two view modes — flat list and grouped by artifact. Supports filters for artifact name (autocomplete from known artifacts), version text, sort order (newest/oldest first), and prerelease inclusion toggle. Summary bar shows total count, unique artifacts, and stable release count.

Rules: Three rule types with tiered thresholds (moderate ≤ high ≤ critical). Threshold values are contextual — “days” for days_behind, “versions” for majors/minors_behind. “Evaluate All” button triggers rule evaluation across all active alert configs.

Alert Configs: Links three resources (scrape job + gather job + rule) into a monitoring configuration. Form dropdowns are populated dynamically from available resources. Displays rule type badge alongside rule name.

Alerts: Filters by acknowledgment status and severity. Summary panel shows counts by severity. Supports single and bulk acknowledgment. Table rows are color-coded by severity (red border for critical, orange for high, yellow for moderate).

Notification Channels: Channel detail page includes “Test Channel” button that sends a test webhook and displays detailed results (HTTP status, response preview, latency). Event-specific template editor with toggle to enable/disable, showing inherited global defaults as collapsible previews.

Notification Settings: Organization-level default channel selection. Per-category template management (new_alert, acknowledged, resolved) with “Reset to Global” option for organization overrides.

Settings: Combined view showing global defaults alongside organization overrides. Inline HTMX editing — click “Set Override” to enter edit mode, “Clear” to remove override and fall back to global default.


15. Open Questions & Ambiguities

1. Retry Exhaustion Recovery

Issue: When a job reaches max_attempts and enters failed status, there is no automatic recovery mechanism.

Current Behavior: Job remains in failed status indefinitely until manual intervention.

Possible Solutions:

  • Manual trigger via TriggerJob endpoint
  • Admin UI “retry” button
  • Automatic reset after configurable cooldown period

2. Token Expiry vs Stale Reset Overlap

Issue: Execution tokens expire after ~5 minutes (configurable), but stale job reset happens after 1 hour.

Scenario:

  1. Agent claims job, gets 5-minute token
  2. Agent crashes at minute 3
  3. Token expires at minute 5
  4. Job remains in_progress until minute 60 (stale reset)

Question: Should there be intermediate recovery (e.g., 15-minute token expiry detection)?

3. Global Jobs Organization Association

Issue: Global jobs (is_global = true) have organization_id = NULL, but releases created by these jobs use organization_id = 1 (Global org).

Implication: Query logic must account for both NULL and 1 when listing global releases.

4. Minors Behind Formula Discrepancy

Issue: The formula-based calculation for minors_behind may differ from release-list counting.

Example:

  • Formula: 6.11 → 8.1 = (8-6) + 1 = 3 minors behind
  • Release-list might show: 7.0, 7.1, 7.2, 8.0, 8.1 = 5 minors behind

Question: Should formula fallback be documented as “approximate” in alerts?
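The discrepancy can be reproduced with a small sketch. The formula function simply follows the worked numbers above; both functions are illustrations of the two counting methods, not the engine's exact code:

```go
package main

import "fmt"

type version struct{ major, minor int }

// minorsBehindFormula follows the worked example above:
// (latest.major - current.major) + latest.minor.
func minorsBehindFormula(current, latest version) int {
	return (latest.major - current.major) + latest.minor
}

// minorsBehindList counts releases in the list that are newer than current.
func minorsBehindList(current version, releases []version) int {
	n := 0
	for _, r := range releases {
		if r.major > current.major || (r.major == current.major && r.minor > current.minor) {
			n++
		}
	}
	return n
}

func main() {
	cur, latest := version{6, 11}, version{8, 1}
	releases := []version{{7, 0}, {7, 1}, {7, 2}, {8, 0}, {8, 1}}
	fmt.Println(minorsBehindFormula(cur, latest)) // 3
	fmt.Println(minorsBehindList(cur, releases))  // 5
}
```

The formula cannot know how many minor releases each intermediate major line shipped, so it systematically undercounts whenever those lines had more than zero minors.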

5. Alert History Retention

Current Behavior: Resolved alerts are soft-deleted and preserved indefinitely.

Potential Improvements:

  • Add configurable retention period (e.g., 90 days)
  • Add PurgeOldResolvedAlerts cleanup job to taskengine
  • Consider archival to separate table for very old alerts

Note: The PurgeOldResolvedAlerts query exists but is not currently called by any scheduled job.


16. Regression Test Recommendations

Gather Jobs Domain

| Test Case | Description |
|---|---|
| TestGatherJob_GitHub_RateLimitHandling | Verify proper error message and retry behavior on 403/429 |
| TestGatherJob_GitHub_Pagination | Ensure all 1000 releases fetched across 10 pages |
| TestGatherJob_Helm_LargeIndex | Verify 50MB limit enforced, graceful error |
| TestGatherJob_VersionRegex_CaptureGroup | Confirm capture group extraction vs full match |
| TestGatherJob_StateTransition_MaxAttempts | Verify pending→failed after max_attempts |
| TestGatherJob_Reschedule_CronExpression | Verify next_run_at calculation accuracy |
| TestGatherJob_OrphanRecovery | Jobs released when agent disconnects |

Scrape Jobs Domain

| Test Case | Description |
|---|---|
| TestScrapeJob_CredentialAssignment | Only agents with credential receive job |
| TestScrapeJob_Parser_YQ_ArrayIndexing | .dependencies[0].version works |
| TestScrapeJob_Parser_Regex_NoMatch | Graceful error when pattern doesn't match |
| TestScrapeJob_VersionTransform_All | All 5 transforms work correctly |
| TestScrapeJob_HistoryLimit_Cleanup | Old snapshots deleted when limit exceeded |
| TestScrapeJob_TriggerRuleEvaluation | Async rule evaluation triggered on success |

Task Execution System

| Test Case | Description |
|---|---|
| TestTask_SkipLocked_ConcurrentClaim | Two agents don't get same job |
| TestTask_TokenExpiry_ReturnsConflict | 409 returned for expired token |
| TestTask_IdempotentCompletion | Same token submitted twice returns 202 |
| TestTask_ResultProcessingFailure | Agent gets 202 even if processing fails |
| TestTask_CapabilityFiltering | Agent without capability doesn't receive job type |

Rules Engine

| Test Case | Description |
|---|---|
| TestRule_DaysBehind_VersionNotFound | CRITICAL with -1 behindBy |
| TestRule_MajorsBehind_VersionParseFail | CRITICAL on semver error |
| TestRule_MinorsBehind_ReleaseListVsFormula | Both methods produce reasonable results |
| TestRule_StableOnly_SkipsPrerelease | alpha/beta/rc excluded from latest |
| TestRule_ThresholdTiers_HighestWins | Critical returned when >= critical threshold |

Alert System

| Test Case | Description |
|---|---|
| TestAlert_OnePerConfig | Only one active alert per config allowed |
| TestAlert_UpdateInPlace | Version change updates existing alert |
| TestAlert_AckResetOnVersionChange | Ack cleared only when discovered version changes |
| TestAlert_AckPreservedOnSameVersion | Ack preserved when same version re-evaluated |
| TestAlert_SoftDelete_SetsResolvedAt | Resolution sets resolved_at, doesn't delete |
| TestAlert_SoftDelete_PreservesHistory | Resolved alerts accessible via /alerts/resolved |
| TestAlert_SoftDelete_NewAlertAfterResolve | New alert can be created after previous resolved |
| TestAlert_AutoTrigger_OnConfigCreate | Evaluation runs after config creation |
| TestAlert_AutoTrigger_OnScrapeSuccess | Evaluation runs after scrape completes |
| TestAlert_AutoTrigger_OnGatherSuccess | Evaluation runs after gather completes |
| TestAlert_Resolution_NotifiesWebhook | alert.resolved event dispatched on resolve |
| TestAlert_BulkAcknowledge_OnlyActive | Bulk ack only affects active alerts |
| TestAlert_ListResolved_Pagination | /alerts/resolved supports limit/offset |
| TestAlert_ListResolved_SeverityFilter | /alerts/resolved filters by severity |

Notification System

| Test Case | Description |
|---|---|
| TestNotification_RuleMatching_SeverityFilter | Rule with severity filter only matches specified severities |
| TestNotification_RuleMatching_EventFilter | Rule with event filter only matches specified events |
| TestNotification_RuleMatching_CatchAll | Empty filters match all severities and events |
| TestNotification_RuleMatching_Priority | Higher priority rules evaluated first |
| TestNotification_Dispatch_ChannelDedup | Same alert doesn't notify same channel twice |
| TestNotification_Dispatch_OrgDefault | Falls back to org default when no rule matches |
| TestNotification_Dispatch_RepeatInterval | Skips notification within repeat interval |
| TestNotification_Delivery_SkipLocked | Multiple notifiers don't claim same delivery |
| TestNotification_Delivery_StatusTransitions | Correct state machine: pending→in_progress→succeeded/failed |
| TestNotification_Retry_ExponentialBackoff | Delays increase with attempts |
| TestNotification_Retry_RetryAfterHeader | Honors 429 Retry-After header |
| TestNotification_Retry_DeadLetter | Moves to dead_letter after max attempts |
| TestNotification_Retry_NonRetryable4xx | 4xx (except 429) goes to dead_letter immediately |
| TestNotification_Webhook_HMACSignature | Correct HMAC-SHA256 signature in header |
| TestNotification_Webhook_IdempotencyKey | Same key across retries |
| TestNotification_Webhook_SSRFProtection | Private IPs blocked by default |
| TestNotification_Ack_TokenValidation | Token lookup, expiry, and single-use |
| TestNotification_Ack_AlertUpdate | Alert marked acknowledged via callback |
| TestNotification_Housekeeping_ExpireTokens | Expired tokens deleted |
| TestNotification_Housekeeping_CleanupDeliveries | Stuck deliveries moved to dead_letter |

Multi-Tenancy

| Test Case | Description |
|---|---|
| TestTenancy_OrgIsolation | Org A can't see Org B resources |
| TestTenancy_GlobalVisibility | All orgs see global resources |
| TestTenancy_SystemKeyCreatesGlobal | System key creates is_global=true |
| TestTenancy_ScopeParameter | org/global/all filters work correctly |

Agent Communication

| Test Case | Description |
|---|---|
| TestAgent_Heartbeat_UpdatesLastSeen | Timestamp updated on heartbeat |
| TestAgent_Heartbeat_StoresCapabilities | Metadata includes capabilities JSON |
| TestAgent_OrphanCleanup_DisconnectedAgent | Jobs reset when agent removed |
| TestAgent_OrphanCleanup_StartupDelay | No cleanup in first 30 seconds |

Metrics API

The Metrics API should comply with the OpenMetrics standard. See: https://github.com/prometheus/OpenMetrics/blob/main/specification/OpenMetrics.md

| Test Case | Description |
|---|---|
| TestMetrics_ContentNegotiation_JSON | Default Accept header returns JSON |
| TestMetrics_ContentNegotiation_Prometheus | Accept: text/plain returns Prometheus format |
| TestMetrics_NoAuthRequired | Endpoint accessible without API key |
| TestMetrics_SystemWideOrganizationCounts | Returns total and active org counts |
| TestMetrics_ServiceInstancesByType | Returns service instances grouped by type |
| TestMetrics_JobCountsAllTypes | Includes gather, scrape, and helm_sync jobs |
| TestMetrics_AgentHealth_5MinuteThreshold | Agents without heartbeat in 5 min marked unhealthy |
| TestMetrics_TaskExecutions_24HourWindow | Only includes executions from last 24 hours |
| TestMetrics_TaskExecutions_SuccessRate | Calculates success rate correctly |
| TestMetrics_APIKeyCounts | Returns total, active, and system key counts |
| TestMetrics_PrometheusFormat_NoOrgLabels | Prometheus output has no org_id labels |
| TestMetrics_PrometheusFormat_HelpAndType | All metrics have HELP and TYPE declarations |

Admin UI

| Test Case | Description |
|---|---|
| TestUI_Login_ValidKey | Valid API key sets cookie and redirects to dashboard |
| TestUI_Login_InvalidKey | Invalid key shows error on login page |
| TestUI_Logout_ClearsCookie | Logout clears auth cookie and redirects to login |
| TestUI_Dashboard_DefaultScope | Dashboard defaults to org scope |
| TestUI_Dashboard_ScopeFilter | Scope parameter filters jobs correctly |
| TestUI_Dashboard_GracefulDegradation | Dashboard renders with empty data when API fails |
| TestUI_ListPage_Pagination | Limit/offset query params paginate results |
| TestUI_ListPage_EmptyState | Empty table shows "No items found" message |
| TestUI_CreateForm_Validation | Missing required fields redirect with error message |
| TestUI_CreateForm_Success | Valid form redirects with success message |
| TestUI_Toggle_HTMX | Toggle with HX-Request header returns updated row only |
| TestUI_Toggle_NonHTMX | Toggle without HTMX redirects to list page |
| TestUI_Delete_Confirmation | Delete form includes confirmation dialog |
| TestUI_ErrorSanitization | Internal errors mapped to generic user-friendly messages |

Appendix: Key SQL Functions Reference

Gather Jobs

| Function | Purpose |
|---|---|
| CreateGatherJob | Insert new job with defaults |
| ClaimNextPendingGatherJob | Atomic claim with SKIP LOCKED |
| CompleteGatherJob | Mark successful completion |
| FailGatherJob | Increment attempts, possibly transition to failed |
| RescheduleGatherJob | Calculate and set next_run_at |
| ResetStaleGatherJobs | Reset jobs in_progress > 1 hour |
| ResetOrphanedGatherJobs | Reset jobs claimed by dead agents |

Scrape Jobs

| Function | Purpose |
|---|---|
| CreateScrapeJob | Insert new job with defaults |
| ClaimNextPendingScrapeJobWithCredentials | Credential-aware atomic claim |
| CompleteScrapeJob | Mark successful completion |
| FailScrapeJob | Increment attempts, possibly transition to failed |
| RescheduleScrapeJob | Calculate and set next_run_at |

Version Snapshots

| Function | Purpose |
|---|---|
| CreateVersionSnapshot | Insert new snapshot (no ON CONFLICT - full history) |
| GetVersionSnapshot | Get snapshot by ID |
| GetVersionSnapshotByScrapeJob | Get latest snapshot for job (ORDER BY id DESC) |
| GetLatestVersionSnapshot | Get latest version for rules evaluation (ORDER BY id DESC) |
| ListVersionHistoryByScrapeJob | List version history with limit |
| CountVersionSnapshotsByScrapeJob | Count snapshots for history display |
| DeleteOldVersionSnapshots | Purge snapshots beyond retention limit |

Note: All “latest” queries use ORDER BY id DESC instead of ORDER BY discovered_at DESC to guarantee insertion order regardless of timestamp precision.

Alerts

| Function | Purpose |
|---|---|
| UpsertAlert | Create or update active alert (one per config) |
| GetAlertByID | Get alert with joined config/rule data |
| GetAlertByConfigID | Get active alert for specific config |
| ListAlertsByOrganization | List active alerts with filters |
| CountAlertsByOrganization | Count active alerts with filters |
| ListResolvedAlerts | List resolved (historical) alerts |
| CountResolvedAlerts | Count resolved alerts |
| AcknowledgeAlert | Mark active alert as acknowledged |
| BulkAcknowledgeAlerts | Batch acknowledge by ID array |
| UnacknowledgeAlert | Clear acknowledgement on active alert |
| ResolveAlert | Soft delete: set resolved_at timestamp |
| ResolveAlertsForConfig | Soft delete all active alerts for config |
| GetAlertSummary | Count active alerts by severity and ack status |
| PurgeOldResolvedAlerts | Hard delete resolved alerts older than N days |

Notification Channels

| Function | Purpose |
|---|---|
| CreateNotificationChannel | Insert new channel with config |
| GetNotificationChannel | Get channel by ID with org check |
| ListNotificationChannels | List channels for organization |
| UpdateNotificationChannel | Update channel settings |
| DeleteNotificationChannel | Remove channel |
| ToggleNotificationChannel | Toggle active state |

Notification Rules

| Function | Purpose |
|---|---|
| CreateNotificationRule | Insert new rule with filters |
| GetNotificationRule | Get rule by ID with org check |
| ListNotificationRules | List rules for organization (ordered by priority) |
| UpdateNotificationRule | Update rule settings |
| DeleteNotificationRule | Remove rule |
| ToggleNotificationRule | Toggle active state |
| GetMatchingRulesForAlert | Get rules matching severity and event type |

Notification Deliveries

| Function | Purpose |
|---|---|
| CreateNotificationDelivery | Insert new delivery record |
| ClaimPendingDeliveries | Atomic claim with SKIP LOCKED |
| MarkDeliverySucceeded | Update status to succeeded |
| MarkDeliveryFailed | Update status and schedule retry |
| ListDeliveriesByOrganization | List deliveries with filters |
| ListDeliveriesByAlert | List deliveries for specific alert |
| ListDeadLetters | List dead letter deliveries |
| ResetDeliveryForRetry | Reset dead letter for retry |
| RecentSuccessInGroup | Check for recent success in group interval |

Notification Ack Tokens

| Function | Purpose |
|---|---|
| CreateAckToken | Insert new acknowledgment token |
| GetAckTokenByToken | Lookup token by value |
| MarkAckTokenUsed | Mark token as used |
| CleanupExpiredAckTokens | Delete expired tokens |

System Metrics

| Function | Purpose |
|---|---|
| GetSystemMetricsOrganizationCounts | Total and active organization counts |
| GetSystemMetricsServiceInstances | Service instances by type with health status |
| GetSystemMetricsAgentCounts | Agent counts with health status (system-wide) |
| GetSystemMetricsGatherJobCounts | Gather job counts by status (system-wide) |
| GetSystemMetricsScrapeJobCounts | Scrape job counts by status (system-wide) |
| GetSystemMetricsHelmSyncJobCounts | Helm sync job counts by status (system-wide) |
| GetSystemMetricsAlertSummary | Alert counts with severity breakdown (system-wide) |
| GetSystemMetricsReleaseSummary | Release counts (system-wide) |
| GetSystemMetricsTaskExecutions | Task execution stats for last 24 hours (system-wide) |
| GetSystemMetricsAPIKeyCounts | API key counts (total, active, system) |

Document generated: 2026-02-05 Source: Planekeeper codebase analysis