Planekeeper Business Logic Documentation
A comprehensive guide to the business logic, workflows, and domain rules that power Planekeeper’s automated software version staleness detection system.
Table of Contents
- Executive Summary
- Gather Jobs Domain
- Scrape Jobs Domain
- Helm Sync Jobs Domain
- Task Execution System
- Background Scheduler
- Rules Engine
- Alert System
- Notification System
- Multi-Tenancy Model
- Agent Communication
- Metrics API
- Developer Tools
- Admin UI
- Open Questions & Ambiguities
- Regression Test Recommendations
1. Executive Summary
System Purpose
Planekeeper is an automated software version staleness detection system. It monitors deployed software versions against upstream releases to identify when software falls behind, applying configurable rules to generate severity-graded alerts.
Architecture Triad
┌─────────────────────────────────────────────────────────────────┐
│ SERVER │
│ ┌──────────┐ ┌──────────┐ ┌───────────┐ ┌────────────────┐ │
│ │ REST API │ │ Admin UI │ │ Heartbeat │ │ Orphan Cleanup │ │
│ │ (Fiber) │ │ (templ) │ │ Service │ │ Service │ │
│ └────┬─────┘ └────┬─────┘ └─────┬─────┘ └───────┬────────┘ │
│ └──────────────┴─────────────┴────────────────┘ │
│ │ │
│ ┌────────┴────────┐ │
│ │ PostgreSQL │ │
│ │ (sqlc/goose) │ │
│ └────────┬────────┘ │
└────────────────────────────┼────────────────────────────────────┘
│
┌────────────────────┼────────────────────┐
│ │ │
▼ ▼ ▼
┌───────────────┐ ┌───────────────┐ ┌───────────────┐
│ AGENT 1 │ │ TASKENGINE │ │ AGENT N │
│ ┌───────────┐ │ │ ┌───────────┐ │ │ ┌───────────┐ │
│ │ Gatherer │ │ │ │ Scheduler │ │ │ │ Scraper │ │
│ │ (GitHub) │ │ │ │ (cron) │ │ │ │ (git+YQ) │ │
│ └───────────┘ │ │ ├───────────┤ │ │ └───────────┘ │
│ ┌───────────┐ │ │ │ Processor │ │ └───────────────┘
│ │ Scraper │ │ │ │ (results) │ │
│ └───────────┘ │ │ └───────────┘ │
└───────────────┘ └───────────────┘
Server: Hosts REST API, Admin UI, runs database migrations, manages heartbeat detection and orphan cleanup.
Agent: Polls for tasks, executes gather (fetch upstream releases) and scrape (extract deployed versions) jobs.
TaskEngine: Handles job scheduling, timeout detection, cron-based rescheduling, and result processing.
Core Workflow
┌──────────────┐ ┌───────────────┐ ┌──────────────┐ ┌─────────────┐
│ GATHER JOBS │────▶│ SCRAPE JOBS │────▶│ RULES ENGINE │────▶│ ALERTS │
│ │ │ │ │ │ │ │
│ Fetch latest │ │ Extract │ │ Calculate │ │ Create/ │
│ releases │ │ deployed │ │ behind-by │ │ update │
│ from GitHub/ │ │ versions │ │ values │ │ alerts │
│ Helm repos │ │ from repos │ │ │ │ │
└──────────────┘ └───────────────┘ └──────────────┘ └─────────────┘
│ │ │ │
▼ ▼ ▼ ▼
upstream_releases version_snapshots alert_configs alerts
(table) (table) (table) (table)
2. Gather Jobs Domain
Files: pkg/gatherer/github.go, pkg/gatherer/helm.go, pkg/gatherer/oci.go, pkg/gatherer/endoflife.go, pkg/api/gather_jobs_handler.go, pkg/storage/queries/gather_jobs.sql
Purpose
Fetch upstream releases from external sources to establish the “latest available version” baseline for staleness detection. ~173 global gather jobs are pre-seeded (migration 034) to provide immediate coverage of common infrastructure software.
Source Types
| Source Type | Example Artifact Name | Description |
|---|---|---|
| github_releases | github.com/kubernetes/kubernetes | GitHub releases via REST API |
| helm_repository | argoproj.github.io/argo-helm/argo-cd | Helm chart versions via index.yaml |
| oci_registry | docker.io/library/nginx | OCI container image tags via registry API |
| endoflife_date | endoflife.date/python | Product lifecycle data from endoflife.date |
Input Parameters
| Field | Required | Description |
|---|---|---|
| artifact_name | Yes | Source identifier (owner/repo for GitHub, repo-url/chart-name for Helm) |
| source_type | Yes | github_releases, helm_repository, oci_registry, or endoflife_date |
| name | No | Human-readable job name |
| schedule | No | Cron expression for recurring execution |
| tag_filter | No | Regex to filter which tags to include |
| version_regex | No | Regex with capture group to extract version from tag |
| source_config | No | JSONB for source-specific configuration |
| labels | No | JSONB key-value pairs for categorization (e.g., {"category": "kubernetes"}) |
| is_global | No | Create as global resource (system keys only) |
Business Rules
Tag Filtering:
- Optional regex pattern applied to release tags
- Non-matching tags are excluded from results
- Invalid regex patterns are logged as warnings and fall back to no filtering
Version Extraction:
- If version_regex has a capture group → use the first captured group
- If version_regex matches but has no capture group → use the full match
- Fallback: strip a leading v or V prefix (GitHub) or return as-is (Helm)
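The extraction precedence above can be sketched in Go. extractVersion is a hypothetical helper, not the gatherer's actual API, and treating a non-matching or invalid regex as "fall through to the prefix fallback" is an assumption:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// extractVersion applies the precedence described above: a capture group
// wins, then a bare full match, then the v/V-prefix fallback.
// (Illustrative name; not the real gatherer function.)
func extractVersion(tag, versionRegex string) string {
	if versionRegex != "" {
		re, err := regexp.Compile(versionRegex)
		if err == nil { // invalid patterns fall through to the fallback (assumption)
			if m := re.FindStringSubmatch(tag); m != nil {
				if len(m) > 1 {
					return m[1] // first capture group
				}
				return m[0] // full match when no capture group
			}
		}
	}
	// Fallback: strip a leading v or V prefix (GitHub-style tags).
	return strings.TrimPrefix(strings.TrimPrefix(tag, "v"), "V")
}

func main() {
	fmt.Println(extractVersion("release-1.2.3", `release-(\d+\.\d+\.\d+)`)) // 1.2.3
	fmt.Println(extractVersion("v1.2.3", ""))                               // 1.2.3
}
```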
Pagination Limits:
- GitHub: 100 releases per page, max 10 pages (1,000 releases total)
- Helm: Max 1,000 versions per chart, 50 MB index file limit
Prerelease Detection (Helm): Versions containing these patterns (case-insensitive) are marked as prereleases:
-alpha, -beta, -rc, -dev, -preview, -snapshot, .alpha, .beta, .rc, .dev, .preview, .snapshot
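A minimal sketch of this check, assuming the marker list above is exhaustive (isHelmPrerelease is an illustrative name, not the real function):

```go
package main

import (
	"fmt"
	"strings"
)

// isHelmPrerelease mirrors the pattern list above: any of these markers,
// preceded by "-" or ".", flags the version as a prerelease (case-insensitive).
func isHelmPrerelease(version string) bool {
	v := strings.ToLower(version)
	for _, marker := range []string{"alpha", "beta", "rc", "dev", "preview", "snapshot"} {
		if strings.Contains(v, "-"+marker) || strings.Contains(v, "."+marker) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(isHelmPrerelease("1.2.3-rc.1")) // true
	fmt.Println(isHelmPrerelease("1.2.3"))      // false
}
```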
State Machine
┌──────────────────────────────────────────────────────────┐
│ │
▼ │
┌─────────┐ ┌─────────┐ ┌────────┐ ┌───────────┐ ┌────────┐ │
│ CREATED │───▶│ PENDING │───▶│ QUEUED │───▶│IN_PROGRESS│───▶│COMPLETED│──────┘
└─────────┘ └─────────┘ └────────┘ └───────────┘ └────────┘ (reschedule
▲ ▲ │ if cron)
│ │ │
│ │ ┌──────────┴──────────────┐
│ │ │ │
│ │ ▼ ▼
│ │ attempts < max attempts >= max
│ │ │ │
│ │ │ ▼
│ └───┘ ┌────────┐
│ (retry with backoff) │ FAILED │
│ └────────┘
│
└── (reschedule / stale reset / orphan reset)
- pending: Job is scheduled for the future (next_run_at > now) or waiting for retry_after
- queued: Job is eligible for agent pickup (next_run_at <= now, retry_after passed)
- in_progress: Agent has claimed and is executing the job
Key Transitions:
| Transition | Trigger | SQL Function |
|---|---|---|
| → pending | Job created (future schedule) | CreateGatherJob |
| → queued | Job created (immediate run) | TriggerGatherJobNow |
| pending → queued | Schedule time reached | TransitionPendingToQueuedGatherJobs |
| queued → in_progress | Agent claims job | ClaimNextQueuedGatherJob (SKIP LOCKED) |
| in_progress → completed | Agent reports success | CompleteGatherJob |
| in_progress → queued | Agent reports failure (retries remain) | FailGatherJob |
| in_progress → failed | Agent reports failure (max attempts) | FailGatherJob |
| completed → pending | Cron schedule triggers | RescheduleGatherJob |
| in_progress → queued | Job stale >1 hour | ResetStaleGatherJobs |
| * → queued | Claimed agent disconnected | ResetOrphanedGatherJobs |
Error Handling
Exponential Backoff:
retry_delay = 2^(attempts+1) seconds
Attempt 1: 4 seconds
Attempt 2: 8 seconds
Attempt 3: 16 seconds
...
GitHub-Specific Errors:
| HTTP Status | Error Message |
|---|---|
| 401 | GitHub authentication failed: invalid or missing token |
| 403 (rate limit) | GitHub rate limit exceeded: resets at <timestamp> |
| 403 (other) | GitHub access forbidden: repository may be private |
| 404 | GitHub repository not found: <owner/repo> |
| 429 | GitHub secondary rate limit hit: retry after N seconds |
Side Effects
- On Success: Upsert releases into the upstream_releases table (conflict on artifact_name + version)
- On Reschedule: Calculate next run time using the cron expression
- On Config Update: Delete all existing releases (orphaned by the config change)
Incremental Sync (GitHub)
GitHub gather jobs support incremental sync to reduce API calls and improve performance for repositories with many releases.
How it works:
1. First run (full sync): The gatherer fetches all releases by paginating through the GitHub Releases API (up to 100 pages × 100 per page = 10,000 releases max). After completion, the processor writes sync state to gather_jobs.sync_state:
   - full_sync_complete: true if all releases were fetched without hitting the page limit
   - releases_fetched: total count from this run
   - last_full_sync_at: current timestamp (only set for full syncs)
2. Subsequent runs (incremental): If full_sync_complete is true and last_full_sync_at is within the full sync interval (default: 2 weeks), the dispatcher injects an _incremental_since hint into the agent's source config. The GitHub gatherer uses this date to stop pagination early: once it encounters a release older than or equal to the hint date, it stops fetching additional pages.
3. Periodic full sync: When last_full_sync_at exceeds the configured GATHER_FULL_SYNC_INTERVAL (default: 336h / 2 weeks), the dispatcher omits the incremental hint, forcing a full re-fetch. This catches edited releases, backdated tags, or metadata changes that incremental mode would miss.
4. Sync state reset: Sync state is reset to {} when:
   - A gather job's configuration is updated (via PUT /gather-jobs/{id})
   - A gather job's releases are cleared (via POST /gather-jobs/{id}/clear-releases)
Scope: Only github_releases source type supports incremental sync. Helm, OCI, and endoflife.date gatherers always do full fetches (they don’t paginate the same way).
3. Scrape Jobs Domain
Files: pkg/scraper/scraper.go, pkg/parser/*.go, pkg/api/scrape_jobs_handler.go, pkg/storage/queries/scrape_jobs.sql
Purpose
Extract deployed software versions from Git repositories by parsing configuration files (YAML, JSON, or text) to establish the “currently deployed version” for staleness comparison.
Input Parameters
| Field | Required | Description |
|---|---|---|
| repository_url | Conditional | Git repository URL (HTTPS or SSH; not required for manual) |
| target_file | Conditional | Path to file containing version (e.g., Chart.yaml; not required for manual) |
| parse_type | Yes | Parser type: yq, jq, regex, or manual |
| parse_expression | Conditional | Parser-specific expression (not required for manual) |
| ref | Conditional | Git ref to checkout (default: main; not required for manual) |
| credential_name | No | Named credential for authentication |
| schedule | No | Cron expression for recurring execution |
| version_transform | No | Post-parse transformation |
| history_limit | No | Max version snapshots to retain (1-20) |
Parser Implementations
YQ Parser (YAML)
Expression Format: Dot-notation with array indexing
.version → Simple field
.metadata.version → Nested path
.dependencies[0].version → Array access
.items[2].name → Nested array access
Features:
- Supports both map[string]any and map[any]any YAML structures
- Uses gopkg.in/yaml.v3 for parsing
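A stdlib-only sketch of this traversal over an already-decoded document. The real parser first decodes YAML with gopkg.in/yaml.v3 and also handles map[any]any; this simplified version (resolvePath is a hypothetical name, as are its error messages) handles only map[string]any:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// resolvePath walks dot-notation with array indexing (e.g. ".dependencies[0].version")
// over a decoded document tree.
func resolvePath(doc any, expr string) (any, error) {
	cur := doc
	for _, part := range strings.Split(strings.TrimPrefix(expr, "."), ".") {
		key, idx := part, -1
		if open := strings.Index(part, "["); open >= 0 && strings.HasSuffix(part, "]") {
			key = part[:open]
			n, err := strconv.Atoi(part[open+1 : len(part)-1])
			if err != nil {
				return nil, fmt.Errorf("bad index in %q", part)
			}
			idx = n
		}
		m, ok := cur.(map[string]any)
		if !ok {
			return nil, fmt.Errorf("cannot descend into %q", key)
		}
		cur = m[key]
		if idx >= 0 { // step into the array element after the map lookup
			arr, ok := cur.([]any)
			if !ok || idx >= len(arr) {
				return nil, fmt.Errorf("index %d out of range at %q", idx, key)
			}
			cur = arr[idx]
		}
	}
	return cur, nil
}

func main() {
	doc := map[string]any{
		"dependencies": []any{map[string]any{"version": "1.4.0"}},
	}
	v, _ := resolvePath(doc, ".dependencies[0].version")
	fmt.Println(v) // 1.4.0
}
```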
JQ Parser (JSON)
Expression Format: Dot-notation (no array indexing)
.version → Simple field
.dependencies.react → Nested path
Limitations: Does not support array indexing (use YQ for JSON with arrays)
Regex Parser (Text)
Expression Format: Go regex pattern
version:\s*([\d.]+) → Captures version after "version:"
^v(\d+\.\d+\.\d+)$ → Captures semver from full line
Behavior: Returns first capture group if present, otherwise full match
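The capture-group-or-full-match rule can be sketched as follows (parseWithRegex is an illustrative name, and the error text is an assumption):

```go
package main

import (
	"fmt"
	"regexp"
)

// parseWithRegex returns the first capture group if the pattern defines one,
// otherwise the full match, per the behavior described above.
func parseWithRegex(content, pattern string) (string, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return "", err
	}
	m := re.FindStringSubmatch(content)
	if m == nil {
		return "", fmt.Errorf("pattern %q matched nothing", pattern)
	}
	if len(m) > 1 {
		return m[1], nil // first capture group
	}
	return m[0], nil // full match
}

func main() {
	v, _ := parseWithRegex("version: 1.2.3\n", `version:\s*([\d.]+)`)
	fmt.Println(v) // 1.2.3
}
```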
Manual Entry (No Agent)
Manual parse type allows users to enter version strings directly via the API or UI, bypassing the agent-based scraping pipeline entirely.
Behavior:
- No repository is cloned; no file is parsed
- Placeholder values are stored: repository_url = "manual://", target_file = "-", parse_expression = "-"
- Jobs are created with status = 'completed' (never pending; agents never pick them up)
- Version is set via POST /scrape-jobs/{id}/set-version with a version string in the request body
- Version transforms are applied to the manually entered version before storage
- A version snapshot is created and rule evaluation is triggered, identical to agent-completed jobs
Use Cases:
- Demo and testing environments (no agent infrastructure needed)
- Manual version tracking for systems that can’t be scraped
- Quick setup to exercise the full alert pipeline (rules, alerts, notifications)
Key Differences from Agent-Based Jobs:
| Aspect | Agent-Based | Manual |
|---|---|---|
| Task claiming | Polled by agents via SKIP LOCKED | Never enters task queue |
| Initial status | pending | completed |
| “Run Now” button | Triggers agent execution | Not available |
| Version entry | Automatic (parser output) | Via set-version API/UI |
| Required fields | repo URL, ref, target file, expression | Parse type only |
Version Transforms
| Transform | Example Input | Output |
|---|---|---|
| none | 1.2.3 | 1.2.3 |
| add_v_lower | 1.2.3 | v1.2.3 |
| add_v_upper | 1.2.3 | V1.2.3 |
| strip_v_lower | v1.2.3 | 1.2.3 |
| strip_v_upper | V1.2.3 | 1.2.3 |
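The table maps onto a simple switch. applyTransform is a hypothetical helper, and passing unknown transform names through unchanged is an assumption:

```go
package main

import (
	"fmt"
	"strings"
)

// applyTransform applies the named version transform from the table above.
func applyTransform(transform, version string) string {
	switch transform {
	case "add_v_lower":
		return "v" + version
	case "add_v_upper":
		return "V" + version
	case "strip_v_lower":
		return strings.TrimPrefix(version, "v")
	case "strip_v_upper":
		return strings.TrimPrefix(version, "V")
	default: // "none" or unrecognized (assumption: pass through)
		return version
	}
}

func main() {
	fmt.Println(applyTransform("strip_v_lower", "v1.2.3")) // 1.2.3
}
```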
Credential-Aware Assignment
Jobs requiring credentials are only assigned to agents that have those credentials:
-- ClaimNextPendingScrapeJobWithCredentials
WHERE (
credential_name IS NULL
OR credential_name = ANY(available_credentials::VARCHAR[])
)
Flow:
1. Agent declares available_credentials during heartbeat
2. Job poll includes the credential list
3. Server only assigns jobs where the agent has the required credential or the job needs no credential
Git Clone Optimization
All git clone operations use shallow clones for performance and reduced disk usage:
| Setting | Value | Purpose |
|---|---|---|
| Depth | 1 | Only fetch the latest commit (no history) |
| SingleBranch | true | Only fetch the requested branch/tag |
Benefits:
- Significantly faster clone times, especially for large repositories
- Reduced disk usage on agents (no commit history stored)
- Lower bandwidth consumption from git servers
- Minimal data transfer for version extraction (only need file content, not history)
Implementation (pkg/git/cloner.go):
cloneOpts := &git.CloneOptions{
URL: url,
Depth: 1, // Shallow clone
SingleBranch: true, // Only requested ref
ReferenceName: ref,
}
Temporary Directory Lifecycle:
1. Create temp directory (planekeeper-clone-*)
2. Shallow clone the repository
3. Read the target file for version extraction
4. Delete the temp directory (cleanup)
State Machine
Identical to Gather Jobs (see Section 2), with these additional behaviors:
- On Success: Create a version_snapshot record, trigger async rule evaluation
- History Limit: Older snapshots beyond the limit are automatically deleted
- Manual Jobs: Start and remain in completed status. Version updates via set-version create snapshots and trigger rule evaluation without changing job status
Side Effects
- On Success:
  - Insert a version_snapshot with version, raw_content, commit_sha, metadata
  - Trigger ruleEvaluator.EvaluateForOrg() asynchronously
- On History Limit Exceeded: Delete oldest snapshots via the orphan cleanup service
Version Snapshot Storage
Files: pkg/storage/queries/version_snapshots.sql, pkg/api/tasks_handler.go:661-706
Full History Tracking
Every scrape creates a new version snapshot record, regardless of whether the version has been seen before. This enables proper tracking of version changes including rollbacks.
Example Scenario:
Time T1: Scrape finds version 6.11.1 → Snapshot #1 created
Time T2: Scrape finds version 9.3.7 → Snapshot #2 created
Time T3: Rollback to 6.11.1 → Snapshot #3 created (NOT a duplicate)
Key Design Decisions:
No Unique Constraint on Version: The table allows multiple snapshots with the same version for the same job. This is intentional—each scrape represents a point-in-time observation.
ID-Based Ordering: The “latest” version snapshot is determined by ORDER BY id DESC, not by timestamp. Since IDs are auto-incrementing, this guarantees the most recently inserted row is always returned, regardless of timestamp precision issues.
History Retention: The history_limit setting on scrape jobs controls how many snapshots to retain. Older snapshots beyond this limit are automatically purged.
Database Schema
-- No unique constraint on version - allows duplicate versions for rollback tracking
CREATE TABLE version_snapshots (
id BIGINT PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
organization_id BIGINT NOT NULL,
scrape_job_id BIGINT,
repository_url VARCHAR(2048) NOT NULL,
ref VARCHAR(256),
target_file VARCHAR(1024) NOT NULL,
version VARCHAR(256) NOT NULL,
raw_content TEXT,
metadata JSONB DEFAULT '{}',
discovered_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
-- Index for efficient latest-version lookups
CREATE INDEX idx_version_snapshots_latest
ON version_snapshots(scrape_job_id, id DESC);
Query Pattern
-- GetLatestVersionSnapshot: Returns most recently inserted snapshot
SELECT version, discovered_at
FROM version_snapshots
WHERE scrape_job_id = $1
ORDER BY id DESC -- NOT discovered_at - ID is more reliable
LIMIT 1;
Why ID instead of timestamp?
- Timestamps can have precision issues (same second, microsecond truncation)
- Multiple inserts in quick succession may get identical timestamps
- Auto-incrementing IDs guarantee insertion order
- Avoids race conditions during concurrent operations
Rule Evaluation Timing
When a scrape job completes (agent-based):
1. Agent sends task completion to API
2. API creates version snapshot (INSERT)
3. API marks job as completed
4. API triggers rule evaluation asynchronously
└── EvaluateForOrg() spawns goroutine internally
5. Rule engine queries for latest version snapshot
└── Uses id DESC to get most recently inserted row
When a manual version is set:
1. User calls POST /scrape-jobs/{id}/set-version with version string
2. API applies version transform (if configured)
3. API creates version snapshot using shared recordVersionSnapshot helper
4. API triggers rule evaluation asynchronously (same as agent flow)
The rule evaluation runs asynchronously (via goroutine) but uses a fresh database connection that sees the committed snapshot data.
4. Helm Sync Jobs Domain
Files: pkg/api/helm_sync_handlers.go, pkg/taskengine/processor.go:293-427
Purpose
Automatically discover Helm charts from a repository and create/manage child gather jobs for each chart, enabling bulk monitoring of Helm-based deployments.
Input Parameters
| Field | Required | Description |
|---|---|---|
repository_url | Yes | Helm repository URL |
chart_filter | No | Regex to filter charts by name |
default_schedule | No | Cron schedule inherited by child jobs |
default_tag_filter | No | Tag filter inherited by child jobs |
default_version_regex | No | Version regex inherited by child jobs |
auto_delete_orphans | No | Delete child jobs for removed charts |
Business Rules
Chart Discovery:
- Fetch index.yaml from the repository
- Apply the optional chart_filter regex
- Extract chart name, description, latest version
Child Job Management:
For each discovered chart:
├── Build artifact_name: "repo-url/chart-name"
├── Check if gather job exists for this artifact
│ ├── YES: Skip (job already exists)
│ └── NO: Create gather job with:
│ ├── source_type = helm_repository
│ ├── schedule = default_schedule
│ ├── tag_filter = default_tag_filter
│ ├── version_regex = default_version_regex
│ └── parent_sync_job_id = this job's ID
└── Track artifact_name in discovered list
Orphan Deletion (when auto_delete_orphans = true):
DELETE FROM gather_jobs
WHERE parent_sync_job_id = @sync_job_id
AND artifact_name NOT IN (@discovered_artifacts)
State Machine
Same as Gather Jobs, with additional post-completion processing to create/delete child jobs.
Side Effects
- On Success:
- Create gather jobs for newly discovered charts
- Delete gather jobs for removed charts (if auto_delete_orphans)
- Reschedule if cron schedule exists
5. Task Execution System
Files: pkg/taskengine/dispatcher.go, pkg/taskengine/types.go, pkg/shared/types.go, pkg/taskengine/errors.go, pkg/api/tasks_handler.go
Purpose
Coordinate distributed job execution between agents and server through a token-based system with idempotency guarantees.
Token Lifecycle
┌──────────────────────────────────────────────────────────────────┐
│ TOKEN LIFECYCLE │
├──────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────┐ ┌────────────┐ ┌────────────┐ ┌────────┐ │
│ │GENERATION│───▶│ VALIDATION │───▶│ COMPLETION │───▶│CONSUMED│ │
│ └──────────┘ └────────────┘ └────────────┘ └────────┘ │
│ │ │ │ │ │
│ ▼ ▼ ▼ ▼ │
│ UUID created Check expiry Mark complete Token │
│ Expiry set Check status Process result invalidated │
│ Stored in DB Return task Update job │
│ │
└──────────────────────────────────────────────────────────────────┘
Generation:
- New UUID via uuid.New()
- Expiry: time.Now().UTC().Add(timeout) (default 5 minutes, configurable via task_execution_timeout_seconds)
- Stored in the task_executions table
Validation:
- Lookup by token
- Check status is in_progress
- Check not expired
Completion:
- Update status to completed or failed
- Process job-specific results
- Clear the execution link from the job
Assignment Flow
Agent polls: POST /tasks/{AgentUUID}/poll
│
├── Capabilities: ["gather", "scrape", "helm_sync"]
├── Available credentials: ["github_token", "ssh_key"]
├── Organization ID: from agent's API key
│
▼
Dispatcher.AssignTaskWithCredentials()
│
├── Try ClaimNextPendingGatherJob (SKIP LOCKED, org-scoped)
├── Try ClaimNextPendingScrapeJob (SKIP LOCKED, org-scoped)
└── Try ClaimNextPendingHelmSyncJob (SKIP LOCKED, org-scoped)
│
▼
CreateTaskExecution() + Return TaskAssignment
│
├── execution_token: UUID
├── expires_at: timestamp
├── job_type: "gather"|"scrape"|"helm_sync"
├── job_id: int64
└── task_data: job-specific details
Completion Flow
Agent completes: POST /tasks/{AgentUUID}/complete
│
├── execution_token: UUID
├── success: bool
├── error: optional string
└── result_data: JSON
│
▼
CompleteExecution()
│
├── Lookup execution by token
│ └── Not found → ErrTokenInvalid (409)
│
├── Check status == 'in_progress'
│ └── Already completed → ErrTokenAlreadyCompleted (202)
│
├── Check not expired
│ └── Expired → ErrTokenExpired (409)
│
└── Update execution status
│
▼
ProcessTaskResult() (job-type specific)
│
└── Return 202 Accepted
Error States
| Error | HTTP Status | Scenario |
|---|---|---|
| ErrNoTasksAvailable | 204 No Content | No pending tasks match agent capabilities |
| ErrTokenInvalid | 409 Conflict | Token not found in database |
| ErrTokenExpired | 409 Conflict | Token past expiry timestamp |
| ErrTokenAlreadyCompleted | 202 Accepted | Idempotent retry (result already recorded) |
| ErrJobNotFound | 500 | Job missing during result processing |
| ErrUnsupportedJobType | 500 | Unknown job type in execution |
Idempotency Guarantees
Database-Level (SKIP LOCKED):
SELECT id FROM jobs
WHERE status = 'pending'
AND next_run_at <= CURRENT_TIMESTAMP
ORDER BY next_run_at ASC
LIMIT 1
FOR UPDATE SKIP LOCKED
Prevents multiple agents from claiming the same job.
Organization-Level (Org-Scoped Claims):
All ClaimNextQueued* queries filter by organization_id, ensuring agents only claim jobs belonging to their own organization. The server agent (org_id=0) only claims global jobs; client agents only claim their respective organization’s jobs. This enforces multi-tenant isolation at the task claim layer.
Token-Level (State Check):
if execution.Status != repository.JobStatusInProgress {
return nil, ErrTokenAlreadyCompleted
}
Allows safe agent retries without duplicate processing.
6. Background Scheduler
Files: pkg/taskengine/engine.go, pkg/taskengine/scheduler.go, pkg/taskengine/processor.go
Processing Loops
The TaskEngine runs two concurrent background loops:
| Loop | Interval | Purpose | Batch Size |
|---|---|---|---|
| Processor | 5 seconds | Process pending results | 100 |
| Scheduler | 30 seconds | Timeout detection + scheduled activation | 100 |
Timeout Handling
Stale Job Detection:
Jobs in in_progress status for more than 1 hour are reset to pending:
-- ResetStaleGatherJobs / ResetStaleScrapeJobs
UPDATE jobs
SET status = 'pending',
claimed_by = NULL,
claimed_at = NULL
WHERE status = 'in_progress'
AND claimed_at < NOW() - INTERVAL '1 hour'
Purpose: Recover from agent crashes where no completion/failure was reported.
Scheduled Activation
Jobs with cron schedules follow this lifecycle:
┌──────────┐ ┌─────────────┐ ┌───────────┐ ┌───────────┐
│ COMPLETED│──▶│ RESCHEDULE │──▶│ WAITING │──▶│ ACTIVATED │
│ │ │ │ │ │ │ │
│ Job done │ │ Calculate │ │next_run_at│ │ Set status│
│ │ │ next_run_at │ │ in future │ │ = pending │
└──────────┘ └─────────────┘ └───────────┘ └───────────┘
│ │
│ time passes │
└───────────────┘
Rescheduling (Processor, on completion):
nextRun := cron.NextRun(job.Schedule, time.Now().UTC())
repo.RescheduleJob(jobID, nextRun)
Activation (Scheduler, every 30 seconds):
-- GetScheduledJobsReadyToRun
SELECT id FROM jobs
WHERE status = 'completed'
AND schedule IS NOT NULL
AND next_run_at <= CURRENT_TIMESTAMP
-- ActivateScheduledJob
UPDATE jobs
SET status = 'pending',
next_run_at = NULL
WHERE id = @id
Orphan Detection
Service-Based Orphan Recovery:
Jobs claimed by agents no longer in service_instances table are reset:
-- ResetOrphanedGatherJobs
UPDATE gather_jobs
SET status = 'pending',
claimed_by = NULL
WHERE status IN ('pending', 'in_progress')
AND claimed_by IS NOT NULL
AND claimed_by NOT IN (
SELECT instance_uuid FROM service_instances WHERE service_id = 2
)
Orphan Cleanup Service (Server):
- Runs every 2 minutes
- 30-second startup delay (allows agents to register)
- Also enforces version snapshot history limits
7. Rules Engine
Files: pkg/rules/evaluator.go, pkg/rules/engine.go, pkg/rules/types.go
Rule Types
| Rule Type | Measures | Calculation |
|---|---|---|
| days_behind | Age of discovered version | time.Since(releaseDate).Hours() / 24 |
| majors_behind | Major version difference | latestMajor - discoveredMajor |
| minors_behind | Minor version difference | Release-list counting or formula-based |
Threshold Tiers
Each rule defines three thresholds (must be ordered: moderate ≤ high ≤ critical):
type Rule struct {
ModerateThreshold int // First tier
HighThreshold int // Second tier
CriticalThreshold int // Third tier
}
Severity Determination (highest wins):
if behindBy >= CriticalThreshold → CRITICAL
else if behindBy >= HighThreshold → HIGH
else if behindBy >= ModerateThreshold → MODERATE
else → No violation
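The highest-wins check above can be sketched as follows (severityFor and its string return values are illustrative; the real engine presumably uses typed severities):

```go
package main

import "fmt"

// severityFor applies the highest-wins threshold check; thresholds are
// expected to satisfy moderate <= high <= critical.
func severityFor(behindBy, moderate, high, critical int) string {
	switch {
	case behindBy >= critical:
		return "critical"
	case behindBy >= high:
		return "high"
	case behindBy >= moderate:
		return "moderate"
	default:
		return "" // no violation
	}
}

func main() {
	fmt.Println(severityFor(45, 30, 60, 90)) // moderate
}
```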
Evaluation Algorithm
1. Validate inputs (config, rule, discovered version required)
│
▼
2. Get latest release from gather job
└── If none available → No violation (can't compare)
│
▼
3. Apply stable_only filter (if enabled)
└── If latest is prerelease → Find next stable release
│
▼
4. Calculate behindBy based on rule type:
│
├── days_behind:
│ └── Get discovered version's release date
│ └── If not found → ErrVersionNotFound → CRITICAL
│ └── behindBy = days since release
│
├── majors_behind:
│ └── Parse both versions as semver
│ └── If parse fails → ErrVersionParseFailed → CRITICAL
│ └── behindBy = latestMajor - discoveredMajor
│
└── minors_behind:
├── Preferred: Count unique major.minor from releases list
└── Fallback: Formula calculation
│
▼
5. Determine severity from thresholds
│
▼
6. Return EvaluationResult with severity, behindBy, message
Minors Behind Calculation
Release-List Mode (preferred):
Counts unique major.minor combinations between discovered and latest versions.
Formula Mode (fallback when releases unavailable):
Same major: latestMinor - discoveredMinor
Different major: (latestMajor - discoveredMajor) + latestMinor
Example: 7.9 → 8.1 = (8-7) + 1 = 2 minors behind
Example: 6.11 → 8.1 = (8-6) + 1 = 3 minors behind
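The fallback arithmetic can be sketched as follows (minorsBehindFormula is a hypothetical name; it assumes the major/minor components were already parsed from semver):

```go
package main

import "fmt"

// minorsBehindFormula reproduces the formula-mode calculation above:
// same major: latestMinor - discoveredMinor
// different major: (latestMajor - discoveredMajor) + latestMinor
func minorsBehindFormula(discMajor, discMinor, latestMajor, latestMinor int) int {
	if discMajor == latestMajor {
		return latestMinor - discMinor
	}
	return (latestMajor - discMajor) + latestMinor
}

func main() {
	fmt.Println(minorsBehindFormula(7, 9, 8, 1))  // 2
	fmt.Println(minorsBehindFormula(6, 11, 8, 1)) // 3
}
```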
Stable-Only Filtering
When stable_only = true, prereleases are excluded from latest version lookup.
Prerelease Detection (IsStableVersion()):
Returns false if the version contains (case-insensitive): alpha, beta, rc, dev, snapshot, canary, nightly, pre
Special Cases
Version Not Found:
- Trigger: Discovered version has no release record (for days_behind)
- Result: Marked as CRITICAL with behindBy = -1
- Message: "Version X not found in upstream releases - cannot determine age (marked as critical)"
Version Parse Failure:
- Trigger: Semver parsing fails for either version
- Result: Marked as CRITICAL with behindBy = -1
- Message: "Cannot parse version for comparison: <error> (marked as critical)"
8. Alert System
Files: pkg/alerts/service.go, pkg/api/alerts_handlers.go, pkg/api/alert_configs_handler.go, pkg/rules/engine.go, pkg/storage/queries/alerts.sql
Core Concepts
One Alert Per Config: Each alert configuration has at most ONE active alert at any time. When a scrape job discovers a new version, the existing alert is updated in place rather than creating a new alert. This ensures alerts always reflect the current state of the monitored system.
Soft Delete: Resolved alerts are soft-deleted (marked with resolved_at timestamp) rather than permanently deleted. This preserves alert history for auditing and analysis.
Alert Lifecycle
┌─────────────────────────────────────────────────────────────────┐
│ ALERT LIFECYCLE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────┐ ┌────────────────┐ ┌─────────────────────┐ │
│ │ CREATED │───▶│ VERSION UPDATE │───▶│ ACKNOWLEDGED │ │
│ │ │ │ │ │ │ │
│ │ Violation│ │ Alert updated │ │ User marks as │ │
│ │ detected │ │ with new │ │ reviewed │ │
│ └──────────┘ │ version data │ └─────────────────────┘ │
│ │ └────────────────┘ │ │
│ │ │ │ │
│ │ │ ┌─────────────────┘ │
│ │ │ │ │
│ │ ▼ ▼ │
│ │ ┌──────────────────┐ │
│ │ │ UNACKNOWLEDGED │ │
│ │ │ │ │
│ │ │ Ack reset when │ │
│ │ │ version changes │ │
│ │ └──────────────────┘ │
│ │ │ │
│ │ │ │
│ ▼ ▼ │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ RESOLVED │ │
│ │ Version updated, no longer violates │ │
│ │ → Alert soft-deleted (resolved_at = now) │ │
│ │ → Preserved in history via /alerts/resolved endpoint │ │
│ └────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
Database Schema
Key Fields:
| Field | Type | Description |
|---|---|---|
| alert_config_id | BIGINT | Link to alert configuration (unique per active alert) |
| discovered_version | VARCHAR | Version found by scrape job |
| latest_version | VARCHAR | Latest upstream version |
| behind_by | INT | How far behind (days, majors, or minors) |
| severity | ENUM | moderate, high, or critical |
| is_acknowledged | BOOLEAN | Whether user has acknowledged |
| resolved_at | TIMESTAMP | When resolved (NULL = active) |
Unique Constraint: Partial unique index on alert_config_id WHERE resolved_at IS NULL
- Ensures only ONE active alert per config
- Allows multiple resolved alerts in history
Upsert Behavior
Alerts use a partial unique index for upsert operations:
INSERT INTO alerts (...)
ON CONFLICT (alert_config_id) WHERE resolved_at IS NULL
DO UPDATE SET
discovered_version = EXCLUDED.discovered_version,
latest_version = EXCLUDED.latest_version,
behind_by = EXCLUDED.behind_by,
severity = EXCLUDED.severity,
details = EXCLUDED.details,
updated_at = CURRENT_TIMESTAMP,
-- Only reset acknowledgement when version actually changes
is_acknowledged = CASE
WHEN alerts.discovered_version != EXCLUDED.discovered_version THEN FALSE
ELSE alerts.is_acknowledged
END,
acknowledged_by = CASE
WHEN alerts.discovered_version != EXCLUDED.discovered_version THEN NULL
ELSE alerts.acknowledged_by
END,
acknowledged_at = CASE
WHEN alerts.discovered_version != EXCLUDED.discovered_version THEN NULL
ELSE alerts.acknowledged_at
END
Key Behaviors:
- Single Alert: Each config has at most one active alert
- In-Place Updates: When the scrape job discovers a new version, the existing alert updates with the new version data
- Smart Acknowledgment Reset: Acknowledgment is only cleared when the discovered version changes, not on every evaluation
- Preserved History: Resolved alerts remain in the database with resolved_at set
Alert Resolution (Soft Delete)
When an alert config no longer violates (e.g., version was updated):
-- ResolveAlert: Soft delete by setting resolved_at
UPDATE alerts
SET resolved_at = CURRENT_TIMESTAMP,
updated_at = CURRENT_TIMESTAMP
WHERE id = @id AND resolved_at IS NULL
RETURNING *;
Resolution triggers notification: When an alert is resolved, an alert.resolved notification is dispatched to configured channels.
Audit Trail (Alert Actions)
Every alert state change is recorded in the audit_trail table using a decorator pattern. The AuditedAlertService wraps the AlertService interface and transparently records entries after each successful operation.
Tracked Actions:
| Action | Trigger | Source |
|---|---|---|
created | Rule evaluation creates a new alert | system |
escalated | Rule evaluation increases severity | system |
acknowledge | User acknowledges via UI/API/webhook | ui, api, or webhook |
unacknowledge | User removes acknowledgment, or version changes reset it | ui, api, or system |
resolve | User manually resolves, or rule evaluation resolves | ui, api, or system |
Source Determination: The source is derived from the authentication method used:
- JWT authentication → ui (browser-based user action)
- API key authentication → api (programmatic access)
- Webhook callback token → webhook (external callback)
- No auth context (system operation) → system (rules engine, auto-reset)
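The precedence above can be sketched as a simple switch. This is an illustrative sketch: the AuthContext type and field names are assumptions, not Planekeeper's actual request types.

```go
package main

import "fmt"

// AuthContext is a hypothetical view of the request's auth state;
// the real middleware inspects headers and context locals instead.
type AuthContext struct {
	HasJWT          bool
	HasAPIKey       bool
	HasWebhookToken bool
}

// deriveSource maps the authentication method to an audit-trail source,
// following the precedence described above (JWT wins over API key, etc.).
func deriveSource(a AuthContext) string {
	switch {
	case a.HasJWT:
		return "ui" // browser-based user action
	case a.HasAPIKey:
		return "api" // programmatic access
	case a.HasWebhookToken:
		return "webhook" // external callback
	default:
		return "system" // rules engine, auto-reset
	}
}

func main() {
	fmt.Println(deriveSource(AuthContext{HasJWT: true}))
	fmt.Println(deriveSource(AuthContext{}))
}
```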
Data Model: The audit_trail table uses a polymorphic design (entity_type + entity_id) that can be extended to other entity types in the future. Each entry stores the action type, source, optional actor email (for user-initiated actions), and optional metadata as JSONB. Audit records persist independently of entity lifecycle (no cascade delete on the alert FK).
API Endpoint: GET /alerts/{id}/actions returns paginated audit trail entries for a specific alert. The alert detail UI merges these entries with notification deliveries into a unified activity timeline sorted by timestamp.
Prometheus Metrics: The audit writer exposes planekeeper_audit_writer_events_written_total, planekeeper_audit_writer_persist_errors_total, and planekeeper_audit_writer_insert_duration_seconds_total via the /metrics endpoint.
Auto-Triggers (Event-Driven)
Rule evaluation is triggered asynchronously via the event bus system (pkg/events/). When triggering events occur, they are published to the event bus, and the RuleEvaluationSubscriber handles the evaluation in a goroutine.
Event Flow:
┌─────────────────────┐ ┌──────────────────┐ ┌─────────────────────────┐
│ Job Completion / │───▶│ Event Bus │───▶│ RuleEvaluationSubscriber│
│ Config Change │ │ (pkg/events/) │ │ (async goroutine) │
└─────────────────────┘ └──────────────────┘ └─────────────────────────┘
Triggering Events:
| Event Type | Trigger | Published By |
|---|---|---|
job.scrape.completed | Scrape job completes successfully | tasks_handler.go |
job.gather.completed | Gather job completes successfully | tasks_handler.go |
alert_config.created | Alert config created and active | alert_configs_handler.go |
alert_config.updated | Alert config updated and active | alert_configs_handler.go |
alert_config.toggled | Alert config toggled to active | alert_configs_handler.go |
Benefits of Event-Driven Triggers:
- Decoupling: Handlers don’t need direct references to the rule evaluation logic
- Non-blocking: HTTP handlers return immediately; evaluation runs async
- Extensibility: Other subscribers can react to the same events (metrics, logging, etc.)
- Scalability: Event bus can be swapped for external queue (Redis, NATS) for distributed evaluation
Alert Config Composition
An alert config links three entities:
┌─────────────────────────────────────────────────────────────┐
│ ALERT CONFIG │
├─────────────────────────────────────────────────────────────┤
│ │
│ scrape_job_id ───▶ "What version did we deploy?" │
│ (discovered_version) │
│ │
│ gather_job_id ───▶ "What's the latest available?" │
│ (latest_version) │
│ │
│ rule_id ───▶ "How do we measure staleness?" │
│ (days_behind, majors_behind, etc.) │
│ │
│ UNIQUE (org_id, scrape_job_id, gather_job_id, rule_id) │
│ │
└─────────────────────────────────────────────────────────────┘
API Endpoints
Active Alerts:
| Method | Path | Description |
|---|---|---|
| GET | /alerts | List active (non-resolved) alerts |
| GET | /alerts/summary | Count active alerts by severity |
| POST | /alerts/{id}/acknowledge | Acknowledge an active alert |
| POST | /alerts/{id}/unacknowledge | Remove acknowledgement |
| POST | /alerts/acknowledge | Bulk acknowledge multiple alerts |
Resolved Alerts (History):
| Method | Path | Description |
|---|---|---|
| GET | /alerts/resolved | List resolved alerts with pagination |
Query Parameters (for both endpoints):
- limit - Max results (default 50, max 100)
- offset - Pagination offset
- severity - Filter by severity level
- unacknowledged_only - Only show unacknowledged (active only)
Notification Events
The alert system dispatches notifications for lifecycle events:
| Event | Trigger |
|---|---|
alert.created | New alert created (first violation) |
alert.escalated | Severity increased (e.g., high → critical) |
alert.acknowledged | User acknowledges via API |
alert.unacknowledged | User removes acknowledgement |
alert.resolved | Version updated, no longer violates |
Note: Non-escalating updates (same severity, just refreshed data) do not trigger notifications.
Event-Driven Alert Service
Files: pkg/alerts/service.go
All alert state changes flow through a centralized alert service (pkg/alerts/Service) that automatically dispatches notifications. This ensures consistent notification behavior regardless of how alerts are modified (API, webhook, rules engine).
Architecture:
┌─────────────────────────────────────────────────────────────────────┐
│ ALERT STATE CHANGES │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────┐ │
│ │ API Handlers │──┐ │
│ │ (ack/unack) │ │ │
│ └─────────────────┘ │ │
│ │ ┌─────────────────────────────────────┐ │
│ ┌─────────────────┐ │ │ ALERT SERVICE │ │
│ │ Webhook Ack │──┼─────▶│ pkg/alerts/service.go │ │
│ │ (external) │ │ │ │ │
│ └─────────────────┘ │ │ • Acknowledge() │ │
│ │ │ • Unacknowledge() │ │
│ ┌─────────────────┐ │ │ • Upsert() (create/update) │ │
│ │ Rules Engine │──┘ │ • Resolve() │ │
│ │ (evaluation) │ │ │ │
│ └─────────────────┘ │ ───────────────────────────────── │ │
│ │ Auto-dispatches notifications: │ │
│ │ • alert.created │ │
│ │ • alert.escalated │ │
│ │ • alert.acknowledged │ │
│ │ • alert.unacknowledged │ │
│ │ • alert.resolved │ │
│ └──────────────┬──────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────┐ │
│ │ Notification Dispatcher │ │
│ │ pkg/notifications/dispatcher.go │ │
│ └─────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────┘
Service Methods:
| Method | Description | Event Dispatched |
|---|---|---|
Acknowledge() | Mark alert acknowledged | alert.acknowledged |
Unacknowledge() | Remove acknowledgment | alert.unacknowledged |
Upsert() | Create or update alert | alert.created or alert.escalated |
Resolve() | Mark alert resolved | alert.resolved |
ResolveByConfigID() | Resolve by config | alert.resolved |
BulkAcknowledge() | Bulk operation | None (avoids spam) |
Benefits:
- Consistent Notifications: Every state change automatically dispatches events
- Single Source of Truth: All alert logic centralized in one service
- Thin Handlers: API handlers become simple pass-through to service
- Testability: Service can be mocked for unit testing
Usage Pattern:
// Before (scattered notification dispatch):
alert, err := repo.AcknowledgeAlert(ctx, params)
dispatcher.DispatchForAlert(ctx, alert, EventAlertAcknowledged)
// After (single service call does everything):
alert, err := alertService.Acknowledge(ctx, params)
// Notification automatically dispatched
Event Bus System
Files: pkg/events/bus.go, pkg/events/types.go, pkg/events/subscribers.go
The system uses an in-process event bus for decoupled asynchronous communication between components. This enables loose coupling while maintaining reliability within a single process.
Architecture:
┌─────────────────────────────────────────────────────────────────────┐
│ EVENT BUS SYSTEM │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ Publishers Event Bus Subscribers │
│ ─────────── ───────── ─────────── │
│ │
│ ┌─────────────────┐ ┌─────────────────────┐ │
│ │ TasksHandler │──┐ │ │ ┌──────────────┐ │
│ │ (job completed) │ │ │ In-process Bus │ │ RuleEval │ │
│ └─────────────────┘ │ │ │ │ Subscriber │ │
│ ├──▶│ • Buffered channel │──▶│ │ │
│ ┌─────────────────┐ │ │ • Async delivery │ │ Evaluates │ │
│ │ AlertConfigs │──┘ │ • Panic recovery │ │ rules for │ │
│ │ (config change) │ │ │ │ organization │ │
│ └─────────────────┘ └─────────────────────┘ └──────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────┘
Event Types:
| Event Type | Description | Payload |
|---|---|---|
job.scrape.completed | Scrape job finished | JobCompletedEvent |
job.gather.completed | Gather job finished | JobCompletedEvent |
job.helm_sync.completed | Helm sync job finished | JobCompletedEvent |
alert_config.created | Alert config created | AlertConfigChangedEvent |
alert_config.updated | Alert config updated | AlertConfigChangedEvent |
alert_config.toggled | Alert config toggled | AlertConfigChangedEvent |
Event Bus Features:
- Buffered Channel: Default buffer of 100 events; non-blocking publish with overflow warning
- Async Delivery: Single dispatcher goroutine processes events sequentially
- Panic Recovery: Handler panics are caught and logged without crashing the bus
- Graceful Shutdown: Stop() method drains pending events before closing
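A minimal in-process bus with these four features can be sketched as follows. This is a rough approximation, not pkg/events' actual API; the type and method names are assumptions.

```go
package main

import (
	"fmt"
	"log"
	"sync"
)

type EventType string

type Event struct {
	Type    EventType
	Payload any
}

type Handler func(Event)

type Bus struct {
	mu       sync.RWMutex
	handlers map[EventType][]Handler
	ch       chan Event
	done     chan struct{}
}

func NewBus(buffer int) *Bus {
	b := &Bus{
		handlers: make(map[EventType][]Handler),
		ch:       make(chan Event, buffer), // buffered channel (the text says default 100)
		done:     make(chan struct{}),
	}
	go b.dispatch() // single dispatcher goroutine processes events sequentially
	return b
}

func (b *Bus) Subscribe(t EventType, h Handler) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.handlers[t] = append(b.handlers[t], h)
}

// Publish is non-blocking: when the buffer is full the event is dropped
// with an overflow warning instead of stalling the caller.
func (b *Bus) Publish(e Event) {
	select {
	case b.ch <- e:
	default:
		log.Printf("event bus buffer full, dropping %s", e.Type)
	}
}

func (b *Bus) dispatch() {
	for e := range b.ch {
		b.mu.RLock()
		hs := b.handlers[e.Type]
		b.mu.RUnlock()
		for _, h := range hs {
			b.safeCall(h, e) // per-handler panic recovery
		}
	}
	close(b.done)
}

func (b *Bus) safeCall(h Handler, e Event) {
	defer func() {
		if r := recover(); r != nil {
			log.Printf("handler panic for %s: %v", e.Type, r)
		}
	}()
	h(e)
}

// Stop drains pending events before returning.
func (b *Bus) Stop() {
	close(b.ch)
	<-b.done
}

func main() {
	bus := NewBus(100)
	bus.Subscribe("job.scrape.completed", func(e Event) {
		fmt.Println("evaluating rules for", e.Payload)
	})
	bus.Publish(Event{Type: "job.scrape.completed", Payload: "org-42"})
	bus.Stop()
}
```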
Subscriber Pattern:
// Subscribers implement their own logic
type RuleEvaluationSubscriber struct {
db *postgres.Database
alertService *alerts.Service
}
// Subscribe registers handlers for relevant event types
func (s *RuleEvaluationSubscriber) Subscribe(bus *events.Bus) {
bus.SubscribeMultiple([]events.EventType{
events.EventJobScrapeCompleted,
events.EventJobGatherCompleted,
}, s.handleJobCompleted)
}
Future Extensibility:
The event bus interface can be swapped for an external message queue (Redis Streams, NATS, RabbitMQ) if horizontal scaling requires distributed event processing. The subscriber pattern remains the same; only the transport layer changes.
9. Notification System
Files: pkg/notifications/, pkg/api/notification_*_handlers.go, app/notifier/, pkg/storage/queries/notification_*.sql
Purpose
Deliver notifications about alert events to external systems via webhooks and other channel types. The system supports configurable routing rules, retry logic with exponential backoff, and acknowledgment callbacks.
Architecture
┌──────────────────────────────────────────────────────────────────┐
│ SERVER │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Dispatcher │ │
│ │ - Evaluates notification rules on alert events │ │
│ │ - Creates delivery records in notification_deliveries │ │
│ │ - Does NOT send webhooks directly │ │
│ └──────────────────────────┬──────────────────────────────────┘ │
└─────────────────────────────┼────────────────────────────────────┘
│ INSERT into notification_deliveries
▼
┌─────────────────────────────────────────────────────────────────┐
│ PostgreSQL │
│ notification_deliveries (status: pending) │
└─────────────────────────────┬───────────────────────────────────┘
│ FOR UPDATE SKIP LOCKED
▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ notifier │ │ notifier │ │ notifier │
│ replica 1 │ │ replica 2 │ │ replica N │
│ (worker) │ │ (worker) │ │ (worker) │
└─────────────┘ └─────────────┘ └─────────────┘
│ │ │
└────────────────┼────────────────┘
▼
External Webhooks
9.1 Channel Types
| Channel Type | Status | Description |
|---|---|---|
webhook | Implemented | HTTP POST to external URL with JSON payload |
pagerduty | Planned | Native PagerDuty Events API integration |
telegram | Planned | Telegram Bot API integration |
smtp | Planned | Email notifications |
9.2 Notification Channels
Table: notification_channels
Channels define delivery endpoints (webhooks) with organization-scoped configuration.
| Field | Description |
|---|---|
id | Unique identifier |
organization_id | Owning organization |
name | Human-readable name |
channel_type | Type: webhook, pagerduty, telegram, smtp |
config | JSONB with channel-specific configuration |
is_active | Whether channel is enabled |
max_retries | Per-channel retry override (NULL = global default) |
last_test_at | Last test timestamp |
last_test_success | Last test result |
Webhook Config Structure (stored in config JSONB):
{
"url": "https://example.com/webhook",
"method": "POST",
"headers": {"Authorization": "Bearer xxx"},
"timeout_seconds": 30,
"ack_enabled": true,
"secret": "hmac-signing-secret",
"event_templates": {
"new_alert": "",
"acknowledged": "",
"resolved": ""
}
}
9.2.1 Event-Specific Templates
The notification system supports event-specific templates that allow customizing webhook payloads for different alert lifecycle events. This enables integration with services like Discord and Slack that require specific payload formats.
Template Categories:
| Category | Events | Description |
|---|---|---|
new_alert | alert.created, alert.escalated, test | New or escalated alerts |
acknowledged | alert.acknowledged, alert.unacknowledged | Acknowledgment state changes |
resolved | alert.resolved | Alert resolution |
Template Resolution Priority:
Templates are resolved in order of specificity (most specific wins):
1. Channel-specific template (config.event_templates.X)
│
▼ (if empty)
2. Organization-level template (settings table)
│
▼ (if empty)
3. Global default template (settings table, org_id = NULL)
│
▼ (if empty)
4. Standard JSON payload (no template)
This allows:
- Global defaults for all organizations (generic JSON for standard webhooks)
- Organization overrides for org-wide customization
- Channel-specific templates for platforms like Discord/Slack
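The four-level fallback reduces to "first non-empty template wins"; a sketch (function and argument names are illustrative, not the real resolver's signature):

```go
package main

import "fmt"

// resolveTemplate returns the most specific non-empty template, mirroring
// the priority chain above: channel config, then org settings, then the
// global default. ok=false means "use the standard JSON payload".
func resolveTemplate(channelTpl, orgTpl, globalTpl string) (tpl string, ok bool) {
	for _, t := range []string{channelTpl, orgTpl, globalTpl} {
		if t != "" {
			return t, true
		}
	}
	return "", false // level 4: no template, send the standard JSON payload
}

func main() {
	tpl, ok := resolveTemplate("", `{"content":"org default"}`, `{"text":"global"}`)
	fmt.Println(ok, tpl)
}
```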
Global Template Settings Keys:
| Setting Key | Category |
|---|---|
notification.template.alert.new | New/escalated alerts |
notification.template.alert.acknowledged | Acknowledgment events |
notification.template.alert.resolved | Resolution events |
9.2.2 Template Variables
Common Variables (available in all templates):
{{.IdempotencyKey}} - UUID, stable across retries
{{.Event}} - Event type (alert.created, etc.)
{{.Timestamp}} - ISO8601 timestamp
{{.Alert.ID}} - Alert ID
{{.Alert.ConfigName}} - Alert config name
{{.Alert.RuleName}} - Rule name
{{.Alert.RuleType}} - days_behind, majors_behind, etc.
{{.Alert.Severity}} - critical, high, moderate
{{.Alert.DiscoveredVersion}} - Current deployed version
{{.Alert.LatestVersion}} - Latest available version
{{.Alert.BehindBy}} - Number (days, versions, etc.)
{{.Alert.ArtifactName}} - Upstream artifact name
{{.Alert.RepositoryURL}} - Scrape job repository URL
{{.Alert.TargetFile}} - Scrape job target file
Event-Specific Variables:
| Variable | Available In | Description |
|---|---|---|
{{.AcknowledgeURL}} | new_alert only | Callback URL for one-click acknowledgment |
{{.AcknowledgedBy}} | acknowledged only | Email/identifier of acknowledging user |
{{.AcknowledgedAt}} | acknowledged only | ISO8601 timestamp of acknowledgment |
{{.IsAcknowledged}} | acknowledged only | true for acknowledged, false for unacknowledged |
{{.ResolvedAt}} | resolved only | ISO8601 timestamp of resolution |
{{.PreviousSeverity}} | new_alert (escalated) | Previous severity before escalation |
Template Functions:
| Function | Description | Example |
|---|---|---|
upper | Uppercase string | {{.Alert.Severity | upper}} → CRITICAL |
lower | Lowercase string | {{.Event | lower}} → alert.created |
json | JSON encode value | {{.Alert | json}} → {"id":1,...} |
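These helpers map naturally onto Go's text/template with a FuncMap; the wiring below is an assumption about how the renderer is built, but the pipeline syntax matches the variables and functions documented above.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"strings"
	"text/template"
)

// render parses and executes an event template with the upper/lower/json
// helper functions registered. Assumed sketch, not the notifier's real code.
func render(tpl string, data any) (string, error) {
	t, err := template.New("webhook").Funcs(template.FuncMap{
		"upper": strings.ToUpper,
		"lower": strings.ToLower,
		"json": func(v any) (string, error) {
			b, err := json.Marshal(v)
			return string(b), err
		},
	}).Parse(tpl)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := t.Execute(&buf, data); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	data := map[string]any{
		"Alert": map[string]any{"Severity": "critical", "ConfigName": "K8s Dashboard"},
	}
	out, err := render(`{"content": "{{.Alert.Severity | upper}}: {{.Alert.ConfigName}}"}`, data)
	fmt.Println(out, err)
}
```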
9.2.3 Platform-Specific Examples
Generic JSON Webhook (default format when no template configured):
The system sends a structured JSON payload by default:
{
"idempotency_key": "uuid-here",
"event": "alert.created",
"timestamp": "2024-01-15T10:30:00Z",
"alert": {
"id": 123,
"config_name": "My Alert Config",
"rule_name": "Critical Updates",
"severity": "critical",
"discovered_version": "1.0.0",
"latest_version": "2.0.0",
"behind_by": 30
},
"acknowledge_url": "https://app.example.com/api/v1/webhooks/acknowledge/token"
}
Discord Webhook:
Discord requires a content field. Configure these channel-specific templates:
New Alert Template:
{"content": "🚨 **{{.Alert.Severity | upper}} Alert**: {{.Alert.ConfigName}}\n\n**Artifact:** {{.Alert.ArtifactName}}\n**Current:** {{.Alert.DiscoveredVersion}} → **Latest:** {{.Alert.LatestVersion}}\n**Behind by:** {{.Alert.BehindBy}} {{.Alert.RuleType}}\n\n[Acknowledge]({{.AcknowledgeURL}})"}
Acknowledged Template:
{"content": "{{if .IsAcknowledged}}✅ **Acknowledged**{{else}}🔄 **Unacknowledged**{{end}}: {{.Alert.ConfigName}} - {{.Alert.ArtifactName}}{{if .IsAcknowledged}}\n\nAcknowledged by {{.AcknowledgedBy}}{{end}}"}
Resolved Template:
{"content": "🎉 **Resolved**: {{.Alert.ConfigName}} - {{.Alert.ArtifactName}}\n\nThe version has been updated and no longer triggers this alert."}
Slack Webhook:
Slack uses a text field for simple messages:
New Alert Template:
{"text": ":rotating_light: *{{.Alert.Severity | upper}}*: {{.Alert.ConfigName}}\nArtifact: {{.Alert.ArtifactName}}\nVersion {{.Alert.DiscoveredVersion}} is {{.Alert.BehindBy}} behind latest ({{.Alert.LatestVersion}})\n<{{.AcknowledgeURL}}|Click to Acknowledge>"}
Acknowledged Template:
{"text": "{{if .IsAcknowledged}}:white_check_mark: *Acknowledged*{{else}}:arrows_counterclockwise: *Unacknowledged*{{end}}: {{.Alert.ConfigName}} - {{.Alert.ArtifactName}}{{if .IsAcknowledged}}\nAcknowledged by {{.AcknowledgedBy}}{{end}}"}
Resolved Template:
{"text": ":tada: *Resolved*: {{.Alert.ConfigName}} - {{.Alert.ArtifactName}}\nThe alert has been automatically resolved."}
Slack Block Kit (rich formatting):
For more sophisticated Slack messages, use Block Kit:
New Alert Template:
{
"blocks": [
{
"type": "header",
"text": {"type": "plain_text", "text": "🚨 {{.Alert.Severity | upper}} Alert"}
},
{
"type": "section",
"fields": [
{"type": "mrkdwn", "text": "*Config:*\n{{.Alert.ConfigName}}"},
{"type": "mrkdwn", "text": "*Artifact:*\n{{.Alert.ArtifactName}}"},
{"type": "mrkdwn", "text": "*Current:*\n{{.Alert.DiscoveredVersion}}"},
{"type": "mrkdwn", "text": "*Latest:*\n{{.Alert.LatestVersion}}"}
]
},
{
"type": "actions",
"elements": [
{
"type": "button",
"text": {"type": "plain_text", "text": "Acknowledge"},
"url": "{{.AcknowledgeURL}}",
"style": "primary"
}
]
}
]
}
Microsoft Teams (Adaptive Cards):
New Alert Template:
{
"@type": "MessageCard",
"themeColor": "FF0000",
"title": "{{.Alert.Severity | upper}} Alert: {{.Alert.ConfigName}}",
"sections": [{
"facts": [
{"name": "Artifact", "value": "{{.Alert.ArtifactName}}"},
{"name": "Current Version", "value": "{{.Alert.DiscoveredVersion}}"},
{"name": "Latest Version", "value": "{{.Alert.LatestVersion}}"},
{"name": "Behind By", "value": "{{.Alert.BehindBy}}"}
]
}],
"potentialAction": [{
"@type": "OpenUri",
"name": "Acknowledge",
"targets": [{"os": "default", "uri": "{{.AcknowledgeURL}}"}]
}]
}
PagerDuty Events API v2:
New Alert Template:
{
"routing_key": "YOUR_INTEGRATION_KEY",
"event_action": "trigger",
"dedup_key": "{{.IdempotencyKey}}",
"payload": {
"summary": "{{.Alert.Severity | upper}}: {{.Alert.ConfigName}} - {{.Alert.ArtifactName}} is {{.Alert.BehindBy}} behind",
"source": "planekeeper",
"severity": "{{.Alert.Severity}}",
"custom_details": {
"discovered_version": "{{.Alert.DiscoveredVersion}}",
"latest_version": "{{.Alert.LatestVersion}}",
"repository": "{{.Alert.RepositoryURL}}",
"target_file": "{{.Alert.TargetFile}}"
}
},
"links": [{"href": "{{.AcknowledgeURL}}", "text": "Acknowledge in Planekeeper"}]
}
Resolved Template:
{
"routing_key": "YOUR_INTEGRATION_KEY",
"event_action": "resolve",
"dedup_key": "{{.IdempotencyKey}}"
}
9.2.4 Configuring Templates
Via UI:
- Navigate to Notification Channels → Edit channel
- Check “Use Event-Specific Templates”
- Enter templates for each event category (New Alert, Acknowledged, Resolved)
- Leave empty to inherit from organization/global defaults
Via API:
# Create channel with event-specific templates
curl -X POST /api/v1/client/notification-channels \
-H "X-API-Key: pk_..." \
-d '{
"name": "Discord Alerts",
"channel_type": "webhook",
"config": {
"url": "https://discord.com/api/webhooks/...",
"event_templates": {
"new_alert": "{\"content\": \"🚨 **{{.Alert.Severity}}**: {{.Alert.ConfigName}}\"}",
"acknowledged": "{\"content\": \"✅ Acknowledged: {{.Alert.ConfigName}}\"}",
"resolved": "{\"content\": \"🎉 Resolved: {{.Alert.ConfigName}}\"}"
}
}
}'
Organization-Level Defaults:
Set organization-wide templates via the settings API:
# Set org default for new alerts
curl -X PUT /api/v1/client/settings/notification.template.alert.new \
-H "X-API-Key: pk_..." \
-d '{"value": "{\"content\": \"🚨 {{.Alert.Severity}}: {{.Alert.ConfigName}}\"}"}'
9.2.5 Template Best Practices
Use channel-specific templates for non-standard webhooks: Discord, Slack, Teams, and PagerDuty all have specific payload formats. Configure these at the channel level.
Keep global defaults generic: The default templates produce standard JSON suitable for custom integrations. Don’t change these unless you want all organizations to use a specific format.
Test templates before enabling: Use the channel test endpoint to verify your template produces valid output for the target platform.
Include relevant context per event type:
- New alerts: Include acknowledge URL, version details, severity
- Acknowledged: Include who acknowledged and when
- Resolved: Keep it simple - the alert is no longer actionable
Escape special characters: JSON strings require escaping. Use \n for newlines and \" for quotes within strings.
Use template functions: upper, lower, and json help format output appropriately for different platforms.
9.3 Notification Rules
Table: notification_rules
Rules define routing logic: which events go to which channels based on severity and event type.
| Field | Description |
|---|---|
id | Unique identifier |
organization_id | Owning organization |
name | Human-readable name |
severity_filter | Array of severities to match (empty = all) |
event_filter | Array of event types to match (empty = all) |
channel_id | Override channel (NULL = use org default) |
group_interval | Group alerts within this window |
repeat_interval | Don’t repeat for same alert within this window |
is_active | Whether rule is enabled |
priority | Higher priority rules evaluated first |
Event Types:
| Event | Description |
|---|---|
alert.created | New violation detected |
alert.escalated | Severity increased |
alert.acknowledged | Alert marked as acknowledged |
alert.unacknowledged | Acknowledgment reset (re-violation) |
alert.resolved | Version updated, no longer violates |
Rule Evaluation:
1. Get active rules for org, ordered by priority DESC
│
▼
2. For each rule:
├── Check severity_filter (empty = match all)
├── Check event_filter (empty = match all)
│
└── If match:
├── Check group/repeat intervals (prevent spam)
├── Get channel (rule override or org default)
└── Create delivery record
│
▼
3. Deduplicate channels (same alert → same channel only once)
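The filter check in step 2 can be sketched as below; the Rule struct is a simplified stand-in for the notification_rules row, keeping only the two filter columns.

```go
package main

import "fmt"

// Rule is a hypothetical, trimmed-down view of a notification_rules row.
type Rule struct {
	SeverityFilter []string // empty = match all severities
	EventFilter    []string // empty = match all event types
}

func contains(xs []string, x string) bool {
	for _, v := range xs {
		if v == x {
			return true
		}
	}
	return false
}

// matches implements the semantics from step 2 above: an empty filter
// array matches everything; a non-empty one must contain the value.
func matches(r Rule, severity, event string) bool {
	if len(r.SeverityFilter) > 0 && !contains(r.SeverityFilter, severity) {
		return false
	}
	if len(r.EventFilter) > 0 && !contains(r.EventFilter, event) {
		return false
	}
	return true
}

func main() {
	r := Rule{SeverityFilter: []string{"critical", "high"}}
	fmt.Println(matches(r, "critical", "alert.created")) // severity matches, empty event filter
	fmt.Println(matches(r, "moderate", "alert.created")) // severity filtered out
}
```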
9.4 Delivery Lifecycle
Table: notification_deliveries
Tracks the state and history of each notification delivery attempt.
┌─────────────────────────────────────────────────────────────────┐
│ DELIVERY LIFECYCLE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────┐ ┌─────────────┐ ┌───────────┐ │
│ │ PENDING │───▶│ IN_PROGRESS │───▶│ SUCCEEDED │ │
│ └─────────┘ └─────────────┘ └───────────┘ │
│ │ │ │
│ │ │ Error/timeout │
│ │ ▼ │
│ │ ┌──────────┐ │
│ │ │ FAILED │◀─────────────────┐ │
│ │ └──────────┘ │ │
│ │ │ │ │
│ │ │ Retry (attempts < max)│ │
│ │ ▼ │ │
│ │ ┌─────────────┐ │ │
│ │ │ IN_PROGRESS │───────────────┘ │
│ │ └─────────────┘ │
│ │ │ │
│ │ │ Max attempts exceeded │
│ │ ▼ │
│ │ ┌─────────────┐ │
│ └────────▶│ DEAD_LETTER │ │
│ └─────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
Status Transitions:
| From | To | Trigger |
|---|---|---|
| - | pending | Dispatcher creates delivery |
| pending | in_progress | Notifier claims delivery (SKIP LOCKED) |
| in_progress | succeeded | 2xx HTTP response |
| in_progress | failed | Error or non-2xx response (retries remain) |
| failed | in_progress | Retry timer expires |
| failed | dead_letter | Max attempts exceeded or 24h TTL |
9.5 Retry Logic
Multi-tier Exponential Backoff:
| Tier | Attempts | Delays | Use Case |
|---|---|---|---|
| Short-term | 1-4 | 10s, 30s, 1m, 5m | Transient errors |
| Mid-term | 5-8 | 15m, 30m, 1h, 2h | Service degradation |
| Long-term | 9-12 | 4h, 4h, 4h, 4h | Extended outage |
Total TTL: ~24 hours from first attempt → dead_letter
Jitter: Full jitter applied (delay = random(0, computedDelay)) to prevent thundering herd.
Retry-After Header: Honored from 429/503 responses.
Non-retryable Errors:
- 4xx responses (except 429) → immediate dead_letter
- Invalid configuration → immediate dead_letter
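The tiered schedule plus full jitter can be sketched like this; the table of base delays is taken from above, while the function shape is an assumption about the notifier's internals.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// baseDelays mirrors the three retry tiers above (attempts 1-12).
var baseDelays = []time.Duration{
	10 * time.Second, 30 * time.Second, time.Minute, 5 * time.Minute, // short-term
	15 * time.Minute, 30 * time.Minute, time.Hour, 2 * time.Hour, // mid-term
	4 * time.Hour, 4 * time.Hour, 4 * time.Hour, 4 * time.Hour, // long-term
}

// nextDelay applies full jitter: delay = random(0, computedDelay),
// which prevents the thundering-herd effect on a recovering endpoint.
func nextDelay(attempt int) time.Duration {
	if attempt < 1 || attempt > len(baseDelays) {
		return 0 // past the schedule: caller should dead-letter, not retry
	}
	base := baseDelays[attempt-1]
	return time.Duration(rand.Int63n(int64(base) + 1))
}

func main() {
	for attempt := 1; attempt <= 12; attempt++ {
		fmt.Printf("attempt %2d: base %v, jittered %v\n",
			attempt, baseDelays[attempt-1], nextDelay(attempt))
	}
}
```

Note the Retry-After handling described above would override this computed delay when a 429/503 response supplies one.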
9.6 Webhook Payload
Default JSON Payload:
{
"idempotency_key": "550e8400-e29b-41d4-a716-446655440000",
"event": "alert.created",
"timestamp": "2026-02-04T12:00:00Z",
"alert": {
"id": 123,
"config_name": "K8s Dashboard Version Check",
"rule_name": "Days Behind",
"severity": "critical",
"discovered_version": "1.25.0",
"latest_version": "1.30.0",
"behind_by": 45,
"artifact_name": "kubernetes/kubernetes",
"repository_url": "https://github.com/org/repo",
"target_file": "chart/Chart.yaml"
},
"acknowledge_url": "https://planekeeper.example.com/api/v1/webhooks/acknowledge/{token}"
}
Headers:
- Content-Type: application/json
- X-Planekeeper-Signature: sha256={hmac} (if secret configured)
- X-Planekeeper-Timestamp: {unix_seconds}
- X-Planekeeper-Event: alert.created
- X-Planekeeper-Idempotency-Key: {uuid}
- Custom headers from channel config
HMAC Signature: HMAC-SHA256(secret, timestamp + "." + body)
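The signature scheme above maps directly onto the stdlib. The signing side is given by the formula; the verify helper is what a receiving webhook would plausibly run (constant-time comparison via hmac.Equal).

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// sign computes HMAC-SHA256(secret, timestamp + "." + body) and formats it
// as the X-Planekeeper-Signature header value.
func sign(secret, timestamp, body string) string {
	mac := hmac.New(sha256.New, []byte(secret))
	mac.Write([]byte(timestamp + "." + body))
	return "sha256=" + hex.EncodeToString(mac.Sum(nil))
}

// verify recomputes the signature and compares in constant time.
func verify(secret, timestamp, body, header string) bool {
	return hmac.Equal([]byte(sign(secret, timestamp, body)), []byte(header))
}

func main() {
	sig := sign("hmac-signing-secret", "1738665600", `{"event":"alert.created"}`)
	fmt.Println(sig)
	fmt.Println(verify("hmac-signing-secret", "1738665600", `{"event":"alert.created"}`, sig))
}
```

A receiver should also reject stale X-Planekeeper-Timestamp values to limit replay windows.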
9.7 Inbound Acknowledgment
Endpoint: POST /api/v1/webhooks/acknowledge/{token}
External systems can acknowledge alerts by calling the acknowledge_url included in the webhook payload.
Token Properties:
- Generated per delivery
- Stored in the notification_ack_tokens table
- Expires after 24 hours (configurable)
- Single-use (marked as used after acknowledgment)
Flow:
External System Planekeeper
│ │
│ POST /webhooks/acknowledge/xyz │
│────────────────────────────────▶│
│ │ Lookup token
│ │ Validate not expired
│ │ Validate not already used
│ │ Mark alert as acknowledged
│ │ Mark token as used
│ 200 OK │
│◀────────────────────────────────│
9.8 SSRF Protection
Webhook URLs are validated to prevent Server-Side Request Forgery:
Blocked by default:
- Private IP ranges (RFC 1918): 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16
- Localhost: 127.0.0.0/8, ::1
- Link-local: 169.254.0.0/16, fe80::/10
Allowed schemes: https:// (default), optionally http://
Environment variable: NOTIFICATION_ALLOW_PRIVATE_URLS=false (default)
9.9 Housekeeping
The notifier service runs periodic cleanup tasks:
| Task | Interval | Description |
|---|---|---|
| Expire ack tokens | 1h | Delete expired tokens from notification_ack_tokens |
| Cleanup expired pending | 1h | Move stuck deliveries to dead_letter after 24h |
| Purge old succeeded | daily | Delete succeeded deliveries older than 30 days |
9.10 API Endpoints
Notification Channels:
| Method | Path | Description |
|---|---|---|
| GET | /notification-channels | List org’s channels |
| POST | /notification-channels | Create channel |
| GET | /notification-channels/{id} | Get channel |
| PUT | /notification-channels/{id} | Update channel |
| DELETE | /notification-channels/{id} | Delete channel |
| POST | /notification-channels/{id}/test | Test channel connectivity |
| POST | /notification-channels/{id}/toggle | Toggle active state |
| GET | /notification-channels/{id}/stats | Get delivery statistics |
9.11 Channel Test Endpoint
The test endpoint (POST /notification-channels/{id}/test) performs comprehensive validation and sends a test notification.
Test Sequence:
- Config Validation: Validate webhook URL and template syntax
- Connectivity Check: Verify URL is reachable (optional HEAD request)
- Sample Delivery: Send actual test payload to webhook
- Record Results: Store test timestamp and success/failure
Response Structure (NotificationChannelTestResult):
{
"success": true,
"tested_at": "2026-02-05T12:00:00Z",
"idempotency_key": "test-550e8400-e29b-41d4-a716-446655440000",
"error": null,
"validation_errors": [],
"connectivity_check": {
"status": 200,
"latency_ms": 150,
"error": null
},
"sample_delivery": {
"status": 200,
"latency_ms": 450,
"response_preview": "OK",
"error": null
}
}
Error Response Examples:
Validation failure:
{
"success": false,
"tested_at": "2026-02-05T12:00:00Z",
"error": "Configuration validation failed",
"validation_errors": ["Invalid webhook URL: private IP addresses not allowed"]
}
Delivery failure:
{
"success": false,
"tested_at": "2026-02-05T12:00:00Z",
"error": "Test delivery failed",
"sample_delivery": {
"status": 400,
"latency_ms": 250,
"response_preview": "{\"message\":\"Invalid content type\"}",
"error": "webhook returned 400"
}
}
9.12 UI Error Handling Pattern
The clientui uses a standardized pattern for surfacing detailed API errors to users.
Error Flow:
┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ API │───▶│ Services │───▶│ Handler │───▶│ UI │
│ Response │ │ Layer │ │ Formatter │ │ Redirect │
└─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘
│ │ │ │
│ Detailed JSON │ Extract fields │ Build message │ URL-encoded
│ with nested │ (status, error, │ with context │ query param
│ error objects │ preview, etc.) │ (HTTP codes, │ (?error=...)
│ │ │ response text) │
Services Layer (internal/services/api_client.go):
The services layer extracts all error details from API responses:
type NotificationChannelTestResult struct {
Success bool
ErrorMessage *string
TestedAt *time.Time
ValidationErrors []string
// Connectivity check results
ConnectivityStatus *int
ConnectivityError *string
ConnectivityLatency *int64
// Sample delivery results
DeliveryStatus *int
DeliveryError *string
DeliveryLatency *int64
ResponsePreview *string
}
Handler Formatter (internal/handlers/notification_channels.go):
Handlers format user-friendly error messages from detailed results:
func formatTestErrorMessage(result *NotificationChannelTestResult) string {
var parts []string
// Check validation errors first
if len(result.ValidationErrors) > 0 {
parts = append(parts, "Validation errors: "+result.ValidationErrors[0])
}
// Check delivery issues (most common)
if result.DeliveryStatus != nil && *result.DeliveryStatus >= 400 {
msg := "Delivery failed with HTTP " + strconv.Itoa(*result.DeliveryStatus)
if result.ResponsePreview != nil {
msg += " - " + truncate(*result.ResponsePreview, 100)
}
parts = append(parts, msg)
}
// URL-encode for redirect
return urlEncode(strings.Join(parts, "; "))
}
UI Display:
Errors are passed via URL query parameters and displayed in the page template:
/notification-channels/1?error=Delivery+failed+with+HTTP+400+-+%7B%22message%22%3A%22Invalid+payload%22%7D
The template renders this as a styled error banner showing:
Delivery failed with HTTP 400 - {"message":"Invalid payload"}
Key Principles:
- Preserve Context: Pass HTTP status codes, response bodies, and specific error types through all layers
- Prioritize Actionable Info: Show validation errors first, then HTTP status, then generic errors
- Truncate for Safety: Limit response previews to prevent URL length issues
- URL-Safe Encoding: Properly encode error messages for query parameter use
Notification Rules:
| Method | Path | Description |
|---|---|---|
| GET | /notification-rules | List org’s rules |
| POST | /notification-rules | Create rule |
| GET | /notification-rules/{id} | Get rule |
| PUT | /notification-rules/{id} | Update rule |
| DELETE | /notification-rules/{id} | Delete rule |
| POST | /notification-rules/{id}/toggle | Toggle active state |
Delivery History:
| Method | Path | Description |
|---|---|---|
| GET | /notification-deliveries | List deliveries (filterable) |
| GET | /notification-deliveries/dead | List dead letters |
| POST | /notification-deliveries/{id}/retry | Retry a dead letter |
| GET | /alerts/{id}/deliveries | Deliveries for specific alert |
9.13 Notifier Service
The notifier binary (service_id=7) is a standalone worker that:
- Polls for pending/failed deliveries ready for retry
- Claims deliveries using FOR UPDATE SKIP LOCKED (distributed locking)
- Sends webhooks to configured URLs
- Updates delivery status based on response
- Runs housekeeping tasks periodically
Scaling: Run multiple replicas for horizontal scaling. Each replica claims different deliveries without coordination.
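The coordination-free claiming relies on PostgreSQL's SKIP LOCKED. A sketch of what such a claim query could look like follows; the table and column names here are illustrative, not the project's actual schema:

```go
package main

import (
	"fmt"
	"strings"
)

// claimQuery sketches the SKIP LOCKED claim pattern. Rows locked by one
// replica are silently skipped by the others, so each poll claims a
// disjoint batch without explicit coordination.
const claimQuery = `
UPDATE notification_deliveries
SET status = 'in_progress'
WHERE id IN (
    SELECT id FROM notification_deliveries
    WHERE status IN ('pending', 'failed')
      AND next_retry_at <= now()
    ORDER BY next_retry_at
    LIMIT $1               -- NOTIFICATION_BATCH_SIZE
    FOR UPDATE SKIP LOCKED -- skip rows another replica already claimed
)
RETURNING id`

// usesSkipLocked is a trivial sanity check used in the example below.
func usesSkipLocked(q string) bool {
	return strings.Contains(q, "FOR UPDATE SKIP LOCKED")
}

func main() {
	fmt.Println(usesSkipLocked(claimQuery)) // true
}
```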
Configuration:
| Variable | Description | Default |
|---|---|---|
| NOTIFICATION_BATCH_SIZE | Deliveries to claim per poll | 100 |
| NOTIFICATION_POLL_INTERVAL | How often to check for work | 5s |
| NOTIFICATION_MAX_RETRIES | Max attempts before dead letter | 12 |
| NOTIFICATION_ACK_TOKEN_EXPIRY | Token expiry duration | 24h |
| NOTIFICATION_BASE_URL | Base URL for ack callbacks | - |
10. Multi-Tenancy Model
Files: pkg/api/middleware.go, pkg/api/middleware_auth.go, pkg/storage/queries/gather_jobs.sql, pkg/storage/queries/organization_members.sql
10.1 Authentication Methods
The system supports two authentication methods:
| Method | Used By | How It Works |
|---|---|---|
| API Key | Agents, machines, InternalUI | X-API-Key header or planekeeper_api_key cookie |
| Supabase JWT | Human users (ClientUI) | Authorization: Bearer header + X-Organization-Id header |
Dual Auth Middleware (pkg/api/middleware_auth.go):
The API server tries JWT auth first, then falls back to API key auth. When Supabase is not configured (no SUPABASE_JWT_SECRET), only API key auth is available.
Incoming Request
│
├── Has "Authorization: Bearer" header?
│ ├── YES → Validate JWT → Lookup user by supabase_id
│ │ → Read X-Organization-Id → Verify membership → Allow
│ └── NO → Fall through
│
├── Has "X-API-Key" header or cookie?
│ ├── YES → Validate API key → Extract org_id → Allow
│ └── NO → 401 Unauthorized
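The fall-through logic can be reduced to a small decision function. This is a sketch of the routing decision only (token validation, user lookup, and membership checks are omitted, and the function name is illustrative):

```go
package main

import (
	"fmt"
	"strings"
)

// authMethod picks the auth path for a request, mirroring the dual-auth
// fall-through above: JWT first, then API key, else 401.
func authMethod(authorization, apiKey string) string {
	if strings.HasPrefix(authorization, "Bearer ") {
		return "jwt" // validate JWT, look up user, verify X-Organization-Id membership
	}
	if apiKey != "" {
		return "api_key" // validate key, derive org from api_keys.organization_id
	}
	return "unauthorized" // 401
}

func main() {
	fmt.Println(authMethod("Bearer eyJ...", "")) // jwt
	fmt.Println(authMethod("", "pk_1_secret"))   // api_key
	fmt.Println(authMethod("", ""))              // unauthorized
}
```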
10.2 Organization Scoping
All API requests are scoped to an organization:
- API Key auth: Organization derived from api_keys.organization_id in the database
- JWT auth: Organization specified by the X-Organization-Id header, validated against the user's membership
organization_id stored in request context (Locals)
│
▼
All queries filter by organization_id
10.3 Multi-Organization Membership
Users can belong to multiple organizations via the organization_members join table.
Schema:
| Table | Key Columns | Purpose |
|---|---|---|
| organization_members | user_id, organization_id, role | Membership records |
| organization_invites | email, organization_id, role, token | Pending invitations |
Roles (org_role enum):
| Role | Capabilities |
|---|---|
| owner | Full control, manage members |
| admin | Manage resources, invite members |
| member | Read/write org resources |
Membership Flow:
User signs up (Supabase) → is_approved = FALSE → Pending approval page
│
▼
Admin approves (SQL: UPDATE users SET is_approved = TRUE WHERE email = '...')
│
▼
User logs out and back in → No memberships → Onboarding page
│
├── Create new organization → Owner membership created
│
└── Accept pending invite → Member/admin membership created
10.8 Beta User Approval Gating
Files: pkg/storage/migration/sql/023_user_approval.sql, pkg/auth/middleware.go, internal/handlers/auth.go
New user signups are blocked until an administrator manually approves them. This is enforced via the is_approved column on the users table.
How it works:
| Scenario | is_approved | Behavior |
|---|---|---|
| Existing users (pre-migration) | TRUE (default) | Unaffected |
| System users | TRUE (default) | Unaffected |
| New Supabase signups | FALSE (explicit) | Redirected to pending approval page |
| Admin-approved users | TRUE (manual SQL) | Normal login flow |
Enforcement points (defense-in-depth):
- processLogin (primary): After finding/creating the user, checks is_approved. If false, redirects to /pending-approval instead of checking org membership.
- RequireOnboarded middleware (secondary): Checks session.IsApproved and redirects to /pending-approval. Prevents bypassing via direct URL access.
Session caching: The is_approved value is stored in the encrypted session cookie (SessionData.IsApproved). Users must log out and back in after being approved for the change to take effect.
Admin approval (SQL):
-- Approve a user
UPDATE users SET is_approved = TRUE WHERE email = 'user@example.com';
-- List unapproved users
SELECT id, email, created_at FROM users WHERE is_approved = FALSE AND is_system = FALSE;
10.4 Organization Switching (ClientUI)
Users with multiple org memberships can switch between them. The active organization is stored in an HTTP-only cookie (planekeeper_org).
POST /switch-org (form: org_id)
│
▼
Validate user is a member of target org
│
▼
Update planekeeper_org cookie
│
▼
Redirect to dashboard (now showing new org's data)
The sidebar displays an organization dropdown when the user belongs to multiple orgs.
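The validation step behind POST /switch-org is essentially a membership check. A minimal sketch (the function name is hypothetical; the real handler also reads the form, sets the cookie, and redirects):

```go
package main

import "fmt"

// canSwitchTo reports whether the user may activate the target org:
// the membership check performed before updating the planekeeper_org cookie.
func canSwitchTo(memberships []int64, target int64) bool {
	for _, orgID := range memberships {
		if orgID == target {
			return true
		}
	}
	return false
}

func main() {
	memberships := []int64{10, 42}
	fmt.Println(canSwitchTo(memberships, 42)) // true → update cookie, redirect to dashboard
	fmt.Println(canSwitchTo(memberships, 99)) // false → reject, keep current org
}
```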
10.5 Scope Types
| Scope | organization_id | is_global | Use Case |
|---|---|---|---|
| Organization | Valid (e.g., 123) | false | Tenant-specific resources |
| Global | NULL | true | Shared across all orgs |
10.6 System API Keys
Identification: organization_id = 0 in auth context (NULL in database).
Capabilities:
- Create global resources (is_global = true)
- Access cross-organization data
- Manage system settings
Creation Flow:
if isSystemKey {
orgIDParam = pgtype.Int8{} // NULL
isGlobal = pgtype.Bool{Bool: true, Valid: true}
}
10.7 List Query Scopes
Most list endpoints support a scope query parameter:
| Scope | Query Filter |
|---|---|
| org | WHERE organization_id = @org_id AND is_global = FALSE |
| global | WHERE is_global = TRUE |
| all (default) | WHERE organization_id = @org_id OR is_global = TRUE |
11. Agent Communication
Files: pkg/agent/worker.go, pkg/api/agents_handler.go
Heartbeat Protocol
Endpoint: POST /heartbeat/{AgentUUID}
Request Body (optional):
{
"capabilities": ["gather", "scrape", "helm_sync"],
"available_credentials": ["github_token", "ssh_key"]
}
Response:
{
"poll_interval_seconds": 30,
"rate_limit_max_requests": 100,
"rate_limit_window_seconds": 60
}
Server-Side:
- Upserts agent into the service_instances table
- Stores metadata (build_date, capabilities, credentials) as JSON
- Updates the last_heartbeat timestamp
Capability Declaration
Agents declare supported job types during heartbeat:
| Capability | Job Type |
|---|---|
| gather | Fetch upstream releases |
| scrape | Extract deployed versions |
| helm_sync | Discover Helm charts |
Credential Declaration
Agents declare available credentials for credential-aware job assignment:
// Worker.GetAvailableCredentials()
availableCredentials := w.GetAvailableCredentials()
task, err := w.client.PollTaskWithCredentials(ctx, jobTypes, availableCredentials)
Orphan Cleanup
Service: OrphanCleanupService (runs every 2 minutes)
Logic:
- Find jobs claimed by agents not in service_instances
- Reset those jobs to pending status
- Additional: Reset jobs stuck in in_progress for >1 hour (stale detection)
-- ResetOrphanedGatherJobs
UPDATE gather_jobs
SET status = 'pending', claimed_by = NULL
WHERE claimed_by NOT IN (
SELECT instance_uuid FROM service_instances WHERE service_id = 2
)
12. Metrics API
Files: pkg/api/metrics_handler.go, pkg/storage/queries/system_metrics.sql
Purpose
Expose system-wide operational metrics for monitoring and observability. The metrics endpoint provides a comprehensive view of system health across all organizations, including job status, alert state, service health, and task execution performance. Supports both JSON format for API consumers and Prometheus text format for integration with monitoring systems.
Endpoint
GET /api/v1/internal/metrics
Output Format
| Request | Response Format | Content-Type |
|---|---|---|
| /metrics (default) | Prometheus exposition format | text/plain; version=0.0.4; charset=utf-8 |
| /metrics?format=json | JSON | application/json |
Default: Prometheus text format for easy integration with Prometheus scrapers.
Use ?format=json query parameter for programmatic access via JSON.
Authentication
No authentication required. This endpoint is only exposed via the internal Traefik reverse proxy (port 8443), which is restricted to trusted IPs by the hosting provider’s firewall. Security is provided by network-level access control rather than API key authentication.
Metric Categories
Organization Metrics
| Metric | Description |
|---|---|
| organizations.total | Total number of organizations |
| organizations.active | Number of active organizations |
Service Instance Metrics
| Metric | Description |
|---|---|
| services[].service_name | Name of the service (server, agent, taskengine, etc.) |
| services[].total | Total instances of this service |
| services[].healthy | Instances with heartbeat in last 5 minutes |
| services[].unhealthy | Instances without recent heartbeat |
Health Threshold: 5 minutes since last heartbeat.
Agent Metrics
| Metric | Description |
|---|---|
| agents.total | Total registered agents (system-wide) |
| agents.healthy | Agents with heartbeat in last 5 minutes |
| agents.unhealthy | Agents without recent heartbeat |
Job Status Metrics
| Metric | Description |
|---|---|
| jobs.gather | Count of gather jobs by status (pending, in_progress, completed, failed) |
| jobs.scrape | Count of scrape jobs by status (pending, in_progress, completed, failed) |
| jobs.helm_sync | Count of helm sync jobs by status (pending, in_progress, completed, failed) |
Scope: Includes all jobs across all organizations.
Alert Metrics
| Metric | Description |
|---|---|
| alerts.total | Total alert count (system-wide) |
| alerts.unacknowledged | Alerts requiring attention |
| alerts.by_severity.critical | Critical severity alerts (unacknowledged) |
| alerts.by_severity.high | High severity alerts (unacknowledged) |
| alerts.by_severity.moderate | Moderate severity alerts (unacknowledged) |
Release Metrics
| Metric | Description |
|---|---|
| releases.total | Total tracked upstream releases |
| releases.stable | Non-prerelease versions |
| releases.prerelease | Prerelease versions (alpha, beta, rc, etc.) |
| releases.unique_artifacts | Distinct artifact names being tracked |
Task Execution Metrics
| Metric | Description |
|---|---|
| task_executions.total | Total task executions in last 24 hours |
| task_executions.completed | Successful completions |
| task_executions.failed | Failed executions |
| task_executions.in_progress | Currently running |
| task_executions.success_rate | Completion rate (0-1) |
Time Window: Last 24 hours only.
API Key Metrics
| Metric | Description |
|---|---|
| api_keys.total | Total number of API keys |
| api_keys.active | Number of active API keys |
| api_keys.system | Number of system API keys |
Prometheus Metric Names
Metrics follow Prometheus naming conventions and carry no org_id labels (they are system-wide). All use the gauge type except the audit writer counters noted below:
| Prometheus Metric | Labels | Description |
|---|---|---|
| planekeeper_organizations | - | Total organizations |
| planekeeper_organizations_active | - | Active organizations |
| planekeeper_service_instances | service, status | Service instances by type and health |
| planekeeper_agents | - | Total agents |
| planekeeper_agents_healthy | - | Healthy agents |
| planekeeper_agents_unhealthy | - | Unhealthy agents |
| planekeeper_jobs | type, status | Jobs by type and status |
| planekeeper_alerts | - | Total alerts |
| planekeeper_alerts_unacknowledged | - | Unacknowledged alerts |
| planekeeper_alerts_by_severity | severity | Alerts by severity |
| planekeeper_releases | - | Total releases |
| planekeeper_releases_stable | - | Stable releases |
| planekeeper_releases_prerelease | - | Prerelease versions |
| planekeeper_releases_unique_artifacts | - | Unique artifacts |
| planekeeper_task_executions_24h | - | Task executions (24h) |
| planekeeper_task_executions_24h_completed | - | Completed tasks (24h) |
| planekeeper_task_executions_24h_failed | - | Failed tasks (24h) |
| planekeeper_task_executions_24h_in_progress | - | In-progress tasks |
| planekeeper_task_executions_24h_success_rate | - | Success rate |
| planekeeper_api_keys | - | Total API keys |
| planekeeper_api_keys_active | - | Active API keys |
| planekeeper_api_keys_system | - | System API keys |
| planekeeper_audit_writer_events_written_total | - | Total audit trail entries written (counter) |
| planekeeper_audit_writer_persist_errors_total | - | Total audit trail write failures (counter) |
| planekeeper_audit_writer_insert_duration_seconds_total | - | Cumulative audit trail insert duration in seconds (counter) |
Naming Convention: Metrics use gauge type without _total suffix (per Prometheus best practices - _total is reserved for counters). Exception: audit writer metrics use counter type with _total suffix since they track cumulative totals.
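The exposition format is simple enough to render with string formatting. A sketch of the per-gauge output (the function name is illustrative; the real handler may build the response differently):

```go
package main

import "fmt"

// gaugeLines renders one gauge in Prometheus exposition format:
// a # HELP line, a # TYPE line, then the sample (with optional labels).
func gaugeLines(name, help string, value float64, labels string) string {
	metric := name
	if labels != "" {
		metric = fmt.Sprintf("%s{%s}", name, labels)
	}
	return fmt.Sprintf("# HELP %s %s\n# TYPE %s gauge\n%s %g\n", name, help, name, metric, value)
}

func main() {
	fmt.Print(gaugeLines("planekeeper_organizations", "Total number of organizations", 5, ""))
	fmt.Print(gaugeLines("planekeeper_jobs", "Jobs by type and status", 5, `type="gather",status="pending"`))
}
```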
Prometheus Scrape Configuration
scrape_configs:
- job_name: 'planekeeper'
static_configs:
- targets: ['localhost:8443'] # Internal Traefik only
metrics_path: '/api/v1/internal/metrics'
# No authentication required - endpoint is on internal network
# Default output is Prometheus format, no params needed
Important: The metrics endpoint is only accessible via the internal Traefik (port 8443). Ensure this port is restricted to trusted IPs via your hosting provider’s firewall.
Testing the Endpoint
# Prometheus format (default)
curl https://localhost:8443/api/v1/internal/metrics
# JSON format
curl https://localhost:8443/api/v1/internal/metrics?format=json
Example Prometheus Output
# HELP planekeeper_organizations Total number of organizations
# TYPE planekeeper_organizations gauge
planekeeper_organizations 5
# HELP planekeeper_organizations_active Number of active organizations
# TYPE planekeeper_organizations_active gauge
planekeeper_organizations_active 4
# HELP planekeeper_service_instances Service instances by type and status
# TYPE planekeeper_service_instances gauge
planekeeper_service_instances{service="server",status="healthy"} 2
planekeeper_service_instances{service="agent",status="healthy"} 3
planekeeper_service_instances{service="agent",status="unhealthy"} 1
planekeeper_service_instances{service="taskengine",status="healthy"} 1
# HELP planekeeper_agents Total number of registered agents
# TYPE planekeeper_agents gauge
planekeeper_agents 4
# HELP planekeeper_agents_healthy Number of healthy agents with recent heartbeat
# TYPE planekeeper_agents_healthy gauge
planekeeper_agents_healthy 3
# HELP planekeeper_agents_unhealthy Number of unhealthy agents without recent heartbeat
# TYPE planekeeper_agents_unhealthy gauge
planekeeper_agents_unhealthy 1
# HELP planekeeper_jobs Jobs by type and status
# TYPE planekeeper_jobs gauge
planekeeper_jobs{type="gather",status="pending"} 5
planekeeper_jobs{type="gather",status="completed"} 150
planekeeper_jobs{type="scrape",status="pending"} 3
planekeeper_jobs{type="scrape",status="completed"} 200
planekeeper_jobs{type="helm_sync",status="completed"} 10
# HELP planekeeper_alerts Total number of alerts
# TYPE planekeeper_alerts gauge
planekeeper_alerts 50
# HELP planekeeper_alerts_unacknowledged Number of unacknowledged alerts
# TYPE planekeeper_alerts_unacknowledged gauge
planekeeper_alerts_unacknowledged 12
# HELP planekeeper_alerts_by_severity Unacknowledged alerts by severity level
# TYPE planekeeper_alerts_by_severity gauge
planekeeper_alerts_by_severity{severity="critical"} 2
planekeeper_alerts_by_severity{severity="high"} 5
planekeeper_alerts_by_severity{severity="moderate"} 5
# HELP planekeeper_releases Total number of tracked upstream releases
# TYPE planekeeper_releases gauge
planekeeper_releases 500
# HELP planekeeper_releases_stable Number of stable (non-prerelease) releases
# TYPE planekeeper_releases_stable gauge
planekeeper_releases_stable 450
# HELP planekeeper_releases_prerelease Number of prerelease versions
# TYPE planekeeper_releases_prerelease gauge
planekeeper_releases_prerelease 50
# HELP planekeeper_releases_unique_artifacts Number of unique artifact names being tracked
# TYPE planekeeper_releases_unique_artifacts gauge
planekeeper_releases_unique_artifacts 25
# HELP planekeeper_task_executions_24h Task executions in last 24 hours
# TYPE planekeeper_task_executions_24h gauge
planekeeper_task_executions_24h 500
# HELP planekeeper_task_executions_24h_completed Completed task executions in last 24 hours
# TYPE planekeeper_task_executions_24h_completed gauge
planekeeper_task_executions_24h_completed 485
# HELP planekeeper_task_executions_24h_failed Failed task executions in last 24 hours
# TYPE planekeeper_task_executions_24h_failed gauge
planekeeper_task_executions_24h_failed 10
# HELP planekeeper_task_executions_24h_in_progress Task executions currently in progress
# TYPE planekeeper_task_executions_24h_in_progress gauge
planekeeper_task_executions_24h_in_progress 5
# HELP planekeeper_task_executions_24h_success_rate Task execution success rate (0-1)
# TYPE planekeeper_task_executions_24h_success_rate gauge
planekeeper_task_executions_24h_success_rate 0.9700
# HELP planekeeper_api_keys Total number of API keys
# TYPE planekeeper_api_keys gauge
planekeeper_api_keys 10
# HELP planekeeper_api_keys_active Number of active API keys
# TYPE planekeeper_api_keys_active gauge
planekeeper_api_keys_active 8
# HELP planekeeper_api_keys_system Number of system API keys
# TYPE planekeeper_api_keys_system gauge
planekeeper_api_keys_system 2
Example JSON Response
{
"collected_at": "2026-02-04T12:00:00Z",
"organizations": {
"total": 5,
"active": 4
},
"services": [
{
"service_name": "server",
"total": 2,
"healthy": 2,
"unhealthy": 0
},
{
"service_name": "agent",
"total": 4,
"healthy": 3,
"unhealthy": 1
},
{
"service_name": "taskengine",
"total": 1,
"healthy": 1,
"unhealthy": 0
}
],
"agents": {
"total": 4,
"healthy": 3,
"unhealthy": 1
},
"jobs": {
"gather": {
"pending": 5,
"in_progress": 2,
"completed": 150,
"failed": 3
},
"scrape": {
"pending": 3,
"in_progress": 1,
"completed": 200,
"failed": 2
},
"helm_sync": {
"pending": 0,
"in_progress": 0,
"completed": 10,
"failed": 0
}
},
"alerts": {
"total": 50,
"unacknowledged": 12,
"by_severity": {
"critical": 2,
"high": 5,
"moderate": 5
}
},
"releases": {
"total": 500,
"stable": 450,
"prerelease": 50,
"unique_artifacts": 25
},
"task_executions": {
"total": 500,
"completed": 485,
"failed": 10,
"in_progress": 5,
"success_rate": 0.97
},
"api_keys": {
"total": 10,
"active": 8,
"system": 2
}
}
Error Handling
| Status | Condition |
|---|---|
| 200 | Success |
| 500 | Database query failure |
Security
The metrics endpoint is unauthenticated but secured through network isolation:
- Internal Traefik Only: The endpoint is registered on the internal API path (/api/v1/internal/metrics) and only exposed via the internal Traefik reverse proxy on port 8443, which is restricted to trusted IPs by the hosting provider's firewall.
- Not Publicly Routed: The public Traefik (dynamic-public.yml) does not include routing rules for /api/v1/internal/metrics.
- Access Methods:
  - Direct (with firewall rule): curl https://<server-ip>:8443/api/v1/internal/metrics
  - From the same host: curl https://localhost:8443/api/v1/internal/metrics
  - Via SSH tunnel: ssh -L 8443:localhost:8443 user@server then curl https://localhost:8443/api/v1/internal/metrics
  - Via VPN: Route traffic to the internal port
Side Effects
None - this is a read-only endpoint.
Additional URLs
- https://o11y.tools/metricslint/
- https://github.com/prometheus/OpenMetrics/blob/main/specification/OpenMetrics.md
13. Developer Tools
API Documentation (Swagger UI)
The server hosts interactive API documentation interfaces using Swagger UI, with separate documentation for client and internal APIs.
Endpoints:
| Endpoint | Description |
|---|---|
| /api/v1/swagger | Client API Swagger UI (org-scoped endpoints) |
| /api/v1/internal/swagger | Internal API Swagger UI (system/agent endpoints) |
| /api/spec/openapi-client.yaml | Client API OpenAPI specification |
| /api/spec/openapi-internal.yaml | Internal API OpenAPI specification |
| /api/spec/openapi-shared.yaml | Shared paths referenced by both specs |
| /api/spec/openapi.yaml | Combined specification (for codegen) |
| /api/spec/components/* | Shared component schema files |
API Separation:
| API | Base Path | Purpose | Endpoints |
|---|---|---|---|
| Client | /api/v1/client | Organization-scoped operations | Jobs, releases, rules, alerts, dropdowns, org settings |
| Internal | /api/v1/internal | System/agent operations | Heartbeat, tasks, metrics, global settings |
Client API includes:
- Job management (gather, scrape, helm-sync)
- Releases and versions viewing
- Rules and alert management
- Alert configurations
- Dropdown data for UI forms
- Organization-specific settings overrides
Internal API includes:
- Agent heartbeat registration
- Task polling and completion (for agents)
- Prometheus metrics endpoint
- Global settings management
Shared Paths (available in both APIs):
- /gather-jobs/*, /scrape-jobs/*, /helm-sync-jobs/*
- /releases/*, /versions
- /settings (GET only in client, GET+PUT in internal)
- /agents, /validate/regex
Usage:
- Start the server: go run ./app/server
- Navigate to http://localhost:3000/api/v1/swagger for client API docs, or http://localhost:3000/api/v1/internal/swagger for internal API docs
- Click "Authorize" and enter your API key (pk_<id>_<secret>)
- Use "Try it out" on any endpoint to execute requests
Implementation (app/server/main.go):
// Serve API specs for Swagger UI
app.Static("/api/spec", "./api")
// Client API Swagger UI at /api/v1/swagger
app.Static("/api/v1/swagger", "./internal/static/swagger-client")
// Internal API Swagger UI at /api/v1/internal/swagger
app.Static("/api/v1/internal/swagger", "./internal/static/swagger-internal")
OpenAPI Spec Structure:
api/
├── openapi.yaml # Master spec for codegen (references all others)
├── openapi-shared.yaml # Shared paths (no duplication)
├── openapi-client.yaml # Client API (references shared + client-only)
├── openapi-internal.yaml # Internal API (references shared + internal-only)
└── components/ # Shared schemas, parameters, responses
├── schemas.yaml
├── parameters.yaml
├── responses.yaml
└── securitySchemes.yaml
Code Generation:
The bazel run //toolchains/oapi-codegen target bundles openapi.yaml (which references all specs) to generate server handlers and client code. The split specs are used only for Swagger UI documentation.
Files:
- internal/static/swagger-client/index.html - Client API Swagger UI
- internal/static/swagger-internal/index.html - Internal API Swagger UI
- api/openapi.yaml - Master specification for codegen
- api/openapi-client.yaml - Client API specification
- api/openapi-internal.yaml - Internal API specification
- api/openapi-shared.yaml - Shared paths (single source of truth)
- api/components/ - Shared schema, parameter, and response definitions
14. Admin UI
Files: internal/handlers/, internal/templates/, internal/services/api_client.go, internal/middleware/
Purpose
The Admin UI provides server-rendered HTML interfaces for managing all Planekeeper resources. Two UI binaries exist — clientui (organization-scoped, public-facing) and internalui (system-scoped, admin-only). Both consume the REST API through an HTTP client wrapper and render pages using templ templates with Tailwind CSS.
Architecture
┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Browser │────▶│ Handler │────▶│ API Client │────▶│ REST API │
│ │ │ (Fiber) │ │ (services) │ │ (Server) │
└──────────────┘ └──────┬───────┘ └──────────────┘ └──────────────┘
│
┌──────┴───────┐
│ templ │
│ Templates │
│ (pages + │
│ components) │
└──────────────┘
The UI never accesses the database directly. All data flows through the REST API via internal/services/api_client.go.
14.1 Authentication
The two UI binaries use different authentication methods:
| UI | Auth Method | Cookie | Description |
|---|---|---|---|
| ClientUI | Supabase Auth (preferred) or API Key (legacy) | planekeeper_session + planekeeper_org | Human users via email/password or OAuth |
| InternalUI | API Key only | planekeeper_api_key | Admin users via system API key |
When Supabase is not configured (SUPABASE_JWT_SECRET not set), ClientUI falls back to API key login — the same flow as InternalUI.
Supabase Auth (ClientUI)
Login/Signup Options:
- Email/Password: Client-side Supabase JS SDK handles signInWithPassword (login) and signUp (registration), then exchanges tokens with the server via POST /auth/token-exchange
- OAuth (auto-detected): OAuth buttons appear on both login and signup pages. On startup, ClientUI calls GET /auth/v1/settings to discover which providers are enabled in the Supabase project (e.g., GitHub, Google). Only enabled providers are shown. Supabase's signInWithOAuth handles both sign-in and sign-up automatically; users who don't have an account are created on first OAuth login.
- Provider auto-detection: The OAuthProviders []string field on UIConfig is populated at startup and passed to both login and signup page templates. If the settings endpoint is unreachable, OAuth buttons are silently omitted.
Session Cookies:
- planekeeper_session: AES-GCM encrypted blob containing access_token, refresh_token, expiry, user_id, email, supabase_id, is_approved. HTTP-only, 7-day expiry.
- planekeeper_org: Active organization ID (plain int64). Validated against membership on every request.
Login Flow (email/password):
POST /auth/login (email + password)
│
▼
Server calls Supabase /auth/v1/token?grant_type=password
│
▼
Find or create user in DB (by supabase_id or email)
│ (new users created with is_approved = FALSE)
│
▼
Set planekeeper_session cookie (encrypted, includes is_approved)
│
├── User approved (is_approved = TRUE)?
│ ├── NO → Redirect to /pending-approval
│ └── YES ↓
│
├── User has org memberships?
│ ├── YES → Set planekeeper_org cookie → Redirect to /dashboard
│ └── NO → Redirect to /onboarding
OAuth Flow (client-side, via Supabase JS SDK):
User clicks OAuth button on /login or /auth/signup
│
▼
Supabase JS SDK calls signInWithOAuth({ provider, redirectTo })
│
▼
Browser redirects to Supabase /auth/v1/authorize → provider (e.g., GitHub)
│
▼
User authenticates with provider
│
▼
Provider redirects back to Supabase → Supabase redirects to AUTH_CALLBACK_URL
│
▼
GET /auth/callback (tokens in URL hash)
│
▼
Callback page JS extracts tokens → POST /auth/token-exchange
│
▼
Server validates JWT, finds/creates user → Set session → Dashboard or onboarding
Onboarding (first login, no org memberships):
GET /onboarding
│
├── Check pending invites by email
│
▼
Show "Create Organization" form + pending invite list
│
├── POST /onboarding/create-org → Create org + owner membership → Dashboard
└── POST /onboarding/accept-invite/:token → Create membership → Dashboard
Token Refresh: The auth middleware transparently refreshes expired JWTs using the stored refresh token. Updated tokens are re-encrypted and written back to the session cookie.
Middleware (pkg/auth/middleware.go):
- RequireAuth: Validates session cookie, refreshes expired JWT, sets context locals (user_id, email, supabase_id)
- RequireOnboarded: Checks user is approved (redirects to /pending-approval if not), then checks org membership and active org cookie (redirects to /onboarding if not)
API Key Auth (InternalUI + Legacy ClientUI)
Login Flow:
- User navigates to /login (unauthenticated)
- Enters API key in form
- Handler validates key, stores it in the HTTP-only cookie planekeeper_api_key (24-hour expiry)
- All subsequent requests include the cookie automatically
Middleware (internal/middleware/api_key.go):
- Checks the X-API-Key header first, then the planekeeper_api_key cookie
- On failure: redirects browser requests to the login page, returns 401 for API requests
- ClientUI requires organization-scoped keys
- InternalUI requires system-scoped keys (organization_id = 0)
API Client Construction
All handlers use a shared helper to construct API clients from the current auth context:
getAPIClient(c, cfg)
│
├── Has Supabase session? → NewAPIClientWithJWT(accessToken, orgID)
│ (uses Authorization: Bearer + X-Organization-Id headers)
│
└── Has API key? → NewAPIClient(apiKey)
(uses X-API-Key header)
This dual-path helper allows all handlers to work identically regardless of auth method.
14.2 Navigation Structure
Client UI (organization-scoped resources):
| Section | Page | Route |
|---|---|---|
| Overview | Dashboard | /dashboard |
| Jobs | Gather Jobs | /jobs |
| Jobs | Scrape Jobs | /scrape-jobs |
| Data | Releases | /releases |
| Rules | Monitoring Rules | /rules |
| Rules | Alert Configs | /alert-configs |
| Monitoring | Alerts | /alerts |
| Notifications | Channels | /notification-channels |
| Notifications | Rules | /notification-rules |
| Notifications | Settings | /notification-settings |
| Notifications | Deliveries | /notification-deliveries |
Internal UI (global/system resources):
| Section | Page | Route |
|---|---|---|
| Jobs | Gather Jobs | /jobs |
| System | Agents | /agents |
| System | Settings | /settings |
14.3 Page Patterns
Every list page follows a consistent pattern:
┌─────────────────────────────────────────┐
│ Title [Create] btn │
├─────────────────────────────────────────┤
│ Success/Error banner (from query param)│
├─────────────────────────────────────────┤
│ Inline create form (if ?new=true) │
├─────────────────────────────────────────┤
│ ┌─────────────────────────────────┐ │
│ │ Table with headers │ │
│ │ Row 1 | Row 2 | Actions │ │
│ │ ... │ │
│ │ "No items found" (if empty) │ │
│ ├─────────────────────────────────┤ │
│ │ Pagination (Prev | Next) │ │
│ └─────────────────────────────────┘ │
└─────────────────────────────────────────┘
Every detail/edit page follows:
┌─────────────────────────────────────────┐
│ ← Back to List Page Title │
├─────────────────────────────────────────┤
│ Success/Error banner │
├─────────────────────────────────────────┤
│ Edit form with fields │
│ [Cancel] [Save Changes] │
├─────────────────────────────────────────┤
│ Danger Zone: [Delete] │
├─────────────────────────────────────────┤
│ Created: YYYY-MM-DD | Updated: ... │
└─────────────────────────────────────────┘
14.4 Form Handling Flow
Standard CRUD Flow:
GET /resource?new=true → Render list page with inline create form
POST /resource → Parse form, call API, redirect with ?success= or ?error=
GET /resource/{id} → Render detail/edit page
POST /resource/{id} → Parse form, call API, redirect with result
POST /resource/{id}/delete → Call API, redirect to list with result
POST /resource/{id}/toggle → Toggle active state (HTMX or redirect)
Error Message Flow:
API Response (JSON) → Services Layer (extract fields) → Handler (format message) → URL redirect (?error=encoded_msg)
Errors are passed as URL query parameters for stateless handling across redirects:
/notification-channels/1?error=Delivery+failed+with+HTTP+400
/rules?success=Rule+created+successfully
Error Sanitization (internal/handlers/form_helpers.go):
- Internal errors (connection refused, context deadline, database constraint violations) are mapped to generic user-friendly messages
- Message length capped at 100 characters
- Prevents leaking sensitive database/system details to the UI
Graceful Degradation: When API calls fail, pages render with empty data rather than showing error pages. The dashboard, for example, renders empty metrics if GetDashboardStats fails.
14.5 HTMX Integration
Several pages use HTMX for partial page updates without full reloads:
| Feature | Pattern | Pages |
|---|---|---|
| Active toggle | hx-post to toggle endpoint, hx-target replaces row, hx-swap="outerHTML" | Rules, alert configs, notification channels, notification rules |
| Settings edit | hx-get loads inline edit form, hx-put saves, hx-target replaces cell | Settings |
| Alert acknowledge | hx-post to acknowledge endpoint, replaces alert row | Alerts |
Toggle Handler Logic:
if request has "HX-Request" header:
→ render and return updated table row only (outerHTML swap)
else:
→ redirect to list page with success message
14.6 Scope Filtering
The dashboard and jobs pages support multi-tenant scope filtering via a dropdown:
| Scope | Label | Behavior |
|---|---|---|
| org | Organization | Shows only the current organization's resources |
| global | Global | Shows only globally-shared resources |
| all | All | Shows both organization and global resources |
Default Scopes:
- Dashboard defaults to org (most relevant view for operators)
- Jobs page defaults to org (consistent with dashboard; operators can switch to "all" to see global jobs)
The scope parameter is passed through to the API’s list endpoints which apply the corresponding SQL filter (see Section 10: Multi-Tenancy Model).
14.7 Pagination
List pages support offset-based pagination via query parameters:
| Parameter | Default | Validation |
|---|---|---|
| limit | 50 | Must be 1-100 |
| offset | 0 | Must be ≥ 0 |
The pagination component renders Previous/Next links and a “Showing X-Y” counter. It determines whether a “Next” link is needed by checking if itemCount == limit (indicating more items may exist).
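The offset arithmetic and the itemCount == limit heuristic can be sketched as follows (function name is illustrative):

```go
package main

import "fmt"

// paginate computes the "Showing X-Y" range and whether a Next link is
// rendered. As described above, Next appears whenever a full page came
// back (itemCount == limit), which only suggests more items may exist.
func paginate(offset, limit, itemCount int) (from, to int, hasNext bool) {
	from = offset + 1
	to = offset + itemCount
	if itemCount == 0 {
		from = 0 // empty page: show "Showing 0-0"
	}
	return from, to, itemCount == limit
}

func main() {
	from, to, next := paginate(50, 50, 50) // second full page of 50
	fmt.Printf("Showing %d-%d, next=%v\n", from, to, next)
}
```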
14.8 Shared Component Library
Reusable templ components in internal/templates/components/ enforce consistent UI patterns across all pages:
| Component | Purpose | Business Logic |
|---|---|---|
| ScopeFilter | Scope dropdown + optional search | Renders All/Organization/Global options, preserves current selection |
| ActiveToggle | HTMX toggle badge | Green “Active” / gray “Inactive”, posts to toggle endpoint |
| ActionCell | Edit + Delete buttons | Edit link + delete form with confirmation dialog |
| EmptyRow | Empty state message | Spans configurable number of table columns |
| FormButtons | Cancel + Submit pair | Cancel links back to list, submit posts form |
| DetailPageHeader | Back link + title | Consistent navigation on detail pages |
| Timestamps | Created/Updated footer | Formats as YYYY-MM-DD HH:MM:SS |
| FormCard | Card wrapper with title | White card with close button, uses slot pattern |
| Pagination | Page navigation | Previous/Next with offset arithmetic |
| MetricCard | Dashboard stat card | Large number with label and icon |
| StatusBadge | Job status indicator | Color-coded: pending=amber, in_progress=blue, completed=green, failed=red |
| SeverityBadge | Alert severity indicator | critical=red, high=orange, moderate=yellow |
| RuleTypeBadge | Rule type indicator | days_behind=purple, majors_behind=indigo, minors_behind=cyan |
| HealthBadge | Agent health indicator | green “Healthy” / red “Unhealthy” |
| ChannelTypeBadge | Channel type indicator | webhook=purple, pagerduty=green, telegram=blue, smtp=yellow |
| DeliveryStatusBadge | Delivery status indicator | pending=yellow, succeeded=green, failed=red, dead_letter=dark |
| EventTypeBadge | Alert event indicator | created=blue, escalated=orange, acknowledged=green, resolved=purple |
| BoolBadge | Generic yes/no indicator | Configurable true/false labels with green/gray colors |
| ErrorMessage | Error banner | Red banner, renders only when message is non-empty |
| SuccessMessage | Success banner | Green banner, renders only when message is non-empty |
14.9 Page-Specific Business Logic
Dashboard: Aggregates job statistics (total, pending, completed) and system health into four metric cards. Displays recent gather jobs in a table sorted by creation date.
Gather Jobs: Form validation includes name, artifact name, source type, cron schedule, tag filter, and version regex. Supports “Run Now” to trigger immediate execution and “Clear Releases” to purge cached upstream data.
Scrape Jobs: Displays the latest version snapshot inline with each job row. Form includes regex validation endpoint. Version history page shows historical snapshots with configurable limit (1-20).
Releases: Two view modes — flat list and grouped by artifact. Supports filters for artifact name (autocomplete from known artifacts), version text, sort order (newest/oldest first), and prerelease inclusion toggle. Summary bar shows total count, unique artifacts, and stable release count.
Rules: Three rule types with tiered thresholds (moderate ≤ high ≤ critical). Threshold values are contextual — “days” for days_behind, “versions” for majors/minors_behind. “Evaluate All” button triggers rule evaluation across all active alert configs.
Alert Configs: Links three resources (scrape job + gather job + rule) into a monitoring configuration. Form dropdowns are populated dynamically from available resources. Displays rule type badge alongside rule name.
Alerts: Filters by acknowledgment status and severity. Summary panel shows counts by severity. Supports single and bulk acknowledgment. Table rows are color-coded by severity (red border for critical, orange for high, yellow for moderate).
Notification Channels: Channel detail page includes “Test Channel” button that sends a test webhook and displays detailed results (HTTP status, response preview, latency). Event-specific template editor with toggle to enable/disable, showing inherited global defaults as collapsible previews.
Notification Settings: Organization-level default channel selection. Per-category template management (new_alert, acknowledged, resolved) with “Reset to Global” option for organization overrides.
Settings: Combined view showing global defaults alongside organization overrides. Inline HTMX editing — click “Set Override” to enter edit mode, “Clear” to remove override and fall back to global default.
15. Open Questions & Ambiguities
1. Retry Exhaustion Recovery
Issue: When a job reaches max_attempts and enters failed status, there is no automatic recovery mechanism.
Current Behavior: Job remains in failed status indefinitely until manual intervention.
Possible Solutions:
- Manual trigger via the TriggerJob endpoint
- Admin UI “retry” button
- Automatic reset after configurable cooldown period
2. Token Expiry vs Stale Reset Overlap
Issue: Execution tokens expire after ~5 minutes (configurable), but stale job reset happens after 1 hour.
Scenario:
- Agent claims job, gets 5-minute token
- Agent crashes at minute 3
- Token expires at minute 5
- Job remains in_progress until minute 60 (stale reset)
Question: Should there be intermediate recovery (e.g., 15-minute token expiry detection)?
3. Global Jobs Organization Association
Issue: Global jobs (is_global = true) have organization_id = NULL, but releases created by these jobs use organization_id = 1 (Global org).
Implication: Query logic must account for both NULL and 1 when listing global releases.
4. Minors Behind Formula Discrepancy
Issue: The formula-based calculation for minors_behind may differ from release-list counting.
Example:
- Formula: 6.11 → 8.1 = (8-6) + 1 = 3 minors behind
- Release-list might show: 7.0, 7.1, 7.2, 8.0, 8.1 = 5 minors behind
Question: Should formula fallback be documented as “approximate” in alerts?
5. Alert History Retention
Current Behavior: Resolved alerts are soft-deleted and preserved indefinitely.
Potential Improvements:
- Add configurable retention period (e.g., 90 days)
- Add a PurgeOldResolvedAlerts cleanup job to taskengine
- Consider archival to a separate table for very old alerts
Note: The PurgeOldResolvedAlerts query exists but is not currently called by any scheduled job.
16. Regression Test Recommendations
Gather Jobs Domain
| Test Case | Description |
|---|---|
TestGatherJob_GitHub_RateLimitHandling | Verify proper error message and retry behavior on 403/429 |
TestGatherJob_GitHub_Pagination | Ensure all 1000 releases fetched across 10 pages |
TestGatherJob_Helm_LargeIndex | Verify 50MB limit enforced, graceful error |
TestGatherJob_VersionRegex_CaptureGroup | Confirm capture group extraction vs full match |
TestGatherJob_StateTransition_MaxAttempts | Verify pending→failed after max_attempts |
TestGatherJob_Reschedule_CronExpression | Verify next_run_at calculation accuracy |
TestGatherJob_OrphanRecovery | Jobs released when agent disconnects |
Scrape Jobs Domain
| Test Case | Description |
|---|---|
TestScrapeJob_CredentialAssignment | Only agents with credential receive job |
TestScrapeJob_Parser_YQ_ArrayIndexing | .dependencies[0].version works |
TestScrapeJob_Parser_Regex_NoMatch | Graceful error when pattern doesn’t match |
TestScrapeJob_VersionTransform_All | All 5 transforms work correctly |
TestScrapeJob_HistoryLimit_Cleanup | Old snapshots deleted when limit exceeded |
TestScrapeJob_TriggerRuleEvaluation | Async rule evaluation triggered on success |
Task Execution System
| Test Case | Description |
|---|---|
TestTask_SkipLocked_ConcurrentClaim | Two agents don’t get same job |
TestTask_TokenExpiry_ReturnsConflict | 409 returned for expired token |
TestTask_IdempotentCompletion | Same token submitted twice returns 202 |
TestTask_ResultProcessingFailure | Agent gets 202 even if processing fails |
TestTask_CapabilityFiltering | Agent without capability doesn’t receive job type |
Rules Engine
| Test Case | Description |
|---|---|
TestRule_DaysBehind_VersionNotFound | CRITICAL with -1 behindBy |
TestRule_MajorsBehind_VersionParseFail | CRITICAL on semver error |
TestRule_MinorsBehind_ReleaseListVsFormula | Both methods produce reasonable results |
TestRule_StableOnly_SkipsPrerelease | alpha/beta/rc excluded from latest |
TestRule_ThresholdTiers_HighestWins | Critical returned when >= critical threshold |
Alert System
| Test Case | Description |
|---|---|
TestAlert_OnePerConfig | Only one active alert per config allowed |
TestAlert_UpdateInPlace | Version change updates existing alert |
TestAlert_AckResetOnVersionChange | Ack cleared only when discovered version changes |
TestAlert_AckPreservedOnSameVersion | Ack preserved when same version re-evaluated |
TestAlert_SoftDelete_SetsResolvedAt | Resolution sets resolved_at, doesn’t delete |
TestAlert_SoftDelete_PreservesHistory | Resolved alerts accessible via /alerts/resolved |
TestAlert_SoftDelete_NewAlertAfterResolve | New alert can be created after previous resolved |
TestAlert_AutoTrigger_OnConfigCreate | Evaluation runs after config creation |
TestAlert_AutoTrigger_OnScrapeSuccess | Evaluation runs after scrape completes |
TestAlert_AutoTrigger_OnGatherSuccess | Evaluation runs after gather completes |
TestAlert_Resolution_NotifiesWebhook | alert.resolved event dispatched on resolve |
TestAlert_BulkAcknowledge_OnlyActive | Bulk ack only affects active alerts |
TestAlert_ListResolved_Pagination | /alerts/resolved supports limit/offset |
TestAlert_ListResolved_SeverityFilter | /alerts/resolved filters by severity |
Notification System
| Test Case | Description |
|---|---|
TestNotification_RuleMatching_SeverityFilter | Rule with severity filter only matches specified severities |
TestNotification_RuleMatching_EventFilter | Rule with event filter only matches specified events |
TestNotification_RuleMatching_CatchAll | Empty filters match all severities and events |
TestNotification_RuleMatching_Priority | Higher priority rules evaluated first |
TestNotification_Dispatch_ChannelDedup | Same alert doesn’t notify same channel twice |
TestNotification_Dispatch_OrgDefault | Falls back to org default when no rule matches |
TestNotification_Dispatch_RepeatInterval | Skips notification within repeat interval |
TestNotification_Delivery_SkipLocked | Multiple notifiers don’t claim same delivery |
TestNotification_Delivery_StatusTransitions | Correct state machine: pending→in_progress→succeeded/failed |
TestNotification_Retry_ExponentialBackoff | Delays increase with attempts |
TestNotification_Retry_RetryAfterHeader | Honors 429 Retry-After header |
TestNotification_Retry_DeadLetter | Moves to dead_letter after max attempts |
TestNotification_Retry_NonRetryable4xx | 4xx (except 429) goes to dead_letter immediately |
TestNotification_Webhook_HMACSignature | Correct HMAC-SHA256 signature in header |
TestNotification_Webhook_IdempotencyKey | Same key across retries |
TestNotification_Webhook_SSRFProtection | Private IPs blocked by default |
TestNotification_Ack_TokenValidation | Token lookup, expiry, and single-use |
TestNotification_Ack_AlertUpdate | Alert marked acknowledged via callback |
TestNotification_Housekeeping_ExpireTokens | Expired tokens deleted |
TestNotification_Housekeeping_CleanupDeliveries | Stuck deliveries moved to dead_letter |
Multi-Tenancy
| Test Case | Description |
|---|---|
TestTenancy_OrgIsolation | Org A can’t see Org B resources |
TestTenancy_GlobalVisibility | All orgs see global resources |
TestTenancy_SystemKeyCreatesGlobal | System key creates is_global=true |
TestTenancy_ScopeParameter | org/global/all filters work correctly |
Agent Communication
| Test Case | Description |
|---|---|
TestAgent_Heartbeat_UpdatesLastSeen | Timestamp updated on heartbeat |
TestAgent_Heartbeat_StoresCapabilities | Metadata includes capabilities JSON |
TestAgent_OrphanCleanup_DisconnectedAgent | Jobs reset when agent removed |
TestAgent_OrphanCleanup_StartupDelay | No cleanup in first 30 seconds |
Metrics API
The metrics API should comply with the OpenMetrics standard. See: https://github.com/prometheus/OpenMetrics/blob/main/specification/OpenMetrics.md
| Test Case | Description |
|---|---|
TestMetrics_ContentNegotiation_JSON | Default Accept header returns JSON |
TestMetrics_ContentNegotiation_Prometheus | Accept: text/plain returns Prometheus format |
TestMetrics_NoAuthRequired | Endpoint accessible without API key |
TestMetrics_SystemWideOrganizationCounts | Returns total and active org counts |
TestMetrics_ServiceInstancesByType | Returns service instances grouped by type |
TestMetrics_JobCountsAllTypes | Includes gather, scrape, and helm_sync jobs |
TestMetrics_AgentHealth_5MinuteThreshold | Agents without heartbeat in 5 min marked unhealthy |
TestMetrics_TaskExecutions_24HourWindow | Only includes executions from last 24 hours |
TestMetrics_TaskExecutions_SuccessRate | Calculates success rate correctly |
TestMetrics_APIKeyCounts | Returns total, active, and system key counts |
TestMetrics_PrometheusFormat_NoOrgLabels | Prometheus output has no org_id labels |
TestMetrics_PrometheusFormat_HelpAndType | All metrics have HELP and TYPE declarations |
Admin UI
| Test Case | Description |
|---|---|
TestUI_Login_ValidKey | Valid API key sets cookie and redirects to dashboard |
TestUI_Login_InvalidKey | Invalid key shows error on login page |
TestUI_Logout_ClearsCookie | Logout clears auth cookie and redirects to login |
TestUI_Dashboard_DefaultScope | Dashboard defaults to org scope |
TestUI_Dashboard_ScopeFilter | Scope parameter filters jobs correctly |
TestUI_Dashboard_GracefulDegradation | Dashboard renders with empty data when API fails |
TestUI_ListPage_Pagination | Limit/offset query params paginate results |
TestUI_ListPage_EmptyState | Empty table shows “No items found” message |
TestUI_CreateForm_Validation | Missing required fields redirect with error message |
TestUI_CreateForm_Success | Valid form redirects with success message |
TestUI_Toggle_HTMX | Toggle with HX-Request header returns updated row only |
TestUI_Toggle_NonHTMX | Toggle without HTMX redirects to list page |
TestUI_Delete_Confirmation | Delete form includes confirmation dialog |
TestUI_ErrorSanitization | Internal errors mapped to generic user-friendly messages |
Appendix: Key SQL Functions Reference
Gather Jobs
| Function | Purpose |
|---|---|
CreateGatherJob | Insert new job with defaults |
ClaimNextPendingGatherJob | Atomic claim with SKIP LOCKED |
CompleteGatherJob | Mark successful completion |
FailGatherJob | Increment attempts, possibly transition to failed |
RescheduleGatherJob | Calculate and set next_run_at |
ResetStaleGatherJobs | Reset jobs in_progress > 1 hour |
ResetOrphanedGatherJobs | Reset jobs claimed by dead agents |
Scrape Jobs
| Function | Purpose |
|---|---|
CreateScrapeJob | Insert new job with defaults |
ClaimNextPendingScrapeJobWithCredentials | Credential-aware atomic claim |
CompleteScrapeJob | Mark successful completion |
FailScrapeJob | Increment attempts, possibly transition to failed |
RescheduleScrapeJob | Calculate and set next_run_at |
Version Snapshots
| Function | Purpose |
|---|---|
CreateVersionSnapshot | Insert new snapshot (no ON CONFLICT - full history) |
GetVersionSnapshot | Get snapshot by ID |
GetVersionSnapshotByScrapeJob | Get latest snapshot for job (ORDER BY id DESC) |
GetLatestVersionSnapshot | Get latest version for rules evaluation (ORDER BY id DESC) |
ListVersionHistoryByScrapeJob | List version history with limit |
CountVersionSnapshotsByScrapeJob | Count snapshots for history display |
DeleteOldVersionSnapshots | Purge snapshots beyond retention limit |
Note: All “latest” queries use ORDER BY id DESC instead of ORDER BY discovered_at DESC to guarantee insertion order regardless of timestamp precision.
Alerts
| Function | Purpose |
|---|---|
UpsertAlert | Create or update active alert (one per config) |
GetAlertByID | Get alert with joined config/rule data |
GetAlertByConfigID | Get active alert for specific config |
ListAlertsByOrganization | List active alerts with filters |
CountAlertsByOrganization | Count active alerts with filters |
ListResolvedAlerts | List resolved (historical) alerts |
CountResolvedAlerts | Count resolved alerts |
AcknowledgeAlert | Mark active alert as acknowledged |
BulkAcknowledgeAlerts | Batch acknowledge by ID array |
UnacknowledgeAlert | Clear acknowledgement on active alert |
ResolveAlert | Soft delete: set resolved_at timestamp |
ResolveAlertsForConfig | Soft delete all active alerts for config |
GetAlertSummary | Count active alerts by severity and ack status |
PurgeOldResolvedAlerts | Hard delete resolved alerts older than N days |
Notification Channels
| Function | Purpose |
|---|---|
CreateNotificationChannel | Insert new channel with config |
GetNotificationChannel | Get channel by ID with org check |
ListNotificationChannels | List channels for organization |
UpdateNotificationChannel | Update channel settings |
DeleteNotificationChannel | Remove channel |
ToggleNotificationChannel | Toggle active state |
Notification Rules
| Function | Purpose |
|---|---|
CreateNotificationRule | Insert new rule with filters |
GetNotificationRule | Get rule by ID with org check |
ListNotificationRules | List rules for organization (ordered by priority) |
UpdateNotificationRule | Update rule settings |
DeleteNotificationRule | Remove rule |
ToggleNotificationRule | Toggle active state |
GetMatchingRulesForAlert | Get rules matching severity and event type |
Notification Deliveries
| Function | Purpose |
|---|---|
CreateNotificationDelivery | Insert new delivery record |
ClaimPendingDeliveries | Atomic claim with SKIP LOCKED |
MarkDeliverySucceeded | Update status to succeeded |
MarkDeliveryFailed | Update status and schedule retry |
ListDeliveriesByOrganization | List deliveries with filters |
ListDeliveriesByAlert | List deliveries for specific alert |
ListDeadLetters | List dead letter deliveries |
ResetDeliveryForRetry | Reset dead letter for retry |
RecentSuccessInGroup | Check for recent success in group interval |
Notification Ack Tokens
| Function | Purpose |
|---|---|
CreateAckToken | Insert new acknowledgment token |
GetAckTokenByToken | Lookup token by value |
MarkAckTokenUsed | Mark token as used |
CleanupExpiredAckTokens | Delete expired tokens |
System Metrics
| Function | Purpose |
|---|---|
GetSystemMetricsOrganizationCounts | Total and active organization counts |
GetSystemMetricsServiceInstances | Service instances by type with health status |
GetSystemMetricsAgentCounts | Agent counts with health status (system-wide) |
GetSystemMetricsGatherJobCounts | Gather job counts by status (system-wide) |
GetSystemMetricsScrapeJobCounts | Scrape job counts by status (system-wide) |
GetSystemMetricsHelmSyncJobCounts | Helm sync job counts by status (system-wide) |
GetSystemMetricsAlertSummary | Alert counts with severity breakdown (system-wide) |
GetSystemMetricsReleaseSummary | Release counts (system-wide) |
GetSystemMetricsTaskExecutions | Task execution stats for last 24 hours (system-wide) |
GetSystemMetricsAPIKeyCounts | API key counts (total, active, system) |
Document generated: 2026-02-05 Source: Planekeeper codebase analysis