Planekeeper Business Logic Documentation
A comprehensive guide to the business logic, workflows, and domain rules that power Planekeeper’s automated software version staleness detection system.
Table of Contents
- Executive Summary
- Gather Jobs Domain
- Scrape Jobs Domain
- Helm Sync Jobs Domain
- Task Execution System
- Background Scheduler
- Rules Engine
- Alert System
- Notification System
- Multi-Tenancy Model
- Agent Communication
- Metrics API
- Developer Tools
- Admin UI
- Open Questions & Ambiguities
- Regression Test Recommendations
1. Executive Summary
System Purpose
Planekeeper is an automated software version staleness detection system. It monitors deployed software versions against upstream releases to identify when software falls behind, applying configurable rules to generate severity-graded alerts.
Architecture Triad
┌─────────────────────────────────────────────────────────────────┐
│ SERVER │
│ ┌──────────┐ ┌──────────┐ ┌───────────┐ ┌────────────────┐ │
│ │ REST API │ │ Admin UI │ │ Heartbeat │ │ Orphan Cleanup │ │
│ │ (Fiber) │ │ (templ) │ │ Service │ │ Service │ │
│ └────┬─────┘ └────┬─────┘ └─────┬─────┘ └───────┬────────┘ │
│ └──────────────┴─────────────┴────────────────┘ │
│ │ │
│ ┌────────┴────────┐ │
│ │ PostgreSQL │ │
│ │ (sqlc/goose) │ │
│ └────────┬────────┘ │
└────────────────────────────┼────────────────────────────────────┘
│
┌────────────────────┼────────────────────┐
│ │ │
▼ ▼ ▼
┌───────────────┐ ┌───────────────┐ ┌───────────────┐
│ AGENT 1 │ │ TASKENGINE │ │ AGENT N │
│ ┌───────────┐ │ │ ┌───────────┐ │ │ ┌───────────┐ │
│ │ Gatherer │ │ │ │ Scheduler │ │ │ │ Scraper │ │
│ │ (GitHub) │ │ │ │ (cron) │ │ │ │ (git+YQ) │ │
│ └───────────┘ │ │ ├───────────┤ │ │ └───────────┘ │
│ ┌───────────┐ │ │ │ Processor │ │ └───────────────┘
│ │ Scraper │ │ │ │ (results) │ │
│ └───────────┘ │ │ └───────────┘ │
└───────────────┘ └───────────────┘
Server: Hosts REST API, Admin UI, runs database migrations, manages heartbeat detection and orphan cleanup.
Agent: Polls for tasks, executes gather (fetch upstream releases) and scrape (extract deployed versions) jobs.
TaskEngine: Handles job scheduling, timeout detection, cron-based rescheduling, and result processing.
Core Workflow
┌──────────────┐ ┌───────────────┐ ┌──────────────┐ ┌─────────────┐
│ GATHER JOBS │────▶│ SCRAPE JOBS │────▶│ RULES ENGINE │────▶│ ALERTS │
│ │ │ │ │ │ │ │
│ Fetch latest │ │ Extract │ │ Calculate │ │ Create/ │
│ releases │ │ deployed │ │ behind-by │ │ update │
│ from GitHub/ │ │ versions │ │ values │ │ alerts │
│ Helm repos │ │ from repos │ │ │ │ │
└──────────────┘ └───────────────┘ └──────────────┘ └─────────────┘
│ │ │ │
▼ ▼ ▼ ▼
upstream_releases version_snapshots alert_configs alerts
(table) (table) (table) (table)
2. Gather Jobs Domain
Files: pkg/gatherer/github.go, pkg/gatherer/helm.go, pkg/gatherer/oci.go, pkg/gatherer/endoflife.go, pkg/api/gather_jobs_handler.go, pkg/storage/queries/gather_jobs.sql
Purpose
Fetch upstream releases from external sources to establish the “latest available version” baseline for staleness detection. ~173 global gather jobs are pre-seeded (migration 034) to provide immediate coverage of common infrastructure software.
Source Types
| Source Type | Example Artifact Name | Description |
|---|---|---|
| github_releases | github.com/kubernetes/kubernetes | GitHub releases via REST API |
| helm_repository | argoproj.github.io/argo-helm/argo-cd | Helm chart versions via index.yaml |
| oci_registry | docker.io/library/nginx | OCI container image tags via registry API |
| endoflife_date | endoflife.date/python | Product lifecycle data from endoflife.date |
Input Parameters
| Field | Required | Description |
|---|---|---|
| artifact_name | Yes | Source identifier (owner/repo for GitHub, repo-url/chart-name for Helm) |
| source_type | Yes | github_releases, helm_repository, oci_registry, or endoflife_date |
| name | No | Human-readable job name |
| schedule | No | Cron expression for recurring execution |
| tag_filter | No | Regex to filter which tags to include |
| version_regex | No | Regex with capture group to extract version from tag |
| source_config | No | JSONB for source-specific configuration |
| labels | No | JSONB key-value pairs for categorization (e.g., {"category": "kubernetes"}) |
| is_global | No | Create as global resource (system keys only) |
Business Rules
Tag Filtering:
- Optional regex pattern applied to release tags
- Non-matching tags are excluded from results
- Invalid regex patterns are logged as warnings and fall back to no filtering
Version Extraction:
- If version_regex has a capture group → use the first captured group
- If version_regex matches but has no capture group → use the full match
- Fallback: strip a leading v or V prefix (GitHub) or return as-is (Helm)
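The extraction precedence above can be sketched in Go. extractVersion is a hypothetical helper, not the gatherer's actual API, and treating a non-matching or invalid regex as "fall through to the prefix fallback" is an assumption:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// extractVersion applies the precedence described above: a capture group
// wins, then a bare full match, then the v/V-prefix fallback.
// (Illustrative name; not the real gatherer function.)
func extractVersion(tag, versionRegex string) string {
	if versionRegex != "" {
		re, err := regexp.Compile(versionRegex)
		if err == nil { // invalid patterns fall through to the fallback (assumption)
			if m := re.FindStringSubmatch(tag); m != nil {
				if len(m) > 1 {
					return m[1] // first capture group
				}
				return m[0] // full match when no capture group
			}
		}
	}
	// Fallback: strip a leading v or V prefix (GitHub-style tags).
	return strings.TrimPrefix(strings.TrimPrefix(tag, "v"), "V")
}

func main() {
	fmt.Println(extractVersion("release-1.2.3", `release-(\d+\.\d+\.\d+)`)) // 1.2.3
	fmt.Println(extractVersion("v1.2.3", ""))                               // 1.2.3
}
```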
Pagination Limits:
- GitHub: 100 releases per page, max 10 pages (1,000 releases total)
- Helm: Max 1,000 versions per chart, 50 MB index file limit
Prerelease Detection (Helm): Versions containing these patterns (case-insensitive) are marked as prereleases:
-alpha, -beta, -rc, -dev, -preview, -snapshot, .alpha, .beta, .rc, .dev, .preview, .snapshot
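A minimal sketch of this check, assuming the marker list above is exhaustive (isHelmPrerelease is an illustrative name, not the real function):

```go
package main

import (
	"fmt"
	"strings"
)

// isHelmPrerelease mirrors the pattern list above: any of these markers,
// preceded by "-" or ".", flags the version as a prerelease (case-insensitive).
func isHelmPrerelease(version string) bool {
	v := strings.ToLower(version)
	for _, marker := range []string{"alpha", "beta", "rc", "dev", "preview", "snapshot"} {
		if strings.Contains(v, "-"+marker) || strings.Contains(v, "."+marker) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(isHelmPrerelease("1.2.3-rc.1")) // true
	fmt.Println(isHelmPrerelease("1.2.3"))      // false
}
```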
State Machine
┌──────────────────────────────────────────────────────────┐
│ │
▼ │
┌─────────┐ ┌─────────┐ ┌────────┐ ┌───────────┐ ┌────────┐ │
│ CREATED │───▶│ PENDING │───▶│ QUEUED │───▶│IN_PROGRESS│───▶│COMPLETED│──────┘
└─────────┘ └─────────┘ └────────┘ └───────────┘ └────────┘ (reschedule
▲ ▲ │ if cron)
│ │ │
│ │ ┌──────────┴──────────────┐
│ │ │ │
│ │ ▼ ▼
│ │ attempts < max attempts >= max
│ │ │ │
│ │ │ ▼
│ └───┘ ┌────────┐
│ (retry with backoff) │ FAILED │
│ └────────┘
│
└── (reschedule / stale reset / orphan reset)
- pending: Job is scheduled for the future (next_run_at > now) or waiting for retry_after
- queued: Job is eligible for agent pickup (next_run_at <= now, retry_after passed)
- in_progress: Agent has claimed and is executing the job
Key Transitions:
| Transition | Trigger | SQL Function |
|---|---|---|
| → pending | Job created (future schedule) | CreateGatherJob |
| → queued | Job created (immediate run) | TriggerGatherJobNow |
| pending → queued | Schedule time reached | TransitionPendingToQueuedGatherJobs |
| queued → in_progress | Agent claims job | ClaimNextQueuedGatherJob (SKIP LOCKED) |
| in_progress → completed | Agent reports success | CompleteGatherJob |
| in_progress → queued | Agent reports failure (retries remain) | FailGatherJob |
| in_progress → failed | Agent reports failure (max attempts) | FailGatherJob |
| completed → pending | Cron schedule triggers | RescheduleGatherJob |
| in_progress → queued | Job stale >1 hour | ResetStaleGatherJobs |
| * → queued | Claimed agent disconnected | ResetOrphanedGatherJobs |
Error Handling
Exponential Backoff:
retry_delay = 2^(attempts+1) seconds
Attempt 1: 4 seconds
Attempt 2: 8 seconds
Attempt 3: 16 seconds
...
GitHub-Specific Errors:
| HTTP Status | Error Message |
|---|---|
| 401 | GitHub authentication failed: invalid or missing token |
| 403 (rate limit) | GitHub rate limit exceeded: resets at <timestamp> |
| 403 (other) | GitHub access forbidden: repository may be private |
| 404 | GitHub repository not found: <owner/repo> |
| 429 | GitHub secondary rate limit hit: retry after N seconds |
Side Effects
- On Success: Upsert releases into the upstream_releases table (conflict on artifact_name + version)
- On Reschedule: Calculate next run time using the cron expression
- On Config Update: Delete all existing releases (orphaned by the config change)
Incremental Sync (GitHub)
GitHub gather jobs support incremental sync to reduce API calls and improve performance for repositories with many releases.
How it works:
1. First run (full sync): The gatherer fetches all releases by paginating through the GitHub Releases API (up to 100 pages × 100 per page = 10,000 releases max). After completion, the processor writes sync state to gather_jobs.sync_state:
   - full_sync_complete: true if all releases were fetched without hitting the page limit
   - releases_fetched: total count from this run
   - last_full_sync_at: current timestamp (only set for full syncs)
2. Subsequent runs (incremental): If full_sync_complete is true and last_full_sync_at is within the full sync interval (default: 2 weeks), the dispatcher injects an _incremental_since hint into the agent's source config. The GitHub gatherer uses this date to stop pagination early: once it encounters a release older than or equal to the hint date, it stops fetching additional pages.
3. Periodic full sync: When last_full_sync_at exceeds the configured GATHER_FULL_SYNC_INTERVAL (default: 336h / 2 weeks), the dispatcher omits the incremental hint, forcing a full re-fetch. This catches edited releases, backdated tags, or metadata changes that incremental mode would miss.
4. Sync state reset: Sync state is reset to {} when:
   - A gather job's configuration is updated (via PUT /gather-jobs/{id})
   - A gather job's releases are cleared (via POST /gather-jobs/{id}/clear-releases)
Scope: Only github_releases source type supports incremental sync. Helm, OCI, and endoflife.date gatherers always do full fetches (they don’t paginate the same way).
3. Scrape Jobs Domain
Files: pkg/scraper/scraper.go, pkg/parser/*.go, pkg/api/scrape_jobs_handler.go, pkg/storage/queries/scrape_jobs.sql
Purpose
Extract deployed software versions from Git repositories by parsing configuration files (YAML, JSON, or text) to establish the “currently deployed version” for staleness comparison.
Input Parameters
| Field | Required | Description |
|---|---|---|
| repository_url | Conditional | Git repository URL (HTTPS or SSH; not required for manual) |
| target_file | Conditional | Path to file containing version (e.g., Chart.yaml; not required for manual) |
| parse_type | Yes | Parser type: yq, jq, regex, or manual |
| parse_expression | Conditional | Parser-specific expression (not required for manual) |
| ref | Conditional | Git ref to checkout (default: main; not required for manual) |
| credential_name | No | Named credential for authentication |
| schedule | No | Cron expression for recurring execution |
| version_transform | No | Post-parse transformation |
| history_limit | No | Max version snapshots to retain (1-20) |
Parser Implementations
YQ Parser (YAML)
Expression Format: Dot-notation with array indexing
.version → Simple field
.metadata.version → Nested path
.dependencies[0].version → Array access
.items[2].name → Nested array access
Features:
- Supports both map[string]any and map[any]any YAML structures
- Uses gopkg.in/yaml.v3 for parsing
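A stdlib-only sketch of this traversal over an already-decoded document. The real parser first decodes YAML with gopkg.in/yaml.v3 and also handles map[any]any; this simplified version (resolvePath is a hypothetical name, as are its error messages) handles only map[string]any:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// resolvePath walks dot-notation with array indexing (e.g. ".dependencies[0].version")
// over a decoded document tree.
func resolvePath(doc any, expr string) (any, error) {
	cur := doc
	for _, part := range strings.Split(strings.TrimPrefix(expr, "."), ".") {
		key, idx := part, -1
		if open := strings.Index(part, "["); open >= 0 && strings.HasSuffix(part, "]") {
			key = part[:open]
			n, err := strconv.Atoi(part[open+1 : len(part)-1])
			if err != nil {
				return nil, fmt.Errorf("bad index in %q", part)
			}
			idx = n
		}
		m, ok := cur.(map[string]any)
		if !ok {
			return nil, fmt.Errorf("cannot descend into %q", key)
		}
		cur = m[key]
		if idx >= 0 { // step into the array element after the map lookup
			arr, ok := cur.([]any)
			if !ok || idx >= len(arr) {
				return nil, fmt.Errorf("index %d out of range at %q", idx, key)
			}
			cur = arr[idx]
		}
	}
	return cur, nil
}

func main() {
	doc := map[string]any{
		"dependencies": []any{map[string]any{"version": "1.4.0"}},
	}
	v, _ := resolvePath(doc, ".dependencies[0].version")
	fmt.Println(v) // 1.4.0
}
```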
JQ Parser (JSON)
Expression Format: Dot-notation (no array indexing)
.version → Simple field
.dependencies.react → Nested path
Limitations: Does not support array indexing (use YQ for JSON with arrays)
Regex Parser (Text)
Expression Format: Go regex pattern
version:\s*([\d.]+) → Captures version after "version:"
^v(\d+\.\d+\.\d+)$ → Captures semver from full line
Behavior: Returns first capture group if present, otherwise full match
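The capture-group-or-full-match rule can be sketched as follows (parseWithRegex is an illustrative name, and the error text is an assumption):

```go
package main

import (
	"fmt"
	"regexp"
)

// parseWithRegex returns the first capture group if the pattern defines one,
// otherwise the full match, per the behavior described above.
func parseWithRegex(content, pattern string) (string, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return "", err
	}
	m := re.FindStringSubmatch(content)
	if m == nil {
		return "", fmt.Errorf("pattern %q matched nothing", pattern)
	}
	if len(m) > 1 {
		return m[1], nil // first capture group
	}
	return m[0], nil // full match
}

func main() {
	v, _ := parseWithRegex("version: 1.2.3\n", `version:\s*([\d.]+)`)
	fmt.Println(v) // 1.2.3
}
```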
Manual Entry (No Agent)
Manual parse type allows users to enter version strings directly via the API or UI, bypassing the agent-based scraping pipeline entirely.
Behavior:
- No repository is cloned; no file is parsed
- Placeholder values are stored: repository_url = "manual://", target_file = "-", parse_expression = "-"
- Jobs are created with status = 'completed' (never pending; agents never pick them up)
- Version is set via POST /scrape-jobs/{id}/set-version with a version string in the request body
- Version transforms are applied to the manually entered version before storage
- A version snapshot is created and rule evaluation is triggered, identical to agent-completed jobs
Use Cases:
- Demo and testing environments (no agent infrastructure needed)
- Manual version tracking for systems that can’t be scraped
- Quick setup to exercise the full alert pipeline (rules, alerts, notifications)
Key Differences from Agent-Based Jobs:
| Aspect | Agent-Based | Manual |
|---|---|---|
| Task claiming | Polled by agents via SKIP LOCKED | Never enters task queue |
| Initial status | pending | completed |
| “Run Now” button | Triggers agent execution | Not available |
| Version entry | Automatic (parser output) | Via set-version API/UI |
| Required fields | repo URL, ref, target file, expression | Parse type only |
Version Transforms
| Transform | Example Input | Output |
|---|---|---|
| none | 1.2.3 | 1.2.3 |
| add_v_lower | 1.2.3 | v1.2.3 |
| add_v_upper | 1.2.3 | V1.2.3 |
| strip_v_lower | v1.2.3 | 1.2.3 |
| strip_v_upper | V1.2.3 | 1.2.3 |
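The table maps onto a simple switch. applyTransform is a hypothetical helper, and passing unknown transform names through unchanged is an assumption:

```go
package main

import (
	"fmt"
	"strings"
)

// applyTransform applies the named version transform from the table above.
func applyTransform(transform, version string) string {
	switch transform {
	case "add_v_lower":
		return "v" + version
	case "add_v_upper":
		return "V" + version
	case "strip_v_lower":
		return strings.TrimPrefix(version, "v")
	case "strip_v_upper":
		return strings.TrimPrefix(version, "V")
	default: // "none" or unrecognized (assumption: pass through)
		return version
	}
}

func main() {
	fmt.Println(applyTransform("strip_v_lower", "v1.2.3")) // 1.2.3
}
```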
Credential-Aware Assignment
Jobs requiring credentials are only assigned to agents that have those credentials:
-- ClaimNextPendingScrapeJobWithCredentials
WHERE (
credential_name IS NULL
OR credential_name = ANY(available_credentials::VARCHAR[])
)
Flow:
1. Agent declares available_credentials during heartbeat
2. Job poll includes the credential list
3. Server only assigns jobs where the agent has the required credential or the job needs no credential
Git Clone Optimization
All git clone operations use shallow clones for performance and reduced disk usage:
| Setting | Value | Purpose |
|---|---|---|
| Depth | 1 | Only fetch the latest commit (no history) |
| SingleBranch | true | Only fetch the requested branch/tag |
Benefits:
- Significantly faster clone times, especially for large repositories
- Reduced disk usage on agents (no commit history stored)
- Lower bandwidth consumption from git servers
- Minimal data transfer for version extraction (only need file content, not history)
Implementation (pkg/git/cloner.go):
cloneOpts := &git.CloneOptions{
URL: url,
Depth: 1, // Shallow clone
SingleBranch: true, // Only requested ref
ReferenceName: ref,
}
Temporary Directory Lifecycle:
1. Create temp directory (planekeeper-clone-*)
2. Shallow clone the repository
3. Read the target file for version extraction
4. Delete the temp directory (cleanup)
State Machine
Identical to Gather Jobs (see Section 2), with these additional behaviors:
- On Success: Create a version_snapshot record, trigger async rule evaluation
- History Limit: Older snapshots beyond the limit are automatically deleted
- Manual Jobs: Start and remain in completed status. Version updates via set-version create snapshots and trigger rule evaluation without changing job status
Side Effects
- On Success:
  - Insert a version_snapshot with version, raw_content, commit_sha, metadata
  - Trigger ruleEvaluator.EvaluateForOrg() asynchronously
- On History Limit Exceeded: Delete oldest snapshots via the orphan cleanup service
Version Snapshot Storage
Files: pkg/storage/queries/version_snapshots.sql, pkg/api/tasks_handler.go:661-706
Full History Tracking
Every scrape creates a new version snapshot record, regardless of whether the version has been seen before. This enables proper tracking of version changes including rollbacks.
Example Scenario:
Time T1: Scrape finds version 6.11.1 → Snapshot #1 created
Time T2: Scrape finds version 9.3.7 → Snapshot #2 created
Time T3: Rollback to 6.11.1 → Snapshot #3 created (NOT a duplicate)
Key Design Decisions:
No Unique Constraint on Version: The table allows multiple snapshots with the same version for the same job. This is intentional—each scrape represents a point-in-time observation.
ID-Based Ordering: The “latest” version snapshot is determined by ORDER BY id DESC, not by timestamp. Since IDs are auto-incrementing, this guarantees the most recently inserted row is always returned, regardless of timestamp precision issues.
History Retention: The history_limit setting on scrape jobs controls how many snapshots to retain. Older snapshots beyond this limit are automatically purged.
Database Schema
-- No unique constraint on version - allows duplicate versions for rollback tracking
CREATE TABLE version_snapshots (
id BIGINT PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
organization_id BIGINT NOT NULL,
scrape_job_id BIGINT,
repository_url VARCHAR(2048) NOT NULL,
ref VARCHAR(256),
target_file VARCHAR(1024) NOT NULL,
version VARCHAR(256) NOT NULL,
raw_content TEXT,
metadata JSONB DEFAULT '{}',
discovered_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
-- Index for efficient latest-version lookups
CREATE INDEX idx_version_snapshots_latest
ON version_snapshots(scrape_job_id, id DESC);
Query Pattern
-- GetLatestVersionSnapshot: Returns most recently inserted snapshot
SELECT version, discovered_at
FROM version_snapshots
WHERE scrape_job_id = $1
ORDER BY id DESC -- NOT discovered_at - ID is more reliable
LIMIT 1;
Why ID instead of timestamp?
- Timestamps can have precision issues (same second, microsecond truncation)
- Multiple inserts in quick succession may get identical timestamps
- Auto-incrementing IDs guarantee insertion order
- Avoids race conditions during concurrent operations
Rule Evaluation Timing
When a scrape job completes (agent-based):
1. Agent sends task completion to API
2. API creates version snapshot (INSERT)
3. API marks job as completed
4. API triggers rule evaluation asynchronously
└── EvaluateForOrg() spawns goroutine internally
5. Rule engine queries for latest version snapshot
└── Uses id DESC to get most recently inserted row
When a manual version is set:
1. User calls POST /scrape-jobs/{id}/set-version with version string
2. API applies version transform (if configured)
3. API creates version snapshot using shared recordVersionSnapshot helper
4. API triggers rule evaluation asynchronously (same as agent flow)
The rule evaluation runs asynchronously (via goroutine) but uses a fresh database connection that sees the committed snapshot data.
4. Helm Sync Jobs Domain
Files: pkg/api/helm_sync_handlers.go, pkg/taskengine/processor.go:293-427
Purpose
Automatically discover Helm charts from a repository and create/manage child gather jobs for each chart, enabling bulk monitoring of Helm-based deployments.
Input Parameters
| Field | Required | Description |
|---|---|---|
repository_url | Yes | Helm repository URL |
chart_filter | No | Regex to filter charts by name |
default_schedule | No | Cron schedule inherited by child jobs |
default_tag_filter | No | Tag filter inherited by child jobs |
default_version_regex | No | Version regex inherited by child jobs |
auto_delete_orphans | No | Delete child jobs for removed charts |
Business Rules
Chart Discovery:
- Fetch index.yaml from the repository
- Apply the optional chart_filter regex
- Extract chart name, description, latest version
Child Job Management:
For each discovered chart:
├── Build artifact_name: "repo-url/chart-name"
├── Check if gather job exists for this artifact
│ ├── YES: Skip (job already exists)
│ └── NO: Create gather job with:
│ ├── source_type = helm_repository
│ ├── schedule = default_schedule
│ ├── tag_filter = default_tag_filter
│ ├── version_regex = default_version_regex
│ └── parent_sync_job_id = this job's ID
└── Track artifact_name in discovered list
Orphan Deletion (when auto_delete_orphans = true):
DELETE FROM gather_jobs
WHERE parent_sync_job_id = @sync_job_id
AND artifact_name NOT IN (@discovered_artifacts)
State Machine
Same as Gather Jobs, with additional post-completion processing to create/delete child jobs.
Side Effects
- On Success:
- Create gather jobs for newly discovered charts
- Delete gather jobs for removed charts (if auto_delete_orphans)
- Reschedule if cron schedule exists
5. Task Execution System
Files: pkg/taskengine/dispatcher.go, pkg/taskengine/types.go, pkg/shared/types.go, pkg/taskengine/errors.go, pkg/api/tasks_handler.go
Purpose
Coordinate distributed job execution between agents and server through a token-based system with idempotency guarantees.
Token Lifecycle
┌──────────────────────────────────────────────────────────────────┐
│ TOKEN LIFECYCLE │
├──────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────┐ ┌────────────┐ ┌────────────┐ ┌────────┐ │
│ │GENERATION│───▶│ VALIDATION │───▶│ COMPLETION │───▶│CONSUMED│ │
│ └──────────┘ └────────────┘ └────────────┘ └────────┘ │
│ │ │ │ │ │
│ ▼ ▼ ▼ ▼ │
│ UUID created Check expiry Mark complete Token │
│ Expiry set Check status Process result invalidated │
│ Stored in DB Return task Update job │
│ │
└──────────────────────────────────────────────────────────────────┘
Generation:
- New UUID via uuid.New()
- Expiry: time.Now().UTC().Add(timeout) (default 5 minutes, configurable via task_execution_timeout_seconds)
- Stored in the task_executions table
Validation:
- Lookup by token
- Check status is in_progress
- Check not expired
Completion:
- Update status to completed or failed
- Process job-specific results
- Clear the execution link from the job
Assignment Flow
Agent polls: POST /tasks/{AgentUUID}/poll
│
├── Capabilities: ["gather", "scrape", "helm_sync"]
├── Available credentials: ["github_token", "ssh_key"]
├── Organization ID: from agent's API key
│
▼
Dispatcher.AssignTaskWithCredentials()
│
├── Try ClaimNextPendingGatherJob (SKIP LOCKED, org-scoped)
├── Try ClaimNextPendingScrapeJob (SKIP LOCKED, org-scoped)
└── Try ClaimNextPendingHelmSyncJob (SKIP LOCKED, org-scoped)
│
▼
CreateTaskExecution() + Return TaskAssignment
│
├── execution_token: UUID
├── expires_at: timestamp
├── job_type: "gather"|"scrape"|"helm_sync"
├── job_id: int64
└── task_data: job-specific details
Completion Flow
Agent completes: POST /tasks/{AgentUUID}/complete
│
├── execution_token: UUID
├── success: bool
├── error: optional string
└── result_data: JSON
│
▼
CompleteExecution()
│
├── Lookup execution by token
│ └── Not found → ErrTokenInvalid (409)
│
├── Check status == 'in_progress'
│ └── Already completed → ErrTokenAlreadyCompleted (202)
│
├── Check not expired
│ └── Expired → ErrTokenExpired (409)
│
└── Update execution status
│
▼
ProcessTaskResult() (job-type specific)
│
└── Return 202 Accepted
Error States
| Error | HTTP Status | Scenario |
|---|---|---|
| ErrNoTasksAvailable | 204 No Content | No pending tasks match agent capabilities |
| ErrTokenInvalid | 409 Conflict | Token not found in database |
| ErrTokenExpired | 409 Conflict | Token past expiry timestamp |
| ErrTokenAlreadyCompleted | 202 Accepted | Idempotent retry (result already recorded) |
| ErrJobNotFound | 500 | Job missing during result processing |
| ErrUnsupportedJobType | 500 | Unknown job type in execution |
Idempotency Guarantees
Database-Level (SKIP LOCKED):
SELECT id FROM jobs
WHERE status = 'pending'
AND next_run_at <= CURRENT_TIMESTAMP
ORDER BY next_run_at ASC
LIMIT 1
FOR UPDATE SKIP LOCKED
Prevents multiple agents from claiming the same job.
Organization-Level (Org-Scoped Claims):
All ClaimNextQueued* queries filter by organization_id, ensuring agents only claim jobs belonging to their own organization. The server agent (org_id=0) only claims global jobs; client agents only claim their respective organization’s jobs. This enforces multi-tenant isolation at the task claim layer.
Token-Level (State Check):
if execution.Status != repository.JobStatusInProgress {
return nil, ErrTokenAlreadyCompleted
}
Allows safe agent retries without duplicate processing.
6. Background Scheduler
Files: pkg/taskengine/engine.go, pkg/taskengine/scheduler.go, pkg/taskengine/processor.go
Processing Loops
The TaskEngine runs two concurrent background loops:
| Loop | Interval | Purpose | Batch Size |
|---|---|---|---|
| Processor | 5 seconds | Process pending results | 100 |
| Scheduler | 30 seconds | Timeout detection + scheduled activation | 100 |
Timeout Handling
Stale Job Detection:
Jobs in in_progress status for more than 1 hour are reset to pending:
-- ResetStaleGatherJobs / ResetStaleScrapeJobs
UPDATE jobs
SET status = 'pending',
claimed_by = NULL,
claimed_at = NULL
WHERE status = 'in_progress'
AND claimed_at < NOW() - INTERVAL '1 hour'
Purpose: Recover from agent crashes where no completion/failure was reported.
Scheduled Activation
Jobs with cron schedules follow this lifecycle:
┌──────────┐ ┌─────────────┐ ┌───────────┐ ┌───────────┐
│ COMPLETED│──▶│ RESCHEDULE │──▶│ WAITING │──▶│ ACTIVATED │
│ │ │ │ │ │ │ │
│ Job done │ │ Calculate │ │next_run_at│ │ Set status│
│ │ │ next_run_at │ │ in future │ │ = pending │
└──────────┘ └─────────────┘ └───────────┘ └───────────┘
│ │
│ time passes │
└───────────────┘
Rescheduling (Processor, on completion):
nextRun := cron.NextRun(job.Schedule, time.Now().UTC())
repo.RescheduleJob(jobID, nextRun)
Activation (Scheduler, every 30 seconds):
-- GetScheduledJobsReadyToRun
SELECT id FROM jobs
WHERE status = 'completed'
AND schedule IS NOT NULL
AND next_run_at <= CURRENT_TIMESTAMP
-- ActivateScheduledJob
UPDATE jobs
SET status = 'pending',
next_run_at = NULL
WHERE id = @id
Orphan Detection
Service-Based Orphan Recovery:
Jobs claimed by agents no longer in service_instances table are reset:
-- ResetOrphanedGatherJobs
UPDATE gather_jobs
SET status = 'pending',
claimed_by = NULL
WHERE status IN ('pending', 'in_progress')
AND claimed_by IS NOT NULL
AND claimed_by NOT IN (
SELECT instance_uuid FROM service_instances WHERE service_id = 2
)
Orphan Cleanup Service (Server):
- Runs every 2 minutes
- 30-second startup delay (allows agents to register)
- Also enforces version snapshot history limits
7. Rules Engine
Files: pkg/rules/evaluator.go, pkg/rules/engine.go, pkg/rules/types.go
Rule Types
| Rule Type | Measures | Calculation |
|---|---|---|
| days_behind | Age of discovered version | time.Since(releaseDate).Hours() / 24 |
| majors_behind | Major version difference | latestMajor - discoveredMajor |
| minors_behind | Minor version difference | Release-list counting or formula-based |
Threshold Tiers
Each rule defines three thresholds (must be ordered: moderate ≤ high ≤ critical):
type Rule struct {
ModerateThreshold int // First tier
HighThreshold int // Second tier
CriticalThreshold int // Third tier
}
Severity Determination (highest wins):
if behindBy >= CriticalThreshold → CRITICAL
else if behindBy >= HighThreshold → HIGH
else if behindBy >= ModerateThreshold → MODERATE
else → No violation
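The highest-wins check above can be sketched as follows (severityFor and its string return values are illustrative; the real engine presumably uses typed severities):

```go
package main

import "fmt"

// severityFor applies the highest-wins threshold check; thresholds are
// expected to satisfy moderate <= high <= critical.
func severityFor(behindBy, moderate, high, critical int) string {
	switch {
	case behindBy >= critical:
		return "critical"
	case behindBy >= high:
		return "high"
	case behindBy >= moderate:
		return "moderate"
	default:
		return "" // no violation
	}
}

func main() {
	fmt.Println(severityFor(45, 30, 60, 90)) // moderate
}
```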
Evaluation Algorithm
1. Validate inputs (config, rule, discovered version required)
│
▼
2. Get latest release from gather job
└── If none available → No violation (can't compare)
│
▼
3. Apply stable_only filter (if enabled)
└── If latest is prerelease → Find next stable release
│
▼
4. Calculate behindBy based on rule type:
│
├── days_behind:
│ └── Get discovered version's release date
│ └── If not found → ErrVersionNotFound → CRITICAL
│ └── behindBy = days since release
│
├── majors_behind:
│ └── Parse both versions as semver
│ └── If parse fails → ErrVersionParseFailed → CRITICAL
│ └── behindBy = latestMajor - discoveredMajor
│
└── minors_behind:
├── Preferred: Count unique major.minor from releases list
└── Fallback: Formula calculation
│
▼
5. Determine severity from thresholds
│
▼
6. Return EvaluationResult with severity, behindBy, message
Minors Behind Calculation
Release-List Mode (preferred):
Counts unique major.minor combinations between discovered and latest versions.
Formula Mode (fallback when releases unavailable):
Same major: latestMinor - discoveredMinor
Different major: (latestMajor - discoveredMajor) + latestMinor
Example: 7.9 → 8.1 = (8-7) + 1 = 2 minors behind
Example: 6.11 → 8.1 = (8-6) + 1 = 3 minors behind
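The fallback arithmetic can be sketched as follows (minorsBehindFormula is a hypothetical name; it assumes the major/minor components were already parsed from semver):

```go
package main

import "fmt"

// minorsBehindFormula reproduces the formula-mode calculation above:
// same major: latestMinor - discoveredMinor
// different major: (latestMajor - discoveredMajor) + latestMinor
func minorsBehindFormula(discMajor, discMinor, latestMajor, latestMinor int) int {
	if discMajor == latestMajor {
		return latestMinor - discMinor
	}
	return (latestMajor - discMajor) + latestMinor
}

func main() {
	fmt.Println(minorsBehindFormula(7, 9, 8, 1))  // 2
	fmt.Println(minorsBehindFormula(6, 11, 8, 1)) // 3
}
```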
Stable-Only Filtering
When stable_only = true, prereleases are excluded from latest version lookup.
Prerelease Detection (IsStableVersion()):
Returns false if the version contains (case-insensitive): alpha, beta, rc, dev, snapshot, canary, nightly, pre
Special Cases
Version Not Found:
- Trigger: Discovered version has no release record (for days_behind)
- Result: Marked as CRITICAL with behindBy = -1
- Message: "Version X not found in upstream releases - cannot determine age (marked as critical)"
Version Parse Failure:
- Trigger: Semver parsing fails for either version
- Result: Marked as CRITICAL with behindBy = -1
- Message: "Cannot parse version for comparison: <error> (marked as critical)"
8. Alert System
Files: pkg/alerts/service.go, pkg/api/alerts_handlers.go, pkg/api/alert_configs_handler.go, pkg/rules/engine.go, pkg/storage/queries/alerts.sql
Core Concepts
One Alert Per Config: Each alert configuration has at most ONE active alert at any time. When a scrape job discovers a new version, the existing alert is updated in place rather than creating a new alert. This ensures alerts always reflect the current state of the monitored system.
Soft Delete: Resolved alerts are soft-deleted (marked with resolved_at timestamp) rather than permanently deleted. This preserves alert history for auditing and analysis.
Alert Lifecycle
┌─────────────────────────────────────────────────────────────────┐
│ ALERT LIFECYCLE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────┐ ┌────────────────┐ ┌─────────────────────┐ │
│ │ CREATED │───▶│ VERSION UPDATE │───▶│ ACKNOWLEDGED │ │
│ │ │ │ │ │ │ │
│ │ Violation│ │ Alert updated │ │ User marks as │ │
│ │ detected │ │ with new │ │ reviewed │ │
│ └──────────┘ │ version data │ └─────────────────────┘ │
│ │ └────────────────┘ │ │
│ │ │ │ │
│ │ │ ┌─────────────────┘ │
│ │ │ │ │
│ │ ▼ ▼ │
│ │ ┌──────────────────┐ │
│ │ │ UNACKNOWLEDGED │ │
│ │ │ │ │
│ │ │ Ack reset when │ │
│ │ │ version changes │ │
│ │ └──────────────────┘ │
│ │ │ │
│ │ │ │
│ ▼ ▼ │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ RESOLVED │ │
│ │ Version updated, no longer violates │ │
│ │ → Alert soft-deleted (resolved_at = now) │ │
│ │ → Preserved in history via /alerts/resolved endpoint │ │
│ └────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
Database Schema
Key Fields:
| Field | Type | Description |
|---|---|---|
| alert_config_id | BIGINT | Link to alert configuration (unique per active alert) |
| discovered_version | VARCHAR | Version found by scrape job |
| latest_version | VARCHAR | Latest upstream version |
| behind_by | INT | How far behind (days, majors, or minors) |
| severity | ENUM | moderate, high, or critical |
| is_acknowledged | BOOLEAN | Whether user has acknowledged |
| resolved_at | TIMESTAMP | When resolved (NULL = active) |
Unique Constraint: Partial unique index on alert_config_id WHERE resolved_at IS NULL
- Ensures only ONE active alert per config
- Allows multiple resolved alerts in history
Upsert Behavior
Alerts use a partial unique index for upsert operations:
INSERT INTO alerts (...)
ON CONFLICT (alert_config_id) WHERE resolved_at IS NULL
DO UPDATE SET
discovered_version = EXCLUDED.discovered_version,
latest_version = EXCLUDED.latest_version,
behind_by = EXCLUDED.behind_by,
severity = EXCLUDED.severity,
details = EXCLUDED.details,
updated_at = CURRENT_TIMESTAMP,
-- Only reset acknowledgement when version actually changes
is_acknowledged = CASE
WHEN alerts.discovered_version != EXCLUDED.discovered_version THEN FALSE
ELSE alerts.is_acknowledged
END,
acknowledged_by = CASE
WHEN alerts.discovered_version != EXCLUDED.discovered_version THEN NULL
ELSE alerts.acknowledged_by
END,
acknowledged_at = CASE
WHEN alerts.discovered_version != EXCLUDED.discovered_version THEN NULL
ELSE alerts.acknowledged_at
END
Key Behaviors:
- Single Alert: Each config has at most one active alert
- In-Place Updates: When the scrape job discovers a new version, the existing alert updates with the new version data
- Smart Acknowledgment Reset: Acknowledgment is only cleared when the discovered version changes, not on every evaluation
- Preserved History: Resolved alerts remain in the database with resolved_at set
Alert Resolution (Soft Delete)
When an alert config no longer violates (e.g., version was updated):
-- ResolveAlert: Soft delete by setting resolved_at
UPDATE alerts
SET resolved_at = CURRENT_TIMESTAMP,
updated_at = CURRENT_TIMESTAMP
WHERE id = @id AND resolved_at IS NULL
RETURNING *;
Resolution triggers notification: When an alert is resolved, an alert.resolved notification is dispatched to configured channels.
Audit Trail (Alert Actions)
Every alert state change is recorded in the audit_trail table using a decorator pattern. The AuditedAlertService wraps the AlertService interface and transparently records entries after each successful operation.
Tracked Actions:
| Action | Trigger | Source |
|---|---|---|
created | Rule evaluation creates a new alert | system |
escalated | Rule evaluation increases severity | system |
acknowledge | User acknowledges via UI/API/webhook | ui, api, or webhook |
unacknowledge | User removes acknowledgment, or version changes reset it | ui, api, or system |
resolve | User manually resolves, or rule evaluation resolves | ui, api, or system |
Source Determination: The source is derived from the authentication method used:
- JWT authentication → ui (browser-based user action)
- API key authentication → api (programmatic access)
- Webhook callback token → webhook (external callback)
- No auth context (system operation) → system (rules engine, auto-reset)
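The precedence above can be sketched as a simple switch. This is an illustrative sketch: the AuthContext type and field names are assumptions, not Planekeeper's actual request types.

```go
package main

import "fmt"

// AuthContext is a hypothetical view of the request's auth state;
// the real middleware inspects headers and context locals instead.
type AuthContext struct {
	HasJWT          bool
	HasAPIKey       bool
	HasWebhookToken bool
}

// deriveSource maps the authentication method to an audit-trail source,
// following the precedence described above (JWT wins over API key, etc.).
func deriveSource(a AuthContext) string {
	switch {
	case a.HasJWT:
		return "ui" // browser-based user action
	case a.HasAPIKey:
		return "api" // programmatic access
	case a.HasWebhookToken:
		return "webhook" // external callback
	default:
		return "system" // rules engine, auto-reset
	}
}

func main() {
	fmt.Println(deriveSource(AuthContext{HasJWT: true}))
	fmt.Println(deriveSource(AuthContext{}))
}
```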
Data Model: The audit_trail table uses a polymorphic design (entity_type + entity_id) that can be extended to other entity types in the future. Each entry stores the action type, source, optional actor email (for user-initiated actions), and optional metadata as JSONB. Audit records persist independently of entity lifecycle (no cascade delete on the alert FK).
API Endpoint: GET /alerts/{id}/actions returns paginated audit trail entries for a specific alert. The alert detail UI merges these entries with notification deliveries into a unified activity timeline sorted by timestamp.
Prometheus Metrics: The audit writer exposes planekeeper_audit_writer_events_written_total, planekeeper_audit_writer_persist_errors_total, and planekeeper_audit_writer_insert_duration_seconds_total via the /metrics endpoint.
Auto-Triggers (Event-Driven)
Rule evaluation is triggered asynchronously via the event bus system (pkg/events/). When triggering events occur, they are published to the event bus, and the RuleEvaluationSubscriber handles the evaluation in a goroutine.
Event Flow:
┌─────────────────────┐ ┌──────────────────┐ ┌─────────────────────────┐
│ Job Completion / │───▶│ Event Bus │───▶│ RuleEvaluationSubscriber│
│ Config Change │ │ (pkg/events/) │ │ (async goroutine) │
└─────────────────────┘ └──────────────────┘ └─────────────────────────┘
Triggering Events:
| Event Type | Trigger | Published By |
|---|---|---|
job.scrape.completed | Scrape job completes successfully | tasks_handler.go |
job.gather.completed | Gather job completes successfully | tasks_handler.go |
alert_config.created | Alert config created and active | alert_configs_handler.go |
alert_config.updated | Alert config updated and active | alert_configs_handler.go |
alert_config.toggled | Alert config toggled to active | alert_configs_handler.go |
Benefits of Event-Driven Triggers:
- Decoupling: Handlers don’t need direct references to the rule evaluation logic
- Non-blocking: HTTP handlers return immediately; evaluation runs async
- Extensibility: Other subscribers can react to the same events (metrics, logging, etc.)
- Scalability: Event bus can be swapped for external queue (Redis, NATS) for distributed evaluation
Alert Config Composition
An alert config links three entities:
┌─────────────────────────────────────────────────────────────┐
│ ALERT CONFIG │
├─────────────────────────────────────────────────────────────┤
│ │
│ scrape_job_id ───▶ "What version did we deploy?" │
│ (discovered_version) │
│ │
│ gather_job_id ───▶ "What's the latest available?" │
│ (latest_version) │
│ │
│ rule_id ───▶ "How do we measure staleness?" │
│ (days_behind, majors_behind, etc.) │
│ │
│ UNIQUE (org_id, scrape_job_id, gather_job_id, rule_id) │
│ │
└─────────────────────────────────────────────────────────────┘
API Endpoints
Active Alerts:
| Method | Path | Description |
|---|---|---|
| GET | /alerts | List active (non-resolved) alerts |
| GET | /alerts/summary | Count active alerts by severity |
| POST | /alerts/{id}/acknowledge | Acknowledge an active alert |
| POST | /alerts/{id}/unacknowledge | Remove acknowledgement |
| POST | /alerts/acknowledge | Bulk acknowledge multiple alerts |
Resolved Alerts (History):
| Method | Path | Description |
|---|---|---|
| GET | /alerts/resolved | List resolved alerts with pagination |
Query Parameters (for both endpoints):
- limit - Max results (default 50, max 100)
- offset - Pagination offset
- severity - Filter by severity level
- unacknowledged_only - Only show unacknowledged (active only)
Notification Events
The alert system dispatches notifications for lifecycle events:
| Event | Trigger |
|---|---|
alert.created | New alert created (first violation) |
alert.escalated | Severity increased (e.g., high → critical) |
alert.acknowledged | User acknowledges via API |
alert.unacknowledged | User removes acknowledgement |
alert.resolved | Version updated, no longer violates |
Note: Non-escalating updates (same severity, just refreshed data) do not trigger notifications.
Event-Driven Alert Service
Files: pkg/alerts/service.go
All alert state changes flow through a centralized alert service (pkg/alerts/Service) that automatically dispatches notifications. This ensures consistent notification behavior regardless of how alerts are modified (API, webhook, rules engine).
Architecture:
┌─────────────────────────────────────────────────────────────────────┐
│ ALERT STATE CHANGES │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────┐ │
│ │ API Handlers │──┐ │
│ │ (ack/unack) │ │ │
│ └─────────────────┘ │ │
│ │ ┌─────────────────────────────────────┐ │
│ ┌─────────────────┐ │ │ ALERT SERVICE │ │
│ │ Webhook Ack │──┼─────▶│ pkg/alerts/service.go │ │
│ │ (external) │ │ │ │ │
│ └─────────────────┘ │ │ • Acknowledge() │ │
│ │ │ • Unacknowledge() │ │
│ ┌─────────────────┐ │ │ • Upsert() (create/update) │ │
│ │ Rules Engine │──┘ │ • Resolve() │ │
│ │ (evaluation) │ │ │ │
│ └─────────────────┘ │ ───────────────────────────────── │ │
│ │ Auto-dispatches notifications: │ │
│ │ • alert.created │ │
│ │ • alert.escalated │ │
│ │ • alert.acknowledged │ │
│ │ • alert.unacknowledged │ │
│ │ • alert.resolved │ │
│ └──────────────┬──────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────┐ │
│ │ Notification Dispatcher │ │
│ │ pkg/notifications/dispatcher.go │ │
│ └─────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────┘
Service Methods:
| Method | Description | Event Dispatched |
|---|---|---|
Acknowledge() | Mark alert acknowledged | alert.acknowledged |
Unacknowledge() | Remove acknowledgment | alert.unacknowledged |
Upsert() | Create or update alert | alert.created or alert.escalated |
Resolve() | Mark alert resolved | alert.resolved |
ResolveByConfigID() | Resolve by config | alert.resolved |
BulkAcknowledge() | Bulk operation | None (avoids spam) |
Benefits:
- Consistent Notifications: Every state change automatically dispatches events
- Single Source of Truth: All alert logic centralized in one service
- Thin Handlers: API handlers become simple pass-through to service
- Testability: Service can be mocked for unit testing
Usage Pattern:
// Before (scattered notification dispatch):
alert, err := repo.AcknowledgeAlert(ctx, params)
dispatcher.DispatchForAlert(ctx, alert, EventAlertAcknowledged)
// After (single service call does everything):
alert, err := alertService.Acknowledge(ctx, params)
// Notification automatically dispatched
Event Bus System
Files: pkg/events/bus.go, pkg/events/types.go, pkg/events/subscribers.go
The system uses an in-process event bus for decoupled asynchronous communication between components. This enables loose coupling while maintaining reliability within a single process.
Architecture:
┌─────────────────────────────────────────────────────────────────────┐
│ EVENT BUS SYSTEM │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ Publishers Event Bus Subscribers │
│ ─────────── ───────── ─────────── │
│ │
│ ┌─────────────────┐ ┌─────────────────────┐ │
│ │ TasksHandler │──┐ │ │ ┌──────────────┐ │
│ │ (job completed) │ │ │ In-process Bus │ │ RuleEval │ │
│ └─────────────────┘ │ │ │ │ Subscriber │ │
│ ├──▶│ • Buffered channel │──▶│ │ │
│ ┌─────────────────┐ │ │ • Async delivery │ │ Evaluates │ │
│ │ AlertConfigs │──┘ │ • Panic recovery │ │ rules for │ │
│ │ (config change) │ │ │ │ organization │ │
│ └─────────────────┘ └─────────────────────┘ └──────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────┘
Event Types:
| Event Type | Description | Payload |
|---|---|---|
job.scrape.completed | Scrape job finished | JobCompletedEvent |
job.gather.completed | Gather job finished | JobCompletedEvent |
job.helm_sync.completed | Helm sync job finished | JobCompletedEvent |
alert_config.created | Alert config created | AlertConfigChangedEvent |
alert_config.updated | Alert config updated | AlertConfigChangedEvent |
alert_config.toggled | Alert config toggled | AlertConfigChangedEvent |
Event Bus Features:
- Buffered Channel: Default buffer of 100 events; non-blocking publish with overflow warning
- Async Delivery: Single dispatcher goroutine processes events sequentially
- Panic Recovery: Handler panics are caught and logged without crashing the bus
- Graceful Shutdown: Stop() method drains pending events before closing
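A minimal in-process bus with these four features can be sketched as follows. This is a rough approximation, not pkg/events' actual API; the type and method names are assumptions.

```go
package main

import (
	"fmt"
	"log"
	"sync"
)

type EventType string

type Event struct {
	Type    EventType
	Payload any
}

type Handler func(Event)

type Bus struct {
	mu       sync.RWMutex
	handlers map[EventType][]Handler
	ch       chan Event
	done     chan struct{}
}

func NewBus(buffer int) *Bus {
	b := &Bus{
		handlers: make(map[EventType][]Handler),
		ch:       make(chan Event, buffer), // buffered channel (the text says default 100)
		done:     make(chan struct{}),
	}
	go b.dispatch() // single dispatcher goroutine processes events sequentially
	return b
}

func (b *Bus) Subscribe(t EventType, h Handler) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.handlers[t] = append(b.handlers[t], h)
}

// Publish is non-blocking: when the buffer is full the event is dropped
// with an overflow warning instead of stalling the caller.
func (b *Bus) Publish(e Event) {
	select {
	case b.ch <- e:
	default:
		log.Printf("event bus buffer full, dropping %s", e.Type)
	}
}

func (b *Bus) dispatch() {
	for e := range b.ch {
		b.mu.RLock()
		hs := b.handlers[e.Type]
		b.mu.RUnlock()
		for _, h := range hs {
			b.safeCall(h, e) // per-handler panic recovery
		}
	}
	close(b.done)
}

func (b *Bus) safeCall(h Handler, e Event) {
	defer func() {
		if r := recover(); r != nil {
			log.Printf("handler panic for %s: %v", e.Type, r)
		}
	}()
	h(e)
}

// Stop drains pending events before returning.
func (b *Bus) Stop() {
	close(b.ch)
	<-b.done
}

func main() {
	bus := NewBus(100)
	bus.Subscribe("job.scrape.completed", func(e Event) {
		fmt.Println("evaluating rules for", e.Payload)
	})
	bus.Publish(Event{Type: "job.scrape.completed", Payload: "org-42"})
	bus.Stop()
}
```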
Subscriber Pattern:
// Subscribers implement their own logic
type RuleEvaluationSubscriber struct {
db *postgres.Database
alertService *alerts.Service
}
// Subscribe registers handlers for relevant event types
func (s *RuleEvaluationSubscriber) Subscribe(bus *events.Bus) {
bus.SubscribeMultiple([]events.EventType{
events.EventJobScrapeCompleted,
events.EventJobGatherCompleted,
}, s.handleJobCompleted)
}
Future Extensibility:
The event bus interface can be swapped for an external message queue (Redis Streams, NATS, RabbitMQ) if horizontal scaling requires distributed event processing. The subscriber pattern remains the same; only the transport layer changes.
9. Notification System
Files: pkg/notifications/, pkg/api/notification_*_handlers.go, app/notifier/, pkg/storage/queries/notification_*.sql
Purpose
Deliver notifications about alert events to external systems via webhooks and other channel types. The system supports configurable routing rules, retry logic with exponential backoff, and acknowledgment callbacks.
Architecture
┌──────────────────────────────────────────────────────────────────┐
│ SERVER │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Dispatcher │ │
│ │ - Evaluates notification rules on alert events │ │
│ │ - Creates delivery records in notification_deliveries │ │
│ │ - Does NOT send webhooks directly │ │
│ └──────────────────────────┬──────────────────────────────────┘ │
└─────────────────────────────┼────────────────────────────────────┘
│ INSERT into notification_deliveries
▼
┌─────────────────────────────────────────────────────────────────┐
│ PostgreSQL │
│ notification_deliveries (status: pending) │
└─────────────────────────────┬───────────────────────────────────┘
│ FOR UPDATE SKIP LOCKED
▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ notifier │ │ notifier │ │ notifier │
│ replica 1 │ │ replica 2 │ │ replica N │
│ (worker) │ │ (worker) │ │ (worker) │
└─────────────┘ └─────────────┘ └─────────────┘
│ │ │
└────────────────┼────────────────┘
▼
External Webhooks
9.1 Channel Types
| Channel Type | Status | Description |
|---|---|---|
webhook | Implemented | HTTP POST to external URL with JSON payload |
pagerduty | Planned | Native PagerDuty Events API integration |
telegram | Planned | Telegram Bot API integration |
smtp | Planned | Email notifications |
9.2 Notification Channels
Table: notification_channels
Channels define delivery endpoints (webhooks) with organization-scoped configuration.
| Field | Description |
|---|---|
id | Unique identifier |
organization_id | Owning organization |
name | Human-readable name |
channel_type | Type: webhook, pagerduty, telegram, smtp |
config | JSONB with channel-specific configuration |
is_active | Whether channel is enabled |
max_retries | Per-channel retry override (NULL = global default) |
last_test_at | Last test timestamp |
last_test_success | Last test result |
Webhook Config Structure (stored in config JSONB):
{
"url": "https://example.com/webhook",
"method": "POST",
"headers": {"Authorization": "Bearer xxx"},
"timeout_seconds": 30,
"ack_enabled": true,
"secret": "hmac-signing-secret",
"event_templates": {
"new_alert": "",
"acknowledged": "",
"resolved": ""
}
}
9.2.1 Event-Specific Templates
The notification system supports event-specific templates that allow customizing webhook payloads for different alert lifecycle events. This enables integration with services like Discord and Slack that require specific payload formats.
Template Categories:
| Category | Events | Description |
|---|---|---|
new_alert | alert.created, alert.escalated, test | New or escalated alerts |
acknowledged | alert.acknowledged, alert.unacknowledged | Acknowledgment state changes |
resolved | alert.resolved | Alert resolution |
Template Resolution Priority:
Templates are resolved in order of specificity (most specific wins):
1. Channel-specific template (config.event_templates.X)
│
▼ (if empty)
2. Organization-level template (settings table)
│
▼ (if empty)
3. Global default template (settings table, org_id = NULL)
│
▼ (if empty)
4. Standard JSON payload (no template)
This allows:
- Global defaults for all organizations (generic JSON for standard webhooks)
- Organization overrides for org-wide customization
- Channel-specific templates for platforms like Discord/Slack
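The four-level fallback reduces to "first non-empty template wins"; a sketch (function and argument names are illustrative, not the real resolver's signature):

```go
package main

import "fmt"

// resolveTemplate returns the most specific non-empty template, mirroring
// the priority chain above: channel config, then org settings, then the
// global default. ok=false means "use the standard JSON payload".
func resolveTemplate(channelTpl, orgTpl, globalTpl string) (tpl string, ok bool) {
	for _, t := range []string{channelTpl, orgTpl, globalTpl} {
		if t != "" {
			return t, true
		}
	}
	return "", false // level 4: no template, send the standard JSON payload
}

func main() {
	tpl, ok := resolveTemplate("", `{"content":"org default"}`, `{"text":"global"}`)
	fmt.Println(ok, tpl)
}
```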
Global Template Settings Keys:
| Setting Key | Category |
|---|---|
notification.template.alert.new | New/escalated alerts |
notification.template.alert.acknowledged | Acknowledgment events |
notification.template.alert.resolved | Resolution events |
9.2.2 Template Variables
Common Variables (available in all templates):
{{.IdempotencyKey}} - UUID, stable across retries
{{.Event}} - Event type (alert.created, etc.)
{{.Timestamp}} - ISO8601 timestamp
{{.Alert.ID}} - Alert ID
{{.Alert.ConfigName}} - Alert config name
{{.Alert.RuleName}} - Rule name
{{.Alert.RuleType}} - days_behind, majors_behind, etc.
{{.Alert.Severity}} - critical, high, moderate
{{.Alert.DiscoveredVersion}} - Current deployed version
{{.Alert.LatestVersion}} - Latest available version
{{.Alert.BehindBy}} - Number (days, versions, etc.)
{{.Alert.ArtifactName}} - Upstream artifact name
{{.Alert.RepositoryURL}} - Scrape job repository URL
{{.Alert.TargetFile}} - Scrape job target file
Event-Specific Variables:
| Variable | Available In | Description |
|---|---|---|
{{.AcknowledgeURL}} | new_alert only | Callback URL for one-click acknowledgment |
{{.AcknowledgedBy}} | acknowledged only | Email/identifier of acknowledging user |
{{.AcknowledgedAt}} | acknowledged only | ISO8601 timestamp of acknowledgment |
{{.IsAcknowledged}} | acknowledged only | true for acknowledged, false for unacknowledged |
{{.ResolvedAt}} | resolved only | ISO8601 timestamp of resolution |
{{.PreviousSeverity}} | new_alert (escalated) | Previous severity before escalation |
Template Functions:
| Function | Description | Example |
|---|---|---|
upper | Uppercase string | {{.Alert.Severity | upper}} → CRITICAL |
lower | Lowercase string | {{.Event | lower}} → alert.created |
json | JSON encode value | {{.Alert | json}} → {"id":1,...} |
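These helpers map naturally onto Go's text/template with a FuncMap; the wiring below is an assumption about how the renderer is built, but the pipeline syntax matches the variables and functions documented above.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"strings"
	"text/template"
)

// render parses and executes an event template with the upper/lower/json
// helper functions registered. Assumed sketch, not the notifier's real code.
func render(tpl string, data any) (string, error) {
	t, err := template.New("webhook").Funcs(template.FuncMap{
		"upper": strings.ToUpper,
		"lower": strings.ToLower,
		"json": func(v any) (string, error) {
			b, err := json.Marshal(v)
			return string(b), err
		},
	}).Parse(tpl)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := t.Execute(&buf, data); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	data := map[string]any{
		"Alert": map[string]any{"Severity": "critical", "ConfigName": "K8s Dashboard"},
	}
	out, err := render(`{"content": "{{.Alert.Severity | upper}}: {{.Alert.ConfigName}}"}`, data)
	fmt.Println(out, err)
}
```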
9.2.3 Platform-Specific Examples
Generic JSON Webhook (default format when no template configured):
The system sends a structured JSON payload by default:
{
"idempotency_key": "uuid-here",
"event": "alert.created",
"timestamp": "2024-01-15T10:30:00Z",
"alert": {
"id": 123,
"config_name": "My Alert Config",
"rule_name": "Critical Updates",
"severity": "critical",
"discovered_version": "1.0.0",
"latest_version": "2.0.0",
"behind_by": 30
},
"acknowledge_url": "https://app.example.com/api/v1/webhooks/acknowledge/token"
}
Discord Webhook:
Discord requires a content field. Configure these channel-specific templates:
New Alert Template:
{"content": "🚨 **{{.Alert.Severity | upper}} Alert**: {{.Alert.ConfigName}}\n\n**Artifact:** {{.Alert.ArtifactName}}\n**Current:** {{.Alert.DiscoveredVersion}} → **Latest:** {{.Alert.LatestVersion}}\n**Behind by:** {{.Alert.BehindBy}} {{.Alert.RuleType}}\n\n[Acknowledge]({{.AcknowledgeURL}})"}
Acknowledged Template:
{"content": "{{if .IsAcknowledged}}✅ **Acknowledged**{{else}}🔄 **Unacknowledged**{{end}}: {{.Alert.ConfigName}} - {{.Alert.ArtifactName}}{{if .IsAcknowledged}}\n\nAcknowledged by {{.AcknowledgedBy}}{{end}}"}
Resolved Template:
{"content": "🎉 **Resolved**: {{.Alert.ConfigName}} - {{.Alert.ArtifactName}}\n\nThe version has been updated and no longer triggers this alert."}
Slack Webhook:
Slack uses a text field for simple messages:
New Alert Template:
{"text": ":rotating_light: *{{.Alert.Severity | upper}}*: {{.Alert.ConfigName}}\nArtifact: {{.Alert.ArtifactName}}\nVersion {{.Alert.DiscoveredVersion}} is {{.Alert.BehindBy}} behind latest ({{.Alert.LatestVersion}})\n<{{.AcknowledgeURL}}|Click to Acknowledge>"}
Acknowledged Template:
{"text": "{{if .IsAcknowledged}}:white_check_mark: *Acknowledged*{{else}}:arrows_counterclockwise: *Unacknowledged*{{end}}: {{.Alert.ConfigName}} - {{.Alert.ArtifactName}}{{if .IsAcknowledged}}\nAcknowledged by {{.AcknowledgedBy}}{{end}}"}
Resolved Template:
{"text": ":tada: *Resolved*: {{.Alert.ConfigName}} - {{.Alert.ArtifactName}}\nThe alert has been automatically resolved."}
Slack Block Kit (rich formatting):
For more sophisticated Slack messages, use Block Kit:
New Alert Template:
{
"blocks": [
{
"type": "header",
"text": {"type": "plain_text", "text": "🚨 {{.Alert.Severity | upper}} Alert"}
},
{
"type": "section",
"fields": [
{"type": "mrkdwn", "text": "*Config:*\n{{.Alert.ConfigName}}"},
{"type": "mrkdwn", "text": "*Artifact:*\n{{.Alert.ArtifactName}}"},
{"type": "mrkdwn", "text": "*Current:*\n{{.Alert.DiscoveredVersion}}"},
{"type": "mrkdwn", "text": "*Latest:*\n{{.Alert.LatestVersion}}"}
]
},
{
"type": "actions",
"elements": [
{
"type": "button",
"text": {"type": "plain_text", "text": "Acknowledge"},
"url": "{{.AcknowledgeURL}}",
"style": "primary"
}
]
}
]
}
Microsoft Teams (Adaptive Cards):
New Alert Template:
{
"@type": "MessageCard",
"themeColor": "FF0000",
"title": "{{.Alert.Severity | upper}} Alert: {{.Alert.ConfigName}}",
"sections": [{
"facts": [
{"name": "Artifact", "value": "{{.Alert.ArtifactName}}"},
{"name": "Current Version", "value": "{{.Alert.DiscoveredVersion}}"},
{"name": "Latest Version", "value": "{{.Alert.LatestVersion}}"},
{"name": "Behind By", "value": "{{.Alert.BehindBy}}"}
]
}],
"potentialAction": [{
"@type": "OpenUri",
"name": "Acknowledge",
"targets": [{"os": "default", "uri": "{{.AcknowledgeURL}}"}]
}]
}
PagerDuty Events API v2:
New Alert Template:
{
"routing_key": "YOUR_INTEGRATION_KEY",
"event_action": "trigger",
"dedup_key": "{{.IdempotencyKey}}",
"payload": {
"summary": "{{.Alert.Severity | upper}}: {{.Alert.ConfigName}} - {{.Alert.ArtifactName}} is {{.Alert.BehindBy}} behind",
"source": "planekeeper",
"severity": "{{.Alert.Severity}}",
"custom_details": {
"discovered_version": "{{.Alert.DiscoveredVersion}}",
"latest_version": "{{.Alert.LatestVersion}}",
"repository": "{{.Alert.RepositoryURL}}",
"target_file": "{{.Alert.TargetFile}}"
}
},
"links": [{"href": "{{.AcknowledgeURL}}", "text": "Acknowledge in Planekeeper"}]
}
Resolved Template:
{
"routing_key": "YOUR_INTEGRATION_KEY",
"event_action": "resolve",
"dedup_key": "{{.IdempotencyKey}}"
}
9.2.4 Configuring Templates
Via UI:
- Navigate to Notification Channels → Edit channel
- Check “Use Event-Specific Templates”
- Enter templates for each event category (New Alert, Acknowledged, Resolved)
- Leave empty to inherit from organization/global defaults
Via API:
# Create channel with event-specific templates
curl -X POST /api/v1/client/notification-channels \
-H "X-API-Key: pk_..." \
-d '{
"name": "Discord Alerts",
"channel_type": "webhook",
"config": {
"url": "https://discord.com/api/webhooks/...",
"event_templates": {
"new_alert": "{\"content\": \"🚨 **{{.Alert.Severity}}**: {{.Alert.ConfigName}}\"}",
"acknowledged": "{\"content\": \"✅ Acknowledged: {{.Alert.ConfigName}}\"}",
"resolved": "{\"content\": \"🎉 Resolved: {{.Alert.ConfigName}}\"}"
}
}
}'
Organization-Level Defaults:
Set organization-wide templates via the settings API:
# Set org default for new alerts
curl -X PUT /api/v1/client/settings/notification.template.alert.new \
-H "X-API-Key: pk_..." \
-d '{"value": "{\"content\": \"🚨 {{.Alert.Severity}}: {{.Alert.ConfigName}}\"}"}'
9.2.5 Template Best Practices
Use channel-specific templates for non-standard webhooks: Discord, Slack, Teams, and PagerDuty all have specific payload formats. Configure these at the channel level.
Keep global defaults generic: The default templates produce standard JSON suitable for custom integrations. Don’t change these unless you want all organizations to use a specific format.
Test templates before enabling: Use the channel test endpoint to verify your template produces valid output for the target platform.
Include relevant context per event type:
- New alerts: Include acknowledge URL, version details, severity
- Acknowledged: Include who acknowledged and when
- Resolved: Keep it simple - the alert is no longer actionable
Escape special characters: JSON strings require escaping. Use \n for newlines and \" for quotes within strings.
Use template functions: upper, lower, and json help format output appropriately for different platforms.
9.3 Notification Rules
Table: notification_rules
Rules define routing logic: which events go to which channels based on severity and event type.
| Field | Description |
|---|---|
id | Unique identifier |
organization_id | Owning organization |
name | Human-readable name |
severity_filter | Array of severities to match (empty = all) |
event_filter | Array of event types to match (empty = all) |
channel_id | Override channel (NULL = use org default) |
group_interval | Group alerts within this window |
repeat_interval | Don’t repeat for same alert within this window |
is_active | Whether rule is enabled |
priority | Higher priority rules evaluated first |
Event Types:
| Event | Description |
|---|---|
alert.created | New violation detected |
alert.escalated | Severity increased |
alert.acknowledged | Alert marked as acknowledged |
alert.unacknowledged | Acknowledgment reset (re-violation) |
alert.resolved | Version updated, no longer violates |
Rule Evaluation:
1. Get active rules for org, ordered by priority DESC
│
▼
2. For each rule:
├── Check severity_filter (empty = match all)
├── Check event_filter (empty = match all)
│
└── If match:
├── Check group/repeat intervals (prevent spam)
├── Get channel (rule override or org default)
└── Create delivery record
│
▼
3. Deduplicate channels (same alert → same channel only once)
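The filter check in step 2 can be sketched as below; the Rule struct is a simplified stand-in for the notification_rules row, keeping only the two filter columns.

```go
package main

import "fmt"

// Rule is a hypothetical, trimmed-down view of a notification_rules row.
type Rule struct {
	SeverityFilter []string // empty = match all severities
	EventFilter    []string // empty = match all event types
}

func contains(xs []string, x string) bool {
	for _, v := range xs {
		if v == x {
			return true
		}
	}
	return false
}

// matches implements the semantics from step 2 above: an empty filter
// array matches everything; a non-empty one must contain the value.
func matches(r Rule, severity, event string) bool {
	if len(r.SeverityFilter) > 0 && !contains(r.SeverityFilter, severity) {
		return false
	}
	if len(r.EventFilter) > 0 && !contains(r.EventFilter, event) {
		return false
	}
	return true
}

func main() {
	r := Rule{SeverityFilter: []string{"critical", "high"}}
	fmt.Println(matches(r, "critical", "alert.created")) // severity matches, empty event filter
	fmt.Println(matches(r, "moderate", "alert.created")) // severity filtered out
}
```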
9.4 Delivery Lifecycle
Table: notification_deliveries
Tracks the state and history of each notification delivery attempt.
┌─────────────────────────────────────────────────────────────────┐
│ DELIVERY LIFECYCLE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────┐ ┌─────────────┐ ┌───────────┐ │
│ │ PENDING │───▶│ IN_PROGRESS │───▶│ SUCCEEDED │ │
│ └─────────┘ └─────────────┘ └───────────┘ │
│ │ │ │
│ │ │ Error/timeout │
│ │ ▼ │
│ │ ┌──────────┐ │
│ │ │ FAILED │◀─────────────────┐ │
│ │ └──────────┘ │ │
│ │ │ │ │
│ │ │ Retry (attempts < max)│ │
│ │ ▼ │ │
│ │ ┌─────────────┐ │ │
│ │ │ IN_PROGRESS │───────────────┘ │
│ │ └─────────────┘ │
│ │ │ │
│ │ │ Max attempts exceeded │
│ │ ▼ │
│ │ ┌─────────────┐ │
│ └────────▶│ DEAD_LETTER │ │
│ └─────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
Status Transitions:
| From | To | Trigger |
|---|---|---|
| - | pending | Dispatcher creates delivery |
| pending | in_progress | Notifier claims delivery (SKIP LOCKED) |
| in_progress | succeeded | 2xx HTTP response |
| in_progress | failed | Error or non-2xx response (retries remain) |
| failed | in_progress | Retry timer expires |
| failed | dead_letter | Max attempts exceeded or 24h TTL |
9.5 Retry Logic
Multi-tier Exponential Backoff:
| Tier | Attempts | Delays | Use Case |
|---|---|---|---|
| Short-term | 1-4 | 10s, 30s, 1m, 5m | Transient errors |
| Mid-term | 5-8 | 15m, 30m, 1h, 2h | Service degradation |
| Long-term | 9-12 | 4h, 4h, 4h, 4h | Extended outage |
Total TTL: ~24 hours from first attempt → dead_letter
Jitter: Full jitter applied (delay = random(0, computedDelay)) to prevent thundering herd.
Retry-After Header: Honored from 429/503 responses.
Non-retryable Errors:
- 4xx responses (except 429) → immediate dead_letter
- Invalid configuration → immediate dead_letter
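The tiered schedule plus full jitter can be sketched like this; the table of base delays is taken from above, while the function shape is an assumption about the notifier's internals.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// baseDelays mirrors the three retry tiers above (attempts 1-12).
var baseDelays = []time.Duration{
	10 * time.Second, 30 * time.Second, time.Minute, 5 * time.Minute, // short-term
	15 * time.Minute, 30 * time.Minute, time.Hour, 2 * time.Hour, // mid-term
	4 * time.Hour, 4 * time.Hour, 4 * time.Hour, 4 * time.Hour, // long-term
}

// nextDelay applies full jitter: delay = random(0, computedDelay),
// which prevents the thundering-herd effect on a recovering endpoint.
func nextDelay(attempt int) time.Duration {
	if attempt < 1 || attempt > len(baseDelays) {
		return 0 // past the schedule: caller should dead-letter, not retry
	}
	base := baseDelays[attempt-1]
	return time.Duration(rand.Int63n(int64(base) + 1))
}

func main() {
	for attempt := 1; attempt <= 12; attempt++ {
		fmt.Printf("attempt %2d: base %v, jittered %v\n",
			attempt, baseDelays[attempt-1], nextDelay(attempt))
	}
}
```

Note the Retry-After handling described above would override this computed delay when a 429/503 response supplies one.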
9.6 Webhook Payload
Default JSON Payload:
{
"idempotency_key": "550e8400-e29b-41d4-a716-446655440000",
"event": "alert.created",
"timestamp": "2026-02-04T12:00:00Z",
"alert": {
"id": 123,
"config_name": "K8s Dashboard Version Check",
"rule_name": "Days Behind",
"severity": "critical",
"discovered_version": "1.25.0",
"latest_version": "1.30.0",
"behind_by": 45,
"artifact_name": "kubernetes/kubernetes",
"repository_url": "https://github.com/org/repo",
"target_file": "chart/Chart.yaml"
},
"acknowledge_url": "https://planekeeper.example.com/api/v1/webhooks/acknowledge/{token}"
}
Headers:
- Content-Type: application/json
- X-Planekeeper-Signature: sha256={hmac} (if secret configured)
- X-Planekeeper-Timestamp: {unix_seconds}
- X-Planekeeper-Event: alert.created
- X-Planekeeper-Idempotency-Key: {uuid}
- Custom headers from channel config
HMAC Signature: HMAC-SHA256(secret, timestamp + "." + body)
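The signature scheme above maps directly onto the stdlib. The signing side is given by the formula; the verify helper is what a receiving webhook would plausibly run (constant-time comparison via hmac.Equal).

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// sign computes HMAC-SHA256(secret, timestamp + "." + body) and formats it
// as the X-Planekeeper-Signature header value.
func sign(secret, timestamp, body string) string {
	mac := hmac.New(sha256.New, []byte(secret))
	mac.Write([]byte(timestamp + "." + body))
	return "sha256=" + hex.EncodeToString(mac.Sum(nil))
}

// verify recomputes the signature and compares in constant time.
func verify(secret, timestamp, body, header string) bool {
	return hmac.Equal([]byte(sign(secret, timestamp, body)), []byte(header))
}

func main() {
	sig := sign("hmac-signing-secret", "1738665600", `{"event":"alert.created"}`)
	fmt.Println(sig)
	fmt.Println(verify("hmac-signing-secret", "1738665600", `{"event":"alert.created"}`, sig))
}
```

A receiver should also reject stale X-Planekeeper-Timestamp values to limit replay windows.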
9.7 Inbound Acknowledgment
Endpoint: POST /api/v1/webhooks/acknowledge/{token}
External systems can acknowledge alerts by calling the acknowledge_url included in the webhook payload.
Token Properties:
- Generated per delivery
- Stored in the notification_ack_tokens table
- Expires after 24 hours (configurable)
- Single-use (marked as used after acknowledgment)
Flow:
External System Planekeeper
│ │
│ POST /webhooks/acknowledge/xyz │
│────────────────────────────────▶│
│ │ Lookup token
│ │ Validate not expired
│ │ Validate not already used
│ │ Mark alert as acknowledged
│ │ Mark token as used
│ 200 OK │
│◀────────────────────────────────│
9.8 SSRF Protection
Webhook URLs are validated to prevent Server-Side Request Forgery:
Blocked by default:
- Private IP ranges (RFC 1918): 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16
- Localhost: 127.0.0.0/8, ::1
- Link-local: 169.254.0.0/16, fe80::/10
Allowed schemes: https:// (default), optionally http://
Environment variable: NOTIFICATION_ALLOW_PRIVATE_URLS=false (default)
9.9 Housekeeping
The notifier service runs periodic cleanup tasks:
| Task | Interval | Description |
|---|---|---|
| Expire ack tokens | 1h | Delete expired tokens from notification_ack_tokens |
| Cleanup expired pending | 1h | Move stuck deliveries to dead_letter after 24h |
| Purge old succeeded | daily | Delete succeeded deliveries older than 30 days |
9.10 API Endpoints
Notification Channels:
| Method | Path | Description |
|---|---|---|
| GET | /notification-channels | List org’s channels |
| POST | /notification-channels | Create channel |
| GET | /notification-channels/{id} | Get channel |
| PUT | /notification-channels/{id} | Update channel |
| DELETE | /notification-channels/{id} | Delete channel |
| POST | /notification-channels/{id}/test | Test channel connectivity |
| POST | /notification-channels/{id}/toggle | Toggle active state |
| GET | /notification-channels/{id}/stats | Get delivery statistics |
9.11 Channel Test Endpoint
The test endpoint (POST /notification-channels/{id}/test) performs comprehensive validation and sends a test notification.
Test Sequence:
- Config Validation: Validate webhook URL and template syntax
- Connectivity Check: Verify URL is reachable (optional HEAD request)
- Sample Delivery: Send actual test payload to webhook
- Record Results: Store test timestamp and success/failure
Response Structure (NotificationChannelTestResult):
{
"success": true,
"tested_at": "2026-02-05T12:00:00Z",
"idempotency_key": "test-550e8400-e29b-41d4-a716-446655440000",
"error": null,
"validation_errors": [],
"connectivity_check": {
"status": 200,
"latency_ms": 150,
"error": null
},
"sample_delivery": {
"status": 200,
"latency_ms": 450,
"response_preview": "OK",
"error": null
}
}
Error Response Examples:
Validation failure:
{
"success": false,
"tested_at": "2026-02-05T12:00:00Z",
"error": "Configuration validation failed",
"validation_errors": ["Invalid webhook URL: private IP addresses not allowed"]
}
Delivery failure:
{
"success": false,
"tested_at": "2026-02-05T12:00:00Z",
"error": "Test delivery failed",
"sample_delivery": {
"status": 400,
"latency_ms": 250,
"response_preview": "{\"message\":\"Invalid content type\"}",
"error": "webhook returned 400"
}
}
9.12 UI Error Handling Pattern
The clientui uses a standardized pattern for surfacing detailed API errors to users.
Error Flow:
┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ API │───▶│ Services │───▶│ Handler │───▶│ UI │
│ Response │ │ Layer │ │ Formatter │ │ Redirect │
└─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘
│ │ │ │
│ Detailed JSON │ Extract fields │ Build message │ URL-encoded
│ with nested │ (status, error, │ with context │ query param
│ error objects │ preview, etc.) │ (HTTP codes, │ (?error=...)
│ │ │ response text) │
Services Layer (internal/services/api_client.go):
The services layer extracts all error details from API responses:
type NotificationChannelTestResult struct {
Success bool
ErrorMessage *string
TestedAt *time.Time
ValidationErrors []string
// Connectivity check results
ConnectivityStatus *int
ConnectivityError *string
ConnectivityLatency *int64
// Sample delivery results
DeliveryStatus *int
DeliveryError *string
DeliveryLatency *int64
ResponsePreview *string
}
Handler Formatter (internal/handlers/notification_channels.go):
Handlers format user-friendly error messages from detailed results:
func formatTestErrorMessage(result *NotificationChannelTestResult) string {
var parts []string
// Check validation errors first
if len(result.ValidationErrors) > 0 {
parts = append(parts, "Validation errors: "+result.ValidationErrors[0])
}
// Check delivery issues (most common)
if result.DeliveryStatus != nil && *result.DeliveryStatus >= 400 {
msg := "Delivery failed with HTTP " + strconv.Itoa(*result.DeliveryStatus)
if result.ResponsePreview != nil {
msg += " - " + truncate(*result.ResponsePreview, 100)
}
parts = append(parts, msg)
}
// URL-encode for redirect
return urlEncode(strings.Join(parts, "; "))
}
UI Display:
Errors are passed via URL query parameters and displayed in the page template:
/notification-channels/1?error=Delivery+failed+with+HTTP+400+-+%7B%22message%22%3A%22Invalid+payload%22%7D
The template renders this as a styled error banner showing:
Delivery failed with HTTP 400 - {"message":"Invalid payload"}
Key Principles:
- Preserve Context: Pass HTTP status codes, response bodies, and specific error types through all layers
- Prioritize Actionable Info: Show validation errors first, then HTTP status, then generic errors
- Truncate for Safety: Limit response previews to prevent URL length issues
- URL-Safe Encoding: Properly encode error messages for query parameter use
Notification Rules:
| Method | Path | Description |
|---|---|---|
| GET | /notification-rules | List org’s rules |
| POST | /notification-rules | Create rule |
| GET | /notification-rules/{id} | Get rule |
| PUT | /notification-rules/{id} | Update rule |
| DELETE | /notification-rules/{id} | Delete rule |
| POST | /notification-rules/{id}/toggle | Toggle active state |
Delivery History:
| Method | Path | Description |
|---|---|---|
| GET | /notification-deliveries | List deliveries (filterable) |
| GET | /notification-deliveries/dead | List dead letters |
| POST | /notification-deliveries/{id}/retry | Retry a dead letter |
| GET | /alerts/{id}/deliveries | Deliveries for specific alert |
9.13 Notifier Service
The notifier binary (service_id=7) is a standalone worker that:
- Polls for pending/failed deliveries ready for retry
- Claims deliveries using FOR UPDATE SKIP LOCKED (distributed locking)
- Sends webhooks to configured URLs
- Updates delivery status based on response
- Runs housekeeping tasks periodically
Scaling: Run multiple replicas for horizontal scaling. Each replica claims different deliveries without coordination.
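The coordination-free claiming relies on PostgreSQL's SKIP LOCKED. A sketch of what such a claim query could look like follows; the table and column names here are illustrative, not the project's actual schema:

```go
package main

import (
	"fmt"
	"strings"
)

// claimQuery sketches the SKIP LOCKED claim pattern. Rows locked by one
// replica are silently skipped by the others, so each poll claims a
// disjoint batch without explicit coordination.
const claimQuery = `
UPDATE notification_deliveries
SET status = 'in_progress'
WHERE id IN (
    SELECT id FROM notification_deliveries
    WHERE status IN ('pending', 'failed')
      AND next_retry_at <= now()
    ORDER BY next_retry_at
    LIMIT $1               -- NOTIFICATION_BATCH_SIZE
    FOR UPDATE SKIP LOCKED -- skip rows another replica already claimed
)
RETURNING id`

// usesSkipLocked is a trivial sanity check used in the example below.
func usesSkipLocked(q string) bool {
	return strings.Contains(q, "FOR UPDATE SKIP LOCKED")
}

func main() {
	fmt.Println(usesSkipLocked(claimQuery)) // true
}
```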
Configuration:
| Variable | Description | Default |
|---|---|---|
| NOTIFICATION_BATCH_SIZE | Deliveries to claim per poll | 100 |
| NOTIFICATION_POLL_INTERVAL | How often to check for work | 5s |
| NOTIFICATION_MAX_RETRIES | Max attempts before dead letter | 12 |
| NOTIFICATION_ACK_TOKEN_EXPIRY | Token expiry duration | 24h |
| NOTIFICATION_BASE_URL | Base URL for ack callbacks | - |
10. Multi-Tenancy Model
Files: pkg/api/middleware.go, pkg/api/middleware_auth.go, pkg/storage/queries/gather_jobs.sql, pkg/storage/queries/organization_members.sql
10.1 Authentication Methods
The system supports two authentication methods:
| Method | Used By | How It Works |
|---|---|---|
| API Key | Agents, machines, InternalUI | X-API-Key header or planekeeper_api_key cookie |
| Supabase JWT | Human users (ClientUI) | Authorization: Bearer header + X-Organization-Id header |
Dual Auth Middleware (pkg/api/middleware_auth.go):
The API server tries JWT auth first, then falls back to API key auth. When Supabase is not configured (no SUPABASE_JWT_SECRET), only API key auth is available.
Incoming Request
│
├── Has "Authorization: Bearer" header?
│ ├── YES → Validate JWT → Lookup user by supabase_id
│ │ → Read X-Organization-Id → Verify membership → Allow
│ └── NO → Fall through
│
├── Has "X-API-Key" header or cookie?
│ ├── YES → Validate API key → Extract org_id → Allow
│ └── NO → 401 Unauthorized
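The fall-through logic can be reduced to a small decision function. This is a sketch of the routing decision only (token validation, user lookup, and membership checks are omitted, and the function name is illustrative):

```go
package main

import (
	"fmt"
	"strings"
)

// authMethod picks the auth path for a request, mirroring the dual-auth
// fall-through above: JWT first, then API key, else 401.
func authMethod(authorization, apiKey string) string {
	if strings.HasPrefix(authorization, "Bearer ") {
		return "jwt" // validate JWT, look up user, verify X-Organization-Id membership
	}
	if apiKey != "" {
		return "api_key" // validate key, derive org from api_keys.organization_id
	}
	return "unauthorized" // 401
}

func main() {
	fmt.Println(authMethod("Bearer eyJ...", "")) // jwt
	fmt.Println(authMethod("", "pk_1_secret"))   // api_key
	fmt.Println(authMethod("", ""))              // unauthorized
}
```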
10.2 Organization Scoping
All API requests are scoped to an organization:
- API Key auth: Organization derived from api_keys.organization_id in the database
- JWT auth: Organization specified by the X-Organization-Id header, validated against the user's membership
organization_id stored in request context (Locals)
│
▼
All queries filter by organization_id
10.3 Multi-Organization Membership
Users can belong to multiple organizations via the organization_members join table.
Schema:
| Table | Key Columns | Purpose |
|---|---|---|
| organization_members | user_id, organization_id, role | Membership records |
| organization_invites | email, organization_id, role, token | Pending invitations |
Roles (org_role enum):
| Role | Capabilities |
|---|---|
| owner | Full control, manage members |
| admin | Manage resources, invite members |
| member | Read/write org resources |
Membership Flow:
User signs up (Supabase) → is_approved = FALSE → Pending approval page
│
▼
Admin approves (SQL: UPDATE users SET is_approved = TRUE WHERE email = '...')
│
▼
User logs out and back in → No memberships → Onboarding page
│
├── Create new organization → Owner membership created
│
└── Accept pending invite → Member/admin membership created
10.8 Beta User Approval Gating
Files: pkg/storage/migration/sql/023_user_approval.sql, pkg/auth/middleware.go, internal/handlers/auth.go
New user signups are blocked until an administrator manually approves them. This is enforced via the is_approved column on the users table.
How it works:
| Scenario | is_approved | Behavior |
|---|---|---|
| Existing users (pre-migration) | TRUE (default) | Unaffected |
| System users | TRUE (default) | Unaffected |
| New Supabase signups | FALSE (explicit) | Redirected to pending approval page |
| Admin-approved users | TRUE (manual SQL) | Normal login flow |
Enforcement points (defense-in-depth):
- processLogin (primary): After finding/creating the user, checks is_approved. If false, redirects to /pending-approval instead of checking org membership.
- RequireOnboarded middleware (secondary): Checks session.IsApproved and redirects to /pending-approval. Prevents bypassing via direct URL access.
Session caching: The is_approved value is stored in the encrypted session cookie (SessionData.IsApproved). Users must log out and back in after being approved for the change to take effect.
Admin approval (SQL):
-- Approve a user
UPDATE users SET is_approved = TRUE WHERE email = 'user@example.com';
-- List unapproved users
SELECT id, email, created_at FROM users WHERE is_approved = FALSE AND is_system = FALSE;
10.4 Organization Switching (ClientUI)
Users with multiple org memberships can switch between them. The active organization is stored in an HTTP-only cookie (planekeeper_org).
POST /switch-org (form: org_id)
│
▼
Validate user is a member of target org
│
▼
Update planekeeper_org cookie
│
▼
Redirect to dashboard (now showing new org's data)
The sidebar displays an organization dropdown when the user belongs to multiple orgs.
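The validation step behind POST /switch-org is essentially a membership check. A minimal sketch (the function name is hypothetical; the real handler also reads the form, sets the cookie, and redirects):

```go
package main

import "fmt"

// canSwitchTo reports whether the user may activate the target org:
// the membership check performed before updating the planekeeper_org cookie.
func canSwitchTo(memberships []int64, target int64) bool {
	for _, orgID := range memberships {
		if orgID == target {
			return true
		}
	}
	return false
}

func main() {
	memberships := []int64{10, 42}
	fmt.Println(canSwitchTo(memberships, 42)) // true → update cookie, redirect to dashboard
	fmt.Println(canSwitchTo(memberships, 99)) // false → reject, keep current org
}
```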
10.5 Scope Types
| Scope | organization_id | is_global | Use Case |
|---|---|---|---|
| Organization | Valid (e.g., 123) | false | Tenant-specific resources |
| Global | NULL | true | Shared across all orgs |
10.6 System API Keys
Identification: organization_id = 0 in auth context (NULL in database).
Capabilities:
- Create global resources (is_global = true)
- Access cross-organization data
- Manage system settings
Creation Flow:
if isSystemKey {
orgIDParam = pgtype.Int8{} // NULL
isGlobal = pgtype.Bool{Bool: true, Valid: true}
}
10.7 List Query Scopes
Most list endpoints support a scope query parameter:
| Scope | Query Filter |
|---|---|
| org | WHERE organization_id = @org_id AND is_global = FALSE |
| global | WHERE is_global = TRUE |
| all (default) | WHERE organization_id = @org_id OR is_global = TRUE |
11. Agent Communication
Files: pkg/agent/worker.go, pkg/api/agents_handler.go
Heartbeat Protocol
Endpoint: POST /heartbeat/{AgentUUID}
Request Body (optional):
{
"capabilities": ["gather", "scrape", "helm_sync"],
"available_credentials": ["github_token", "ssh_key"]
}
Response:
{
"poll_interval_seconds": 30,
"rate_limit_max_requests": 100,
"rate_limit_window_seconds": 60
}
Server-Side:
- Upserts agent into the service_instances table
- Stores metadata (build_date, capabilities, credentials) as JSON
- Updates the last_heartbeat timestamp
Capability Declaration
Agents declare supported job types during heartbeat:
| Capability | Job Type |
|---|---|
| gather | Fetch upstream releases |
| scrape | Extract deployed versions |
| helm_sync | Discover Helm charts |
Credential Declaration
Agents declare available credentials for credential-aware job assignment:
// Worker.GetAvailableCredentials()
availableCredentials := w.GetAvailableCredentials()
task, err := w.client.PollTaskWithCredentials(ctx, jobTypes, availableCredentials)
Orphan Cleanup
Service: OrphanCleanupService (runs every 2 minutes)
Logic:
- Find jobs claimed by agents not in service_instances
- Reset those jobs to pending status
- Additional: Reset jobs stuck in in_progress for >1 hour (stale detection)
-- ResetOrphanedGatherJobs
UPDATE gather_jobs
SET status = 'pending', claimed_by = NULL
WHERE claimed_by NOT IN (
SELECT instance_uuid FROM service_instances WHERE service_id = 2
)
12. Metrics API
Files: pkg/api/metrics_handler.go, pkg/storage/queries/system_metrics.sql
Purpose
Expose system-wide operational metrics for monitoring and observability. The metrics endpoint provides a comprehensive view of system health across all organizations, including job status, alert state, service health, and task execution performance. Supports both JSON format for API consumers and Prometheus text format for integration with monitoring systems.
Endpoint
GET /api/v1/internal/metrics
Output Format
| Request | Response Format | Content-Type |
|---|---|---|
| /metrics (default) | Prometheus exposition format | text/plain; version=0.0.4; charset=utf-8 |
| /metrics?format=json | JSON | application/json |
Default: Prometheus text format for easy integration with Prometheus scrapers.
Use ?format=json query parameter for programmatic access via JSON.
Authentication
No authentication required. This endpoint is only exposed via the internal Traefik reverse proxy (port 8443), which is restricted to trusted IPs by the hosting provider’s firewall. Security is provided by network-level access control rather than API key authentication.
Metric Categories
Organization Metrics
| Metric | Description |
|---|---|
| organizations.total | Total number of organizations |
| organizations.active | Number of active organizations |
Service Instance Metrics
| Metric | Description |
|---|---|
| services[].service_name | Name of the service (server, agent, taskengine, etc.) |
| services[].total | Total instances of this service |
| services[].healthy | Instances with heartbeat in last 5 minutes |
| services[].unhealthy | Instances without recent heartbeat |
Health Threshold: 5 minutes since last heartbeat.
Agent Metrics
| Metric | Description |
|---|---|
| agents.total | Total registered agents (system-wide) |
| agents.healthy | Agents with heartbeat in last 5 minutes |
| agents.unhealthy | Agents without recent heartbeat |
Job Status Metrics
| Metric | Description |
|---|---|
| jobs.gather | Count of gather jobs by status (pending, in_progress, completed, failed) |
| jobs.scrape | Count of scrape jobs by status (pending, in_progress, completed, failed) |
| jobs.helm_sync | Count of helm sync jobs by status (pending, in_progress, completed, failed) |
Scope: Includes all jobs across all organizations.
Alert Metrics
| Metric | Description |
|---|---|
| alerts.total | Total alert count (system-wide) |
| alerts.unacknowledged | Alerts requiring attention |
| alerts.by_severity.critical | Critical severity alerts (unacknowledged) |
| alerts.by_severity.high | High severity alerts (unacknowledged) |
| alerts.by_severity.moderate | Moderate severity alerts (unacknowledged) |
Release Metrics
| Metric | Description |
|---|---|
| releases.total | Total tracked upstream releases |
| releases.stable | Non-prerelease versions |
| releases.prerelease | Prerelease versions (alpha, beta, rc, etc.) |
| releases.unique_artifacts | Distinct artifact names being tracked |
Task Execution Metrics
| Metric | Description |
|---|---|
| task_executions.total | Total task executions in last 24 hours |
| task_executions.completed | Successful completions |
| task_executions.failed | Failed executions |
| task_executions.in_progress | Currently running |
| task_executions.success_rate | Completion rate (0-1) |
Time Window: Last 24 hours only.
API Key Metrics
| Metric | Description |
|---|---|
| api_keys.total | Total number of API keys |
| api_keys.active | Number of active API keys |
| api_keys.system | Number of system API keys |
Prometheus Metric Names
Metrics follow Prometheus naming conventions and carry no org_id labels (they are system-wide). All use the gauge type except the audit writer counters noted below:
| Prometheus Metric | Labels | Description |
|---|---|---|
| planekeeper_organizations | - | Total organizations |
| planekeeper_organizations_active | - | Active organizations |
| planekeeper_service_instances | service, status | Service instances by type and health |
| planekeeper_agents | - | Total agents |
| planekeeper_agents_healthy | - | Healthy agents |
| planekeeper_agents_unhealthy | - | Unhealthy agents |
| planekeeper_jobs | type, status | Jobs by type and status |
| planekeeper_alerts | - | Total alerts |
| planekeeper_alerts_unacknowledged | - | Unacknowledged alerts |
| planekeeper_alerts_by_severity | severity | Alerts by severity |
| planekeeper_releases | - | Total releases |
| planekeeper_releases_stable | - | Stable releases |
| planekeeper_releases_prerelease | - | Prerelease versions |
| planekeeper_releases_unique_artifacts | - | Unique artifacts |
| planekeeper_task_executions_24h | - | Task executions (24h) |
| planekeeper_task_executions_24h_completed | - | Completed tasks (24h) |
| planekeeper_task_executions_24h_failed | - | Failed tasks (24h) |
| planekeeper_task_executions_24h_in_progress | - | In-progress tasks |
| planekeeper_task_executions_24h_success_rate | - | Success rate |
| planekeeper_api_keys | - | Total API keys |
| planekeeper_api_keys_active | - | Active API keys |
| planekeeper_api_keys_system | - | System API keys |
| planekeeper_audit_writer_events_written_total | - | Total audit trail entries written (counter) |
| planekeeper_audit_writer_persist_errors_total | - | Total audit trail write failures (counter) |
| planekeeper_audit_writer_insert_duration_seconds_total | - | Cumulative audit trail insert duration in seconds (counter) |
Naming Convention: Metrics use gauge type without _total suffix (per Prometheus best practices - _total is reserved for counters). Exception: audit writer metrics use counter type with _total suffix since they track cumulative totals.
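The exposition format is simple enough to render with string formatting. A sketch of the per-gauge output (the function name is illustrative; the real handler may build the response differently):

```go
package main

import "fmt"

// gaugeLines renders one gauge in Prometheus exposition format:
// a # HELP line, a # TYPE line, then the sample (with optional labels).
func gaugeLines(name, help string, value float64, labels string) string {
	metric := name
	if labels != "" {
		metric = fmt.Sprintf("%s{%s}", name, labels)
	}
	return fmt.Sprintf("# HELP %s %s\n# TYPE %s gauge\n%s %g\n", name, help, name, metric, value)
}

func main() {
	fmt.Print(gaugeLines("planekeeper_organizations", "Total number of organizations", 5, ""))
	fmt.Print(gaugeLines("planekeeper_jobs", "Jobs by type and status", 5, `type="gather",status="pending"`))
}
```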
Prometheus Scrape Configuration
scrape_configs:
- job_name: 'planekeeper'
static_configs:
- targets: ['localhost:8443'] # Internal Traefik only
metrics_path: '/api/v1/internal/metrics'
# No authentication required - endpoint is on internal network
# Default output is Prometheus format, no params needed
Important: The metrics endpoint is only accessible via the internal Traefik (port 8443). Ensure this port is restricted to trusted IPs via your hosting provider’s firewall.
Testing the Endpoint
# Prometheus format (default)
curl https://localhost:8443/api/v1/internal/metrics
# JSON format
curl https://localhost:8443/api/v1/internal/metrics?format=json
Example Prometheus Output
# HELP planekeeper_organizations Total number of organizations
# TYPE planekeeper_organizations gauge
planekeeper_organizations 5
# HELP planekeeper_organizations_active Number of active organizations
# TYPE planekeeper_organizations_active gauge
planekeeper_organizations_active 4
# HELP planekeeper_service_instances Service instances by type and status
# TYPE planekeeper_service_instances gauge
planekeeper_service_instances{service="server",status="healthy"} 2
planekeeper_service_instances{service="agent",status="healthy"} 3
planekeeper_service_instances{service="agent",status="unhealthy"} 1
planekeeper_service_instances{service="taskengine",status="healthy"} 1
# HELP planekeeper_agents Total number of registered agents
# TYPE planekeeper_agents gauge
planekeeper_agents 4
# HELP planekeeper_agents_healthy Number of healthy agents with recent heartbeat
# TYPE planekeeper_agents_healthy gauge
planekeeper_agents_healthy 3
# HELP planekeeper_agents_unhealthy Number of unhealthy agents without recent heartbeat
# TYPE planekeeper_agents_unhealthy gauge
planekeeper_agents_unhealthy 1
# HELP planekeeper_jobs Jobs by type and status
# TYPE planekeeper_jobs gauge
planekeeper_jobs{type="gather",status="pending"} 5
planekeeper_jobs{type="gather",status="completed"} 150
planekeeper_jobs{type="scrape",status="pending"} 3
planekeeper_jobs{type="scrape",status="completed"} 200
planekeeper_jobs{type="helm_sync",status="completed"} 10
# HELP planekeeper_alerts Total number of alerts
# TYPE planekeeper_alerts gauge
planekeeper_alerts 50
# HELP planekeeper_alerts_unacknowledged Number of unacknowledged alerts
# TYPE planekeeper_alerts_unacknowledged gauge
planekeeper_alerts_unacknowledged 12
# HELP planekeeper_alerts_by_severity Unacknowledged alerts by severity level
# TYPE planekeeper_alerts_by_severity gauge
planekeeper_alerts_by_severity{severity="critical"} 2
planekeeper_alerts_by_severity{severity="high"} 5
planekeeper_alerts_by_severity{severity="moderate"} 5
# HELP planekeeper_releases Total number of tracked upstream releases
# TYPE planekeeper_releases gauge
planekeeper_releases 500
# HELP planekeeper_releases_stable Number of stable (non-prerelease) releases
# TYPE planekeeper_releases_stable gauge
planekeeper_releases_stable 450
# HELP planekeeper_releases_prerelease Number of prerelease versions
# TYPE planekeeper_releases_prerelease gauge
planekeeper_releases_prerelease 50
# HELP planekeeper_releases_unique_artifacts Number of unique artifact names being tracked
# TYPE planekeeper_releases_unique_artifacts gauge
planekeeper_releases_unique_artifacts 25
# HELP planekeeper_task_executions_24h Task executions in last 24 hours
# TYPE planekeeper_task_executions_24h gauge
planekeeper_task_executions_24h 500
# HELP planekeeper_task_executions_24h_completed Completed task executions in last 24 hours
# TYPE planekeeper_task_executions_24h_completed gauge
planekeeper_task_executions_24h_completed 485
# HELP planekeeper_task_executions_24h_failed Failed task executions in last 24 hours
# TYPE planekeeper_task_executions_24h_failed gauge
planekeeper_task_executions_24h_failed 10
# HELP planekeeper_task_executions_24h_in_progress Task executions currently in progress
# TYPE planekeeper_task_executions_24h_in_progress gauge
planekeeper_task_executions_24h_in_progress 5
# HELP planekeeper_task_executions_24h_success_rate Task execution success rate (0-1)
# TYPE planekeeper_task_executions_24h_success_rate gauge
planekeeper_task_executions_24h_success_rate 0.9700
# HELP planekeeper_api_keys Total number of API keys
# TYPE planekeeper_api_keys gauge
planekeeper_api_keys 10
# HELP planekeeper_api_keys_active Number of active API keys
# TYPE planekeeper_api_keys_active gauge
planekeeper_api_keys_active 8
# HELP planekeeper_api_keys_system Number of system API keys
# TYPE planekeeper_api_keys_system gauge
planekeeper_api_keys_system 2
Example JSON Response
{
"collected_at": "2026-02-04T12:00:00Z",
"organizations": {
"total": 5,
"active": 4
},
"services": [
{
"service_name": "server",
"total": 2,
"healthy": 2,
"unhealthy": 0
},
{
"service_name": "agent",
"total": 4,
"healthy": 3,
"unhealthy": 1
},
{
"service_name": "taskengine",
"total": 1,
"healthy": 1,
"unhealthy": 0
}
],
"agents": {
"total": 4,
"healthy": 3,
"unhealthy": 1
},
"jobs": {
"gather": {
"pending": 5,
"in_progress": 2,
"completed": 150,
"failed": 3
},
"scrape": {
"pending": 3,
"in_progress": 1,
"completed": 200,
"failed": 2
},
"helm_sync": {
"pending": 0,
"in_progress": 0,
"completed": 10,
"failed": 0
}
},
"alerts": {
"total": 50,
"unacknowledged": 12,
"by_severity": {
"critical": 2,
"high": 5,
"moderate": 5
}
},
"releases": {
"total": 500,
"stable": 450,
"prerelease": 50,
"unique_artifacts": 25
},
"task_executions": {
"total": 500,
"completed": 485,
"failed": 10,
"in_progress": 5,
"success_rate": 0.97
},
"api_keys": {
"total": 10,
"active": 8,
"system": 2
}
}
Error Handling
| Status | Condition |
|---|---|
| 200 | Success |
| 500 | Database query failure |
Security
The metrics endpoint is unauthenticated but secured through network isolation:
- Internal Traefik Only: The endpoint is registered on the internal API path (/api/v1/internal/metrics) and only exposed via the internal Traefik reverse proxy on port 8443, which is restricted to trusted IPs by the hosting provider's firewall.
- Not Publicly Routed: The public Traefik (dynamic-public.yml) does not include routing rules for /api/v1/internal/metrics.
- Access Methods:
  - Direct (with firewall rule): curl https://<server-ip>:8443/api/v1/internal/metrics
  - From the same host: curl https://localhost:8443/api/v1/internal/metrics
  - Via SSH tunnel: ssh -L 8443:localhost:8443 user@server then curl https://localhost:8443/api/v1/internal/metrics
  - Via VPN: Route traffic to the internal port
Side Effects
None - this is a read-only endpoint.
Additional URLs
- https://o11y.tools/metricslint/
- https://github.com/prometheus/OpenMetrics/blob/main/specification/OpenMetrics.md
13. Developer Tools
API Documentation (Swagger UI)
The server hosts interactive API documentation interfaces using Swagger UI, with separate documentation for client and internal APIs.
Endpoints:
| Endpoint | Description |
|---|---|
| /api/v1/swagger | Client API Swagger UI (org-scoped endpoints) |
| /api/v1/internal/swagger | Internal API Swagger UI (system/agent endpoints) |
| /api/spec/openapi-client.yaml | Client API OpenAPI specification |
| /api/spec/openapi-internal.yaml | Internal API OpenAPI specification |
| /api/spec/openapi-shared.yaml | Shared paths referenced by both specs |
| /api/spec/openapi.yaml | Combined specification (for codegen) |
| /api/spec/components/* | Shared component schema files |
API Separation:
| API | Base Path | Purpose | Endpoints |
|---|---|---|---|
| Client | /api/v1/client | Organization-scoped operations | Jobs, releases, rules, alerts, dropdowns, org settings |
| Internal | /api/v1/internal | System/agent operations | Heartbeat, tasks, metrics, global settings |
Client API includes:
- Job management (gather, scrape, helm-sync)
- Releases and versions viewing
- Rules and alert management
- Alert configurations
- Dropdown data for UI forms
- Organization-specific settings overrides
Internal API includes:
- Agent heartbeat registration
- Task polling and completion (for agents)
- Prometheus metrics endpoint
- Global settings management
Shared Paths (available in both APIs):
- /gather-jobs/*, /scrape-jobs/*, /helm-sync-jobs/*
- /releases/*, /versions
- /settings (GET only in client, GET+PUT in internal)
- /agents, /validate/regex
Usage:
- Start the server: go run ./app/server
- Navigate to http://localhost:3000/api/v1/swagger for client API docs, or http://localhost:3000/api/v1/internal/swagger for internal API docs
- Click "Authorize" and enter your API key (pk_<id>_<secret>)
- Use "Try it out" on any endpoint to execute requests
Implementation (app/server/main.go):
// Serve API specs for Swagger UI
app.Static("/api/spec", "./api")
// Client API Swagger UI at /api/v1/swagger
app.Static("/api/v1/swagger", "./internal/static/swagger-client")
// Internal API Swagger UI at /api/v1/internal/swagger
app.Static("/api/v1/internal/swagger", "./internal/static/swagger-internal")
OpenAPI Spec Structure:
api/
├── openapi.yaml # Master spec for codegen (references all others)
├── openapi-shared.yaml # Shared paths (no duplication)
├── openapi-client.yaml # Client API (references shared + client-only)
├── openapi-internal.yaml # Internal API (references shared + internal-only)
└── components/ # Shared schemas, parameters, responses
├── schemas.yaml
├── parameters.yaml
├── responses.yaml
└── securitySchemes.yaml
Code Generation:
The bazel run //toolchains/oapi-codegen target bundles openapi.yaml (which references all specs) to generate server handlers and client code. The split specs are used only for Swagger UI documentation.
Files:
- internal/static/swagger-client/index.html - Client API Swagger UI
- internal/static/swagger-internal/index.html - Internal API Swagger UI
- api/openapi.yaml - Master specification for codegen
- api/openapi-client.yaml - Client API specification
- api/openapi-internal.yaml - Internal API specification
- api/openapi-shared.yaml - Shared paths (single source of truth)
- api/components/ - Shared schema, parameter, and response definitions
14. Admin UI
Files: internal/handlers/, internal/templates/, internal/services/api_client.go, internal/middleware/
Purpose
The Admin UI provides server-rendered HTML interfaces for managing all Planekeeper resources. Two UI binaries exist — clientui (organization-scoped, public-facing) and internalui (system-scoped, admin-only). Both consume the REST API through an HTTP client wrapper and render pages using templ templates with Tailwind CSS.
Architecture
┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Browser │────▶│ Handler │────▶│ API Client │────▶│ REST API │
│ │ │ (Fiber) │ │ (services) │ │ (Server) │
└──────────────┘ └──────┬───────┘ └──────────────┘ └──────────────┘
│
┌──────┴───────┐
│ templ │
│ Templates │
│ (pages + │
│ components) │
└──────────────┘
The UI never accesses the database directly. All data flows through the REST API via internal/services/api_client.go.
14.1 Authentication
The two UI binaries use different authentication methods:
| UI | Auth Method | Cookie | Description |
|---|---|---|---|
| ClientUI | Supabase Auth (preferred) or API Key (legacy) | planekeeper_session + planekeeper_org | Human users via email/password or OAuth |
| InternalUI | API Key only | planekeeper_api_key | Admin users via system API key |
When Supabase is not configured (SUPABASE_JWT_SECRET not set), ClientUI falls back to API key login — the same flow as InternalUI.
Supabase Auth (ClientUI)
Login/Signup Options:
- Email/Password: Client-side Supabase JS SDK handles signInWithPassword (login) and signUp (registration), then exchanges tokens with the server via POST /auth/token-exchange
- OAuth (auto-detected): OAuth buttons appear on both login and signup pages. On startup, ClientUI calls GET /auth/v1/settings to discover which providers are enabled in the Supabase project (e.g., GitHub, Google). Only enabled providers are shown. Supabase's signInWithOAuth handles both sign-in and sign-up automatically; users who don't have an account are created on first OAuth login.
- Provider auto-detection: The OAuthProviders []string field on UIConfig is populated at startup and passed to both login and signup page templates. If the settings endpoint is unreachable, OAuth buttons are silently omitted.
Session Cookies:
- planekeeper_session: AES-GCM encrypted blob containing access_token, refresh_token, expiry, user_id, email, supabase_id, is_approved. HTTP-only, 7-day expiry.
- planekeeper_org: Active organization ID (plain int64). Validated against membership on every request.
Login Flow (email/password):
POST /auth/login (email + password)
│
▼
Server calls Supabase /auth/v1/token?grant_type=password
│
▼
Find or create user in DB (by supabase_id or email)
│ (new users created with is_approved = FALSE)
│
▼
Set planekeeper_session cookie (encrypted, includes is_approved)
│
├── User approved (is_approved = TRUE)?
│ ├── NO → Redirect to /pending-approval
│ └── YES ↓
│
├── User has org memberships?
│ ├── YES → Set planekeeper_org cookie → Redirect to /dashboard
│ └── NO → Redirect to /onboarding
OAuth Flow (client-side, via Supabase JS SDK):
User clicks OAuth button on /login or /auth/signup
│
▼
Supabase JS SDK calls signInWithOAuth({ provider, redirectTo })
│
▼
Browser redirects to Supabase /auth/v1/authorize → provider (e.g., GitHub)
│
▼
User authenticates with provider
│
▼
Provider redirects back to Supabase → Supabase redirects to AUTH_CALLBACK_URL
│
▼
GET /auth/callback (tokens in URL hash)
│
▼
Callback page JS extracts tokens → POST /auth/token-exchange
│
▼
Server validates JWT, finds/creates user → Set session → Dashboard or onboarding
Onboarding (first login, no org memberships):
GET /onboarding
│
├── Check pending invites by email
│
▼
Show "Create Organization" form + pending invite list
│
├── POST /onboarding/create-org → Create org + owner membership → Dashboard
└── POST /onboarding/accept-invite/:token → Create membership → Dashboard
Token Refresh: The auth middleware transparently refreshes expired JWTs using the stored refresh token. Updated tokens are re-encrypted and written back to the session cookie.
Middleware (pkg/auth/middleware.go):
- RequireAuth: Validates session cookie, refreshes expired JWT, sets context locals (user_id, email, supabase_id)
- RequireOnboarded: Checks user is approved (redirects to /pending-approval if not), then checks org membership and active org cookie (redirects to /onboarding if not)
API Key Auth (InternalUI + Legacy ClientUI)
Login Flow:
- User navigates to /login (unauthenticated)
- Enters API key in form
- Handler validates key, stores it in the HTTP-only cookie planekeeper_api_key (24-hour expiry)
- All subsequent requests include the cookie automatically
Middleware (internal/middleware/api_key.go):
- Checks the X-API-Key header first, then the planekeeper_api_key cookie
- On failure: redirects browser requests to the login page, returns 401 for API requests
- ClientUI requires organization-scoped keys
- InternalUI requires system-scoped keys (organization_id = 0)
API Client Construction
All handlers use a shared helper to construct API clients from the current auth context:
getAPIClient(c, cfg)
│
├── Has Supabase session? → NewAPIClientWithJWT(accessToken, orgID)
│ (uses Authorization: Bearer + X-Organization-Id headers)
│
└── Has API key? → NewAPIClient(apiKey)
(uses X-API-Key header)
This dual-path helper allows all handlers to work identically regardless of auth method.
14.2 Navigation Structure
Client UI (organization-scoped resources):
| Section | Page | Route |
|---|---|---|
| Overview | Dashboard | /dashboard |
| Jobs | Gather Jobs | /jobs |
| Jobs | Scrape Jobs | /scrape-jobs |
| Data | Releases | /releases |
| Rules | Monitoring Rules | /rules |
| Rules | Alert Configs | /alert-configs |
| Monitoring | Alerts | /alerts |
| Notifications | Channels | /notification-channels |
| Notifications | Rules | /notification-rules |
| Notifications | Settings | /notification-settings |
| Notifications | Deliveries | /notification-deliveries |
Internal UI (global/system resources):
| Section | Page | Route |
|---|---|---|
| Jobs | Gather Jobs | /jobs |
| System | Agents | /agents |
| System | Settings | /settings |
14.3 Page Patterns
Every list page follows a consistent pattern:
┌─────────────────────────────────────────┐
│ Title [Create] btn │
├─────────────────────────────────────────┤
│ Success/Error banner (from query param)│
├─────────────────────────────────────────┤
│ Inline create form (if ?new=true) │
├─────────────────────────────────────────┤
│ ┌─────────────────────────────────┐ │
│ │ Table with headers │ │
│ │ Row 1 | Row 2 | Actions │ │
│ │ ... │ │
│ │ "No items found" (if empty) │ │
│ ├─────────────────────────────────┤ │
│ │ Pagination (Prev | Next) │ │
│ └─────────────────────────────────┘ │
└─────────────────────────────────────────┘
Every detail/edit page follows:
┌─────────────────────────────────────────┐
│ ← Back to List Page Title │
├─────────────────────────────────────────┤
│ Success/Error banner │
├─────────────────────────────────────────┤
│ Edit form with fields │
│ [Cancel] [Save Changes] │
├─────────────────────────────────────────┤
│ Danger Zone: [Delete] │
├─────────────────────────────────────────┤
│ Created: YYYY-MM-DD | Updated: ... │
└─────────────────────────────────────────┘
14.4 Form Handling Flow
Standard CRUD Flow:
GET /resource?new=true → Render list page with inline create form
POST /resource → Parse form, call API, redirect with ?success= or ?error=
GET /resource/{id} → Render detail/edit page
POST /resource/{id} → Parse form, call API, redirect with result
POST /resource/{id}/delete → Call API, redirect to list with result
POST /resource/{id}/toggle → Toggle active state (HTMX or redirect)
Error Message Flow:
API Response (JSON) → Services Layer (extract fields) → Handler (format message) → URL redirect (?error=encoded_msg)
Errors are passed as URL query parameters for stateless handling across redirects:
/notification-channels/1?error=Delivery+failed+with+HTTP+400
/rules?success=Rule+created+successfully
Error Sanitization (internal/handlers/form_helpers.go):
- Internal errors (connection refused, context deadline, database constraint violations) are mapped to generic user-friendly messages
- Message length capped at 100 characters
- Prevents leaking sensitive database/system details to the UI
Graceful Degradation: When API calls fail, pages render with empty data rather than showing error pages. The dashboard, for example, renders empty metrics if GetDashboardStats fails.
14.5 HTMX Integration
Several pages use HTMX for partial page updates without full reloads:
| Feature | Pattern | Pages |
|---|---|---|
| Active toggle | hx-post to toggle endpoint, hx-target replaces row, hx-swap="outerHTML" | Rules, alert configs, notification channels, notification rules |
| Settings edit | hx-get loads inline edit form, hx-put saves, hx-target replaces cell | Settings |
| Alert acknowledge | hx-post to acknowledge endpoint, replaces alert row | Alerts |
Toggle Handler Logic:
if request has "HX-Request" header:
→ render and return updated table row only (outerHTML swap)
else:
→ redirect to list page with success message
14.6 Scope Filtering
The dashboard and jobs pages support multi-tenant scope filtering via a dropdown:
| Scope | Label | Behavior |
|---|---|---|
| org | Organization | Shows only the current organization's resources |
| global | Global | Shows only globally-shared resources |
| all | All | Shows both organization and global resources |
Default Scopes:
- Dashboard defaults to org (most relevant view for operators)
- Jobs page defaults to org (consistent with dashboard; operators can switch to "all" to see global jobs)
The scope parameter is passed through to the API’s list endpoints which apply the corresponding SQL filter (see Section 10: Multi-Tenancy Model).
14.7 Pagination
List pages support offset-based pagination via query parameters:
| Parameter | Default | Validation |
|---|---|---|
| limit | 50 | Must be 1-100 |
| offset | 0 | Must be ≥ 0 |
The pagination component renders Previous/Next links and a “Showing X-Y” counter. It determines whether a “Next” link is needed by checking if itemCount == limit (indicating more items may exist).
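The offset arithmetic and the itemCount == limit heuristic can be sketched as follows (function name is illustrative):

```go
package main

import "fmt"

// paginate computes the "Showing X-Y" range and whether a Next link is
// rendered. As described above, Next appears whenever a full page came
// back (itemCount == limit), which only suggests more items may exist.
func paginate(offset, limit, itemCount int) (from, to int, hasNext bool) {
	from = offset + 1
	to = offset + itemCount
	if itemCount == 0 {
		from = 0 // empty page: show "Showing 0-0"
	}
	return from, to, itemCount == limit
}

func main() {
	from, to, next := paginate(50, 50, 50) // second full page of 50
	fmt.Printf("Showing %d-%d, next=%v\n", from, to, next)
}
```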
14.8 Shared Component Library
Reusable templ components in internal/templates/components/ enforce consistent UI patterns across all pages:
| Component | Purpose | Business Logic |
|---|---|---|
| ScopeFilter | Scope dropdown + optional search | Renders All/Organization/Global options, preserves current selection |
| ActiveToggle | HTMX toggle badge | Green “Active” / gray “Inactive”, posts to toggle endpoint |
| ActionCell | Edit + Delete buttons | Edit link + delete form with confirmation dialog |
| EmptyRow | Empty state message | Spans configurable number of table columns |
| FormButtons | Cancel + Submit pair | Cancel links back to list, submit posts form |
| DetailPageHeader | Back link + title | Consistent navigation on detail pages |
| Timestamps | Created/Updated footer | Formats as YYYY-MM-DD HH:MM:SS |
| FormCard | Card wrapper with title | White card with close button, uses slot pattern |
| Pagination | Page navigation | Previous/Next with offset arithmetic |
| MetricCard | Dashboard stat card | Large number with label and icon |
| StatusBadge | Job status indicator | Color-coded: pending=amber, in_progress=blue, completed=green, failed=red |
| SeverityBadge | Alert severity indicator | critical=red, high=orange, moderate=yellow |
| RuleTypeBadge | Rule type indicator | days_behind=purple, majors_behind=indigo, minors_behind=cyan |
| HealthBadge | Agent health indicator | green “Healthy” / red “Unhealthy” |
| ChannelTypeBadge | Channel type indicator | webhook=purple, pagerduty=green, telegram=blue, smtp=yellow |
| DeliveryStatusBadge | Delivery status indicator | pending=yellow, succeeded=green, failed=red, dead_letter=dark |
| EventTypeBadge | Alert event indicator | created=blue, escalated=orange, acknowledged=green, resolved=purple |
| BoolBadge | Generic yes/no indicator | Configurable true/false labels with green/gray colors |
| ErrorMessage | Error banner | Red banner, renders only when message is non-empty |
| SuccessMessage | Success banner | Green banner, renders only when message is non-empty |
14.9 Page-Specific Business Logic
Dashboard: Aggregates job statistics (total, pending, completed) and system health into four metric cards. Displays recent gather jobs in a table sorted by creation date.
Gather Jobs: Form validation includes name, artifact name, source type, cron schedule, tag filter, and version regex. Supports “Run Now” to trigger immediate execution and “Clear Releases” to purge cached upstream data.
Scrape Jobs: Displays the latest version snapshot inline with each job row. Form includes regex validation endpoint. Version history page shows historical snapshots with configurable limit (1-20).
Releases: Two view modes — flat list and grouped by artifact. Supports filters for artifact name (autocomplete from known artifacts), version text, sort order (newest/oldest first), and prerelease inclusion toggle. Summary bar shows total count, unique artifacts, and stable release count.
Rules: Three rule types with tiered thresholds (moderate ≤ high ≤ critical). Threshold values are contextual — “days” for days_behind, “versions” for majors/minors_behind. “Evaluate All” button triggers rule evaluation across all active alert configs.
Alert Configs: Links three resources (scrape job + gather job + rule) into a monitoring configuration. Form dropdowns are populated dynamically from available resources. Displays rule type badge alongside rule name.
Alerts: Filters by acknowledgment status and severity. Summary panel shows counts by severity. Supports single and bulk acknowledgment. Table rows are color-coded by severity (red border for critical, orange for high, yellow for moderate).
Notification Channels: Channel detail page includes “Test Channel” button that sends a test webhook and displays detailed results (HTTP status, response preview, latency). Event-specific template editor with toggle to enable/disable, showing inherited global defaults as collapsible previews.
Notification Settings: Organization-level default channel selection. Per-category template management (new_alert, acknowledged, resolved) with “Reset to Global” option for organization overrides.
Settings: Combined view showing global defaults alongside organization overrides. Inline HTMX editing — click “Set Override” to enter edit mode, “Clear” to remove override and fall back to global default.
15. Open Questions & Ambiguities
1. Retry Exhaustion Recovery
Issue: When a job reaches max_attempts and enters failed status, there is no automatic recovery mechanism.
Current Behavior: Job remains in failed status indefinitely until manual intervention.
Possible Solutions:
- Manual trigger via the TriggerJob endpoint
- Admin UI “retry” button
- Automatic reset after configurable cooldown period
2. Token Expiry vs Stale Reset Overlap
Issue: Execution tokens expire after ~5 minutes (configurable), but stale job reset happens after 1 hour.
Scenario:
- Agent claims job, gets 5-minute token
- Agent crashes at minute 3
- Token expires at minute 5
- Job remains in_progress until minute 60 (stale reset)
Question: Should there be intermediate recovery (e.g., 15-minute token expiry detection)?
3. Global Jobs Organization Association
Issue: Global jobs (is_global = true) have organization_id = NULL, but releases created by these jobs use organization_id = 1 (Global org).
Implication: Query logic must account for both NULL and 1 when listing global releases.
4. Minors Behind Formula Discrepancy
Issue: The formula-based calculation for minors_behind may differ from release-list counting.
Example:
- Formula: 6.11 → 8.1 = (8-6) + 1 = 3 minors behind
- Release-list might show: 7.0, 7.1, 7.2, 8.0, 8.1 = 5 minors behind
Question: Should formula fallback be documented as “approximate” in alerts?
5. Alert History Retention
Current Behavior: Resolved alerts are soft-deleted and preserved indefinitely.
Potential Improvements:
- Add configurable retention period (e.g., 90 days)
- Add a PurgeOldResolvedAlerts cleanup job to taskengine
- Consider archival to a separate table for very old alerts
Note: The PurgeOldResolvedAlerts query exists but is not currently called by any scheduled job.
16. Regression Test Recommendations
Gather Jobs Domain
| Test Case | Description |
|---|---|
TestGatherJob_GitHub_RateLimitHandling | Verify proper error message and retry behavior on 403/429 |
TestGatherJob_GitHub_Pagination | Ensure all 1000 releases fetched across 10 pages |
TestGatherJob_Helm_LargeIndex | Verify 50MB limit enforced, graceful error |
TestGatherJob_VersionRegex_CaptureGroup | Confirm capture group extraction vs full match |
TestGatherJob_StateTransition_MaxAttempts | Verify pending→failed after max_attempts |
TestGatherJob_Reschedule_CronExpression | Verify next_run_at calculation accuracy |
TestGatherJob_OrphanRecovery | Jobs released when agent disconnects |
Scrape Jobs Domain
| Test Case | Description |
|---|---|
TestScrapeJob_CredentialAssignment | Only agents with credential receive job |
TestScrapeJob_Parser_YQ_ArrayIndexing | .dependencies[0].version works |
TestScrapeJob_Parser_Regex_NoMatch | Graceful error when pattern doesn’t match |
TestScrapeJob_VersionTransform_All | All 5 transforms work correctly |
TestScrapeJob_HistoryLimit_Cleanup | Old snapshots deleted when limit exceeded |
TestScrapeJob_TriggerRuleEvaluation | Async rule evaluation triggered on success |
Task Execution System
| Test Case | Description |
|---|---|
TestTask_SkipLocked_ConcurrentClaim | Two agents don’t get same job |
TestTask_TokenExpiry_ReturnsConflict | 409 returned for expired token |
TestTask_IdempotentCompletion | Same token submitted twice returns 202 |
TestTask_ResultProcessingFailure | Agent gets 202 even if processing fails |
TestTask_CapabilityFiltering | Agent without capability doesn’t receive job type |
Rules Engine
| Test Case | Description |
|---|---|
TestRule_DaysBehind_VersionNotFound | CRITICAL with -1 behindBy |
TestRule_MajorsBehind_VersionParseFail | CRITICAL on semver error |
TestRule_MinorsBehind_ReleaseListVsFormula | Both methods produce reasonable results |
TestRule_StableOnly_SkipsPrerelease | alpha/beta/rc excluded from latest |
TestRule_ThresholdTiers_HighestWins | Critical returned when >= critical threshold |
Alert System
| Test Case | Description |
|---|---|
TestAlert_OnePerConfig | Only one active alert per config allowed |
TestAlert_UpdateInPlace | Version change updates existing alert |
TestAlert_AckResetOnVersionChange | Ack cleared only when discovered version changes |
TestAlert_AckPreservedOnSameVersion | Ack preserved when same version re-evaluated |
TestAlert_SoftDelete_SetsResolvedAt | Resolution sets resolved_at, doesn’t delete |
TestAlert_SoftDelete_PreservesHistory | Resolved alerts accessible via /alerts/resolved |
TestAlert_SoftDelete_NewAlertAfterResolve | New alert can be created after previous resolved |
TestAlert_AutoTrigger_OnConfigCreate | Evaluation runs after config creation |
TestAlert_AutoTrigger_OnScrapeSuccess | Evaluation runs after scrape completes |
TestAlert_AutoTrigger_OnGatherSuccess | Evaluation runs after gather completes |
TestAlert_Resolution_NotifiesWebhook | alert.resolved event dispatched on resolve |
TestAlert_BulkAcknowledge_OnlyActive | Bulk ack only affects active alerts |
TestAlert_ListResolved_Pagination | /alerts/resolved supports limit/offset |
TestAlert_ListResolved_SeverityFilter | /alerts/resolved filters by severity |
Notification System
| Test Case | Description |
|---|---|
TestNotification_RuleMatching_SeverityFilter | Rule with severity filter only matches specified severities |
TestNotification_RuleMatching_EventFilter | Rule with event filter only matches specified events |
TestNotification_RuleMatching_CatchAll | Empty filters match all severities and events |
TestNotification_RuleMatching_Priority | Higher priority rules evaluated first |
TestNotification_Dispatch_ChannelDedup | Same alert doesn’t notify same channel twice |
TestNotification_Dispatch_OrgDefault | Falls back to org default when no rule matches |
TestNotification_Dispatch_RepeatInterval | Skips notification within repeat interval |
TestNotification_Delivery_SkipLocked | Multiple notifiers don’t claim same delivery |
TestNotification_Delivery_StatusTransitions | Correct state machine: pending→in_progress→succeeded/failed |
TestNotification_Retry_ExponentialBackoff | Delays increase with attempts |
TestNotification_Retry_RetryAfterHeader | Honors 429 Retry-After header |
TestNotification_Retry_DeadLetter | Moves to dead_letter after max attempts |
TestNotification_Retry_NonRetryable4xx | 4xx (except 429) goes to dead_letter immediately |
TestNotification_Webhook_HMACSignature | Correct HMAC-SHA256 signature in header |
TestNotification_Webhook_IdempotencyKey | Same key across retries |
TestNotification_Webhook_SSRFProtection | Private IPs blocked by default |
TestNotification_Ack_TokenValidation | Token lookup, expiry, and single-use |
TestNotification_Ack_AlertUpdate | Alert marked acknowledged via callback |
TestNotification_Housekeeping_ExpireTokens | Expired tokens deleted |
TestNotification_Housekeeping_CleanupDeliveries | Stuck deliveries moved to dead_letter |
Multi-Tenancy
| Test Case | Description |
|---|---|
TestTenancy_OrgIsolation | Org A can’t see Org B resources |
TestTenancy_GlobalVisibility | All orgs see global resources |
TestTenancy_SystemKeyCreatesGlobal | System key creates is_global=true |
TestTenancy_ScopeParameter | org/global/all filters work correctly |
Agent Communication
| Test Case | Description |
|---|---|
TestAgent_Heartbeat_UpdatesLastSeen | Timestamp updated on heartbeat |
TestAgent_Heartbeat_StoresCapabilities | Metadata includes capabilities JSON |
TestAgent_OrphanCleanup_DisconnectedAgent | Jobs reset when agent removed |
TestAgent_OrphanCleanup_StartupDelay | No cleanup in first 30 seconds |
Metrics API
The metrics API should comply with the OpenMetrics standard. See: https://github.com/prometheus/OpenMetrics/blob/main/specification/OpenMetrics.md
| Test Case | Description |
|---|---|
TestMetrics_ContentNegotiation_JSON | Default Accept header returns JSON |
TestMetrics_ContentNegotiation_Prometheus | Accept: text/plain returns Prometheus format |
TestMetrics_NoAuthRequired | Endpoint accessible without API key |
TestMetrics_SystemWideOrganizationCounts | Returns total and active org counts |
TestMetrics_ServiceInstancesByType | Returns service instances grouped by type |
TestMetrics_JobCountsAllTypes | Includes gather, scrape, and helm_sync jobs |
TestMetrics_AgentHealth_5MinuteThreshold | Agents without heartbeat in 5 min marked unhealthy |
TestMetrics_TaskExecutions_24HourWindow | Only includes executions from last 24 hours |
TestMetrics_TaskExecutions_SuccessRate | Calculates success rate correctly |
TestMetrics_APIKeyCounts | Returns total, active, and system key counts |
TestMetrics_PrometheusFormat_NoOrgLabels | Prometheus output has no org_id labels |
TestMetrics_PrometheusFormat_HelpAndType | All metrics have HELP and TYPE declarations |
Admin UI
| Test Case | Description |
|---|---|
TestUI_Login_ValidKey | Valid API key sets cookie and redirects to dashboard |
TestUI_Login_InvalidKey | Invalid key shows error on login page |
TestUI_Logout_ClearsCookie | Logout clears auth cookie and redirects to login |
TestUI_Dashboard_DefaultScope | Dashboard defaults to org scope |
TestUI_Dashboard_ScopeFilter | Scope parameter filters jobs correctly |
TestUI_Dashboard_GracefulDegradation | Dashboard renders with empty data when API fails |
TestUI_ListPage_Pagination | Limit/offset query params paginate results |
TestUI_ListPage_EmptyState | Empty table shows “No items found” message |
TestUI_CreateForm_Validation | Missing required fields redirect with error message |
TestUI_CreateForm_Success | Valid form redirects with success message |
TestUI_Toggle_HTMX | Toggle with HX-Request header returns updated row only |
TestUI_Toggle_NonHTMX | Toggle without HTMX redirects to list page |
TestUI_Delete_Confirmation | Delete form includes confirmation dialog |
TestUI_ErrorSanitization | Internal errors mapped to generic user-friendly messages |
Appendix: Key SQL Functions Reference
Gather Jobs
| Function | Purpose |
|---|---|
CreateGatherJob | Insert new job with defaults |
ClaimNextPendingGatherJob | Atomic claim with SKIP LOCKED |
CompleteGatherJob | Mark successful completion |
FailGatherJob | Increment attempts, possibly transition to failed |
RescheduleGatherJob | Calculate and set next_run_at |
ResetStaleGatherJobs | Reset jobs in_progress > 1 hour |
ResetOrphanedGatherJobs | Reset jobs claimed by dead agents |
Scrape Jobs
| Function | Purpose |
|---|---|
CreateScrapeJob | Insert new job with defaults |
ClaimNextPendingScrapeJobWithCredentials | Credential-aware atomic claim |
CompleteScrapeJob | Mark successful completion |
FailScrapeJob | Increment attempts, possibly transition to failed |
RescheduleScrapeJob | Calculate and set next_run_at |
Version Snapshots
| Function | Purpose |
|---|---|
CreateVersionSnapshot | Insert new snapshot (no ON CONFLICT - full history) |
GetVersionSnapshot | Get snapshot by ID |
GetVersionSnapshotByScrapeJob | Get latest snapshot for job (ORDER BY id DESC) |
GetLatestVersionSnapshot | Get latest version for rules evaluation (ORDER BY id DESC) |
ListVersionHistoryByScrapeJob | List version history with limit |
CountVersionSnapshotsByScrapeJob | Count snapshots for history display |
DeleteOldVersionSnapshots | Purge snapshots beyond retention limit |
Note: All “latest” queries use ORDER BY id DESC instead of ORDER BY discovered_at DESC to guarantee insertion order regardless of timestamp precision.
Alerts
| Function | Purpose |
|---|---|
UpsertAlert | Create or update active alert (one per config) |
GetAlertByID | Get alert with joined config/rule data |
GetAlertByConfigID | Get active alert for specific config |
ListAlertsByOrganization | List active alerts with filters |
CountAlertsByOrganization | Count active alerts with filters |
ListResolvedAlerts | List resolved (historical) alerts |
CountResolvedAlerts | Count resolved alerts |
AcknowledgeAlert | Mark active alert as acknowledged |
BulkAcknowledgeAlerts | Batch acknowledge by ID array |
UnacknowledgeAlert | Clear acknowledgement on active alert |
ResolveAlert | Soft delete: set resolved_at timestamp |
ResolveAlertsForConfig | Soft delete all active alerts for config |
GetAlertSummary | Count active alerts by severity and ack status |
PurgeOldResolvedAlerts | Hard delete resolved alerts older than N days |
Notification Channels
| Function | Purpose |
|---|---|
CreateNotificationChannel | Insert new channel with config |
GetNotificationChannel | Get channel by ID with org check |
ListNotificationChannels | List channels for organization |
UpdateNotificationChannel | Update channel settings |
DeleteNotificationChannel | Remove channel |
ToggleNotificationChannel | Toggle active state |
Notification Rules
| Function | Purpose |
|---|---|
CreateNotificationRule | Insert new rule with filters |
GetNotificationRule | Get rule by ID with org check |
ListNotificationRules | List rules for organization (ordered by priority) |
UpdateNotificationRule | Update rule settings |
DeleteNotificationRule | Remove rule |
ToggleNotificationRule | Toggle active state |
GetMatchingRulesForAlert | Get rules matching severity and event type |
Notification Deliveries
| Function | Purpose |
|---|---|
CreateNotificationDelivery | Insert new delivery record |
ClaimPendingDeliveries | Atomic claim with SKIP LOCKED |
MarkDeliverySucceeded | Update status to succeeded |
MarkDeliveryFailed | Update status and schedule retry |
ListDeliveriesByOrganization | List deliveries with filters |
ListDeliveriesByAlert | List deliveries for specific alert |
ListDeadLetters | List dead letter deliveries |
ResetDeliveryForRetry | Reset dead letter for retry |
RecentSuccessInGroup | Check for recent success in group interval |
Notification Ack Tokens
| Function | Purpose |
|---|---|
CreateAckToken | Insert new acknowledgment token |
GetAckTokenByToken | Lookup token by value |
MarkAckTokenUsed | Mark token as used |
CleanupExpiredAckTokens | Delete expired tokens |
System Metrics
| Function | Purpose |
|---|---|
GetSystemMetricsOrganizationCounts | Total and active organization counts |
GetSystemMetricsServiceInstances | Service instances by type with health status |
GetSystemMetricsAgentCounts | Agent counts with health status (system-wide) |
GetSystemMetricsGatherJobCounts | Gather job counts by status (system-wide) |
GetSystemMetricsScrapeJobCounts | Scrape job counts by status (system-wide) |
GetSystemMetricsHelmSyncJobCounts | Helm sync job counts by status (system-wide) |
GetSystemMetricsAlertSummary | Alert counts with severity breakdown (system-wide) |
GetSystemMetricsReleaseSummary | Release counts (system-wide) |
GetSystemMetricsTaskExecutions | Task execution stats for last 24 hours (system-wide) |
GetSystemMetricsAPIKeyCounts | API key counts (total, active, system) |
Document generated: 2026-02-05 Source: Planekeeper codebase analysis