SilentFail · Job Monitoring Docs

Instrument jobs in minutes

How to send pings, understand states, avoid duplicate alerts, and verify your monitoring is reliable.

Mental model

Each job has a unique token. Send HTTP pings to mark the job's lifecycle events: start, success, and fail. A monitoring worker runs every 60s to detect timeouts and missed executions (stale jobs), and fires one alert per incident.

RUNNING: after start.
OK: after success within expected interval.
FAILED: explicit fail ping, or timeout (still running with no success after timeout_minutes).
STALE: no success within expected interval and not running.
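
The four states can be read as a small decision function. The sketch below is illustrative only and is not SilentFail's server code: the function name, its parameters, and the choice to report a job with no pings at all as STALE are assumptions made for the example.

```python
from datetime import datetime, timedelta
from typing import Optional


def classify(
    now: datetime,
    last_start: Optional[datetime],
    last_success: Optional[datetime],
    last_fail: Optional[datetime],
    expected_interval: timedelta,
    timeout: timedelta,
) -> str:
    """Return RUNNING, OK, FAILED, or STALE for one job (illustrative only)."""
    # Most recent terminal ping (success or fail), if any.
    finished = max(
        (t for t in (last_success, last_fail) if t is not None), default=None
    )

    # A start with no later success/fail means the job is in flight.
    in_flight = last_start is not None and (
        finished is None or last_start > finished
    )
    if in_flight:
        # Running beyond the timeout counts as FAILED.
        return "FAILED" if now - last_start > timeout else "RUNNING"

    # Not running: an explicit fail newer than the last success is FAILED.
    if last_fail is not None and (last_success is None or last_fail > last_success):
        return "FAILED"

    # Not running and no success inside the expected interval.
    # Assumption: a job that has never succeeded is also treated as STALE.
    if last_success is None or now - last_success > expected_interval:
        return "STALE"

    return "OK"
```

With expected_interval set to 60 minutes and timeout to 10 minutes, a job that started 5 minutes ago with no later ping is RUNNING, and the same job with still no ping at 11 minutes is FAILED.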

Endpoints

Start

POST https://<app>/api/ping/<TOKEN>/start
Headers: none required
Body: empty

Success

POST https://<app>/api/ping/<TOKEN>/success
Body: empty

Fail

POST https://<app>/api/ping/<TOKEN>/fail
Body: optional JSON object, e.g. { "message": "context about the error" }

Tip: use short client-side timeouts (3-5s) so your job never blocks on network hiccups.
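
A minimal client sketch for the three endpoints, using only the Python standard library. The helper names (build_ping_request, send_ping) and the Content-Type header on fail pings are assumptions for illustration; the URL shape, the POST method, and the optional message body come from the endpoint list above. The 5s default follows the tip on short client-side timeouts.

```python
import http.client
import json
import urllib.error
import urllib.request
from typing import Optional


def build_ping_request(app_url: str, token: str, event: str,
                       message: Optional[str] = None) -> urllib.request.Request:
    """Build the POST request for a start, success, or fail ping."""
    if event not in ("start", "success", "fail"):
        raise ValueError(f"unknown event: {event}")
    url = f"{app_url.rstrip('/')}/api/ping/{token}/{event}"
    headers = {}
    body = b""  # start and success take an empty body
    if event == "fail" and message is not None:
        # Optional error context; sending it as JSON is an assumption.
        body = json.dumps({"message": message}).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def send_ping(app_url: str, token: str, event: str,
              message: Optional[str] = None, timeout_s: float = 5.0) -> bool:
    """Send one ping; return False instead of raising so the job keeps going."""
    request = build_ping_request(app_url, token, event, message)
    try:
        with urllib.request.urlopen(request, timeout=timeout_s) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Covers HTTP error statuses, DNS failures, refused connections, timeouts.
        return False
```

Returning a boolean rather than raising keeps a monitoring outage from turning into a job failure; the caller decides whether to retry.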

Best practices

Send start at job start, success on completion, and fail on exceptions.
Set realistic expected_interval_minutes and timeout_minutes to avoid false positives.
Use finite retries with backoff; avoid endless retry loops.
Keep the token in environment variables; never commit it to repos.
Expect one alert per incident; the next success ping resets alerting for the job.
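
The practices above can be combined in one small wrapper. This is a sketch under stated assumptions: the send callable stands in for whatever function actually posts a ping, the environment variable name SILENTFAIL_PING_TOKEN is hypothetical, and the retry count and delays are example values, not recommendations from SilentFail.

```python
import os
import time
from typing import Callable, TypeVar

T = TypeVar("T")
# (event, message) -> True if the ping was accepted. Supplied by the caller.
Send = Callable[[str, str], bool]


def token_from_env(var_name: str = "SILENTFAIL_PING_TOKEN") -> str:
    """Read the token from the environment (the variable name is hypothetical)."""
    token = os.environ.get(var_name)
    if not token:
        raise RuntimeError(f"{var_name} is not set")
    return token


def ping_with_retries(send: Send, event: str, message: str = "",
                      attempts: int = 3, base_delay_s: float = 0.5,
                      sleep: Callable[[float], None] = time.sleep) -> bool:
    """Try a ping a bounded number of times with exponential backoff."""
    for attempt in range(attempts):
        if send(event, message):
            return True
        if attempt < attempts - 1:
            sleep(base_delay_s * 2 ** attempt)  # 0.5s, then 1s, ...
    return False  # give up after the last attempt; never loop forever


def monitored(send: Send, job: Callable[[], T], **retry_options) -> T:
    """Run job() between a start ping and a success or fail ping."""
    ping_with_retries(send, "start", **retry_options)
    try:
        result = job()
    except Exception as exc:
        ping_with_retries(send, "fail", str(exc), **retry_options)
        raise  # re-raise so the job's own error handling still applies
    ping_with_retries(send, "success", **retry_options)
    return result
```

Passing send in as a parameter keeps the wrapper independent of any HTTP library and makes it easy to exercise with a fake in tests.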

Quick test (smoke)

Use the smoke script with a test token:

export TEST_APP_URL="https://silent-fail.kreatives.io"
export TEST_PING_TOKEN="your_token"
pnpm smoke

This sends start/success/fail so you can see events and status on the dashboard.

When do alerts fire?

FAILED: fail ping or timeout (running beyond timeout_minutes).
STALE: no success within expected_interval_minutes and not running.
Dedup: one alert per incident; the incident is reset by the next success.
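
The dedup rule can be pictured as a single flag per job. The class below is a hypothetical illustration, not the worker's actual code; in particular, treating a STALE check followed by a FAILED check as the same incident is an assumption.

```python
class AlertDeduper:
    """Tracks whether a job already has an open incident (illustrative only)."""

    ALERTING_STATES = ("FAILED", "STALE")

    def __init__(self) -> None:
        self.incident_open = False

    def should_alert(self, state: str) -> bool:
        """Call on each worker check; True only for the first bad check."""
        if state not in self.ALERTING_STATES:
            return False
        if self.incident_open:
            return False  # already alerted for this incident
        self.incident_open = True
        return True

    def on_success(self) -> None:
        """A success ping closes the incident, so the next failure alerts again."""
        self.incident_open = False
```

With the worker checking every 60s, a job that stays STALE for an hour is checked about 60 times but produces a single alert under this rule.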