SilentFail · Job Monitoring Docs

Instrument jobs in minutes

How to send pings, understand states, avoid duplicate alerts, and verify your monitoring is reliable.

Mental model

Each job has a unique token. Send HTTP pings to mark the job's lifecycle: start, success, and fail. A monitoring worker runs every 60s, detects timeouts and missed executions (stale jobs), and fires one alert per incident.

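  • RUNNING: a start ping was received and no success or fail has followed yet.
  • OK: a success ping arrived within the expected interval.
  • FAILED: an explicit fail ping, or a timeout (started, but no success within timeout_minutes).
  • STALE: no success within the expected interval, and the job is not running.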
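
The state rules above can be sketched as a small pure function. This is one illustrative reading of the rules, not SilentFail's actual implementation: the JobPings structure, the derive_state name, and the order in which the checks run are all assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class JobPings:
    """Most recent ping times for one job (hypothetical structure)."""
    last_start: Optional[datetime] = None
    last_success: Optional[datetime] = None
    last_fail: Optional[datetime] = None


def derive_state(p: JobPings, now: datetime,
                 expected_interval: timedelta, timeout: timedelta) -> str:
    """Map the latest pings to RUNNING, OK, FAILED, or STALE."""
    # The job counts as running if the latest start is newer than
    # any success or fail ping.
    latest_end = max(
        (t for t in (p.last_success, p.last_fail) if t is not None),
        default=None,
    )
    running = p.last_start is not None and (
        latest_end is None or p.last_start > latest_end
    )
    if running:
        # Timeout: started, but no success within the timeout window.
        return "FAILED" if now - p.last_start > timeout else "RUNNING"
    # Explicit fail that is newer than the last success.
    if p.last_fail is not None and (
        p.last_success is None or p.last_fail > p.last_success
    ):
        return "FAILED"
    if p.last_success is not None and now - p.last_success <= expected_interval:
        return "OK"
    # Assumption: a job with no recent success (or no pings at all) is STALE.
    return "STALE"
```

For example, a job that sent start five minutes ago with a ten-minute timeout evaluates to RUNNING; once the ten minutes pass without a success, the same inputs evaluate to FAILED.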

Endpoints

Start

POST https://<app>/api/ping/<TOKEN>/start
Headers: none required
Body: empty

Success

POST https://<app>/api/ping/<TOKEN>/success
Body: empty

Fail

POST https://<app>/api/ping/<TOKEN>/fail
Body: optional { "message": "context about the error" }

Tip: use short client-side timeouts (3-5s) so your job never blocks on network hiccups.
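
A minimal sketch of a ping helper that follows the tip above, using only the Python standard library. The function names, the 4-second default, and the JSON Content-Type header on the fail body are assumptions for illustration; the docs above only state the URL shape, the POST method, and the optional message field.

```python
import http.client
import json
import urllib.request
from typing import Optional

EVENTS = ("start", "success", "fail")


def ping_url(app_url: str, token: str, event: str) -> str:
    """Build https://<app>/api/ping/<TOKEN>/<event>."""
    if event not in EVENTS:
        raise ValueError(f"unknown ping event: {event!r}")
    return f"{app_url.rstrip('/')}/api/ping/{token}/{event}"


def send_ping(app_url: str, token: str, event: str,
              message: Optional[str] = None, timeout_s: float = 4.0) -> bool:
    """POST one ping. Returns True on a 2xx response, False on any failure."""
    data = b""  # start and success use an empty body
    headers = {}
    if event == "fail" and message is not None:
        data = json.dumps({"message": message}).encode("utf-8")
        headers["Content-Type"] = "application/json"  # assumed, not documented
    req = urllib.request.Request(
        ping_url(app_url, token, event), data=data, headers=headers, method="POST"
    )
    try:
        # The timeout applies to each blocking socket operation, which keeps
        # a network hiccup from stalling the job for long.
        with urllib.request.urlopen(req, timeout=timeout_s) as resp:
            return 200 <= resp.status < 300
    except (OSError, http.client.HTTPException):
        # Connection errors, timeouts, and non-2xx responses all land here.
        # Monitoring problems are reported, never raised into the job.
        return False
```

Returning a boolean instead of raising means the caller decides whether a lost ping is worth a retry, and the job itself is never interrupted by the monitoring call.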

Best practices
  • Send start at job start, success on completion, and fail on exceptions.
  • Set realistic expected_interval_minutes and timeout_minutes to avoid false positives.
  • Use finite retries with backoff; avoid endless retry loops.
  • Keep the token in environment variables; never commit it to repos.
  • One alert per incident; a subsequent success resets alerts.
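
The practices above can be sketched as a wrapper around a job. Everything named here is hypothetical: send stands in for whatever function posts a ping for your token and returns whether it was delivered, and the retry count and delays are placeholder values.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")
# Hypothetical sender: (event, message) -> True if the ping was delivered.
Send = Callable[[str, Optional[str]], bool]


def send_with_retries(send: Send, event: str, message: Optional[str] = None,
                      attempts: int = 3, base_delay_s: float = 0.5,
                      sleep: Callable[[float], None] = time.sleep) -> bool:
    """Try a ping a finite number of times with exponential backoff."""
    for attempt in range(attempts):
        if send(event, message):
            return True
        if attempt < attempts - 1:
            sleep(base_delay_s * 2 ** attempt)  # 0.5s, 1s, 2s, ...
    return False  # give up; never loop forever


def run_monitored(job: Callable[[], T], send: Send) -> T:
    """Send start, run the job, then send success or fail."""
    send_with_retries(send, "start")
    try:
        result = job()
    except Exception as exc:
        send_with_retries(send, "fail", str(exc))
        raise  # the job's own error handling still sees the exception
    send_with_retries(send, "success")
    return result
```

In a real job, send would wrap an HTTP POST to the ping endpoints, with the token read from an environment variable (for example os.environ["SILENTFAIL_PING_TOKEN"], a made-up name) rather than hard-coded.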
Quick test (smoke)

Use the smoke script with a test token:

export TEST_APP_URL="https://silent-fail.kreatives.io"
export TEST_PING_TOKEN="your_token"  # replace with your test token
pnpm smoke

The script sends start, success, and fail pings so you can see the resulting events and status on the dashboard.

When do alerts fire?
  • FAILED: fail ping or timeout (running beyond timeout_minutes).
  • STALE: no success within expected_interval_minutes and not running.
  • Deduplication: one alert per incident; the alert state resets on the next success.
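
The one-alert-per-incident rule can be illustrated with a small state tracker. This is a sketch of the behavior described above, not the monitoring worker's real code; the class name and the choice to treat a FAILED-to-STALE change as the same incident are assumptions.

```python
class AlertDeduper:
    """Tracks whether the current incident has already been alerted."""

    def __init__(self) -> None:
        self._alerted = False

    def observe(self, state: str) -> bool:
        """Return True if this check should send an alert."""
        if state == "OK":
            # A success closes the incident and re-arms alerting.
            self._alerted = False
            return False
        if state in ("FAILED", "STALE") and not self._alerted:
            self._alerted = True
            return True
        # RUNNING, or a repeat of an incident that was already alerted.
        return False


# One alert for the first FAILED, none for repeats, a new one after recovery.
deduper = AlertDeduper()
states = ["OK", "FAILED", "FAILED", "STALE", "RUNNING", "OK", "STALE"]
alerts = [deduper.observe(s) for s in states]
```

With the worker checking every 60s, a job that stays FAILED for an hour would be observed about sixty times but would produce a single alert under this scheme.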