
March 23, 2026 · 9 min read

Background job monitoring: a production checklist for scheduled tasks

Every team eventually learns these lessons the hard way — a backup that didn't run, a sync that silently processed nothing, a job stuck for 12 hours. This checklist covers everything you need to monitor background jobs reliably in production.


Background jobs are the infrastructure layer nobody talks about until they break.

Your web requests have error rates, latency dashboards, and uptime monitors. Your APIs have health checks and status pages. But your nightly database backup? Your daily user sync? Your invoice generator? They run in the background, outside your observability stack, with no one watching.

GitLab famously lost six hours of production data in 2017 partly due to backup failures that went undetected. The failure wasn't malicious or complex — a script was running but not completing successfully. Nobody knew.

This is a production checklist for background job monitoring. Go through it for every scheduled task that matters.


The non-negotiables: what every background job needs

1. External missed run detection

Your job must be watched by something outside your process. An in-process scheduler (node-cron, Celery beat, cron daemon) cannot detect its own non-execution.

The minimum requirement: a service that knows your job's schedule and fires an alert if no ping arrives within a grace period.

For Node.js:

import { CrontifyMonitor } from '@crontify/sdk';

const monitor = new CrontifyMonitor({
  apiKey: process.env.CRONTIFY_API_KEY!,
  monitorId: 'your-monitor-id',
});

// wrap() automatically sends start + success/fail pings
await monitor.wrap(async () => {
  await runYourJob();
});

For any other language (bash, Python, Go, Ruby):

BASE="https://api.crontify.com/api/v1/ping/your-monitor-id"
AUTH=(-H "X-API-Key: your-key")   # bash array keeps the header intact through expansion

curl -fsS -X POST "$BASE/start" "${AUTH[@]}"
if your_job_command; then
  curl -fsS -X POST "$BASE/success" "${AUTH[@]}"
else
  curl -fsS -X POST "$BASE/fail" "${AUTH[@]}"
fi

2. Maximum duration threshold

Every job has a maximum reasonable runtime. Set it. A job that normally takes 5 minutes and has been running for 3 hours is hung — it should be alerting, not running.

Configure this per monitor in your monitoring tool. The appropriate threshold is 2–3× the p95 runtime of your job, measured over at least 10 executions.
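Deriving that threshold takes one pass over run history. A sketch (the durations here are invented; yours come from your monitoring tool's run history):

```typescript
// Derive a max-duration threshold from recent runtimes.
function p95(durationsMs: number[]): number {
  const sorted = [...durationsMs].sort((a, b) => a - b);
  return sorted[Math.ceil(sorted.length * 0.95) - 1];
}

// 12 recent runs of a ~5-minute job, in milliseconds
const runs = [295, 301, 290, 310, 305, 298, 320, 300, 292, 315, 307, 299]
  .map((s) => s * 1000);

// 2.5x p95 sits in the middle of the 2-3x guideline
const thresholdMs = Math.round(2.5 * p95(runs));
console.log(thresholdMs); // 800000 ms, i.e. ~13.3 minutes
```
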

3. Failure notification to a channel someone watches

An alert to an email inbox nobody checks is worse than no alert — it creates a false sense of safety. Route critical job failures to Slack, PagerDuty, or whatever channel your team actually responds to.

For less critical jobs: a low-priority Slack channel is fine. For data integrity jobs (backups, billing, financial data): on-call notification.

4. Log attachment on failure

When a job fails at 2am, you want the diagnosis in the alert, not waiting for someone to SSH into a server.

try {
  await runJob();
  await monitor.success();
} catch (err) {
  await monitor.fail({
    message: (err as Error).message,
    log: (err as Error).stack,
  });
  throw err; // still exit non-zero so the scheduler sees the failure too
}

Capture as much context as possible: the error, the stack trace, the last database query, the API response that caused the failure, how many records were processed before the failure.


The commonly missed: failure modes most teams don't monitor for

5. Silent failure detection (jobs that succeed but accomplish nothing)

This is the failure mode with the highest damage-to-detection ratio. A job that consistently exits 0 while processing zero records will run undetected for weeks.

The fix is output metadata and alert rules:

await monitor.success({
  meta: {
    records_processed: result.count,
    api_calls_made: result.apiCalls,
    errors_encountered: result.errors,
  }
});

Then define rules: records_processed eq 0 → fire alert. errors_encountered gt 0 → fire alert.

This is available in Crontify as alert rules on job output metadata. Define them in the dashboard, per monitor.

6. Overlap detection

A new instance of a job starts before the previous one finishes. This usually means your job is taking longer than its schedule interval — a daily job that takes 26 hours will overlap. Consequences include data corruption, deadlocks, and doubled processing.

Crontify detects this automatically: when a start ping arrives for a monitor that already has a run in running state, the previous run is marked as overlapped and an alert fires.
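Detection tells you overlap happened; you can also prevent it at the source. A minimal sketch using Linux's flock(1) from util-linux — the second instance fails immediately instead of starting a duplicate run (the lock path and sleep are placeholders for the real job):

```shell
LOCK=/tmp/demo-job.lock

run_job() {
  (
    # -n: give up immediately if the previous run still holds the lock
    flock -n 9 || { echo "skipped: previous run still active"; exit 1; }
    sleep 1              # stand-in for the real job
    echo "done"
  ) 9>"$LOCK"
}

run_job &                # first instance acquires the lock
sleep 0.2
SECOND=$(run_job)        # second instance is refused while the first runs
wait
echo "$SECOND"
```

Prevention and detection are complementary: the lock stops the duplicate, the monitor tells you the schedule is too tight for the job's runtime.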

7. Gradual performance degradation

A job that took 5 minutes last month and takes 45 minutes today is heading toward failure even if it's currently completing. Unindexed tables, data growth, accumulated technical debt — these cause slow degradation that shows up as duration trends before they show up as failures.

Track duration over time and alert when it exceeds a threshold:

await monitor.success({
  meta: {
    duration_ms: Date.now() - startTime,
    records_processed: result.count,
  }
});

Define a rule: duration_ms gt 2700000 (45 minutes) → fire alert before the job actually starts failing.


The infrastructure checklist

Environment and configuration

  • All environment variables required by the job are explicitly set in the execution environment. Cron runs with a minimal PATH — don't assume shell profile variables are available.
  • Absolute paths are used for all scripts and binaries in crontab entries.
  • Timezone is explicitly configured for all scheduled jobs. Server timezone and job timezone should be treated as separate concerns.
  • Output is logged somewhere reviewable: your-job.sh >> /var/log/jobs/your-job.log 2>&1
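Put together, a checklist-compliant crontab entry looks something like this (paths, user, and schedule are illustrative; `CRON_TZ` is supported by cronie and most modern cron implementations):

```shell
# /etc/cron.d/nightly-backup
# Explicit PATH and timezone — cron's defaults are minimal.
CRON_TZ=UTC
PATH=/usr/local/bin:/usr/bin:/bin

# m  h  dom mon dow  user    command (absolute path, output logged)
30   2  *   *   *    deploy  /opt/app/bin/backup.sh >> /var/log/jobs/backup.log 2>&1
```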

Error handling

  • Every job has a top-level try/catch (or equivalent) that logs the error and reports it to the monitoring service.
  • All network calls have explicit timeouts. No naked fetch() or database calls with no timeout configured.
  • Database transactions have explicit timeouts to prevent indefinite lock waiting.
  • Jobs that process records in a loop handle individual record failures without crashing the entire run.
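The last point is worth sketching: a per-record loop where one bad record is logged and skipped instead of crashing the entire run (`processRecord` is a hypothetical stand-in for your real work):

```typescript
// Hypothetical stand-in for the real per-record work
async function processRecord(id: number): Promise<void> {
  if (id === 3) throw new Error(`record ${id} is malformed`);
}

async function runBatch(ids: number[]) {
  let processed = 0;
  const errors: string[] = [];
  for (const id of ids) {
    try {
      await processRecord(id);
      processed++;
    } catch (err) {
      // record the failure and keep going — don't crash the whole run
      errors.push((err as Error).message);
    }
  }
  return { processed, errors }; // report both in the success/fail ping meta
}

runBatch([1, 2, 3, 4]).then((r) => console.log(JSON.stringify(r)));
```

Reporting `processed` and `errors` in the job's output metadata also feeds the silent failure rules described above.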

Monitoring configuration

  • Every critical job has a monitor configured in an external monitoring service.
  • Each monitor has a grace period appropriate for its schedule and typical runtime.
  • Alert channels are configured and tested — use the test alert feature before going to production.
  • Alert rules are configured for silent failure detection on jobs where "ran but did nothing" is a meaningful failure.
  • Recovery alerts are enabled so the team knows when incidents are resolved.

Operational

  • Monitoring alerts route to the right people at the right severity. Backup failures wake someone up. Slow analytics jobs send a morning Slack message.
  • Runbooks exist for the most common job failures. When an alert fires at 2am, the on-call engineer shouldn't have to figure out what the job does from scratch.
  • Monitors are paused before planned maintenance windows and resumed afterward.

A taxonomy of background jobs by criticality

Not all background jobs require the same level of monitoring rigor. A practical classification:

Tier 1 — Data integrity and financial

Examples: database backups, billing runs, payment processing, financial data reconciliation.

Requirements: External missed run detection, hung job detection, failure alerts to on-call, log attachment, silent failure detection, overlap detection. Recovery alerts. Runbook required.

Tier 2 — User-facing data

Examples: user sync, email dispatch, report generation, API data ingestion.

Requirements: External missed run detection, failure alerts to a monitored Slack channel, silent failure detection for output validation, log attachment.

Tier 3 — Infrastructure and maintenance

Examples: cache warming, log rotation, analytics aggregation, search index updates.

Requirements: Failure alerts to a low-priority channel. Duration tracking recommended but not critical.

Tier 4 — Best-effort

Examples: thumbnail generation, social feed updates, non-critical notifications.

Requirements: Basic logging. Alert if failure rate exceeds a threshold, not on individual failures.


What to do when a job fails

When an alert fires, the response order matters:

  1. Check the alert — does it include log output? Error message? Duration? Use this before opening a terminal.
  2. Check the dashboard — look at recent run history. Is this the first failure or part of a pattern? Did the previous 10 runs succeed?
  3. Assess impact — what data or functionality is affected? Is anything broken for users right now?
  4. Decide: fix or reschedule — if the root cause is fixable quickly, fix it and trigger a manual run. If it requires investigation, trigger a manual run first to restore data freshness, then investigate.
  5. Resolve the monitor — once the job is running successfully, verify the recovery alert fires. Update the runbook if the failure revealed a gap.

Frequently asked questions

How many monitors do I need for a typical production system?

A typical production application has 10–30 scheduled jobs, and roughly half are Tier 1 or Tier 2, requiring full monitoring. Budget for 15–20 actively monitored jobs to start and expand from there; larger systems can run into the hundreds.

Should I monitor jobs that run every minute?

Only if every single execution matters. For minutely jobs, the alert overhead (potentially 1,440 alerts per day if something breaks) needs to be managed with appropriate cooldown periods and deduplication. Consider whether the data loss from a missed single execution is meaningful before monitoring at that granularity.

What's the right grace period for a job that runs at an unpredictable time?

If your job is triggered by events rather than a fixed schedule, consider using a heartbeat model instead — your job pings a URL at regular intervals to confirm it's alive, rather than running on a schedule. Crontify supports both models.
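A heartbeat is simple to wire up: the worker pings on a fixed interval for as long as it's alive. A sketch (`sendPing` stands in for the HTTP POST to your monitor URL):

```typescript
// Returns a stop function so the worker can end the heartbeat on shutdown.
function startHeartbeat(sendPing: () => void, intervalMs: number): () => void {
  sendPing(); // confirm liveness immediately on startup
  const timer = setInterval(sendPing, intervalMs);
  return () => clearInterval(timer);
}
```

If no ping arrives for longer than the interval plus a grace period, the monitor treats the worker as down.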


Start monitoring for free

Crontify is free for up to 5 monitors — no credit card required.

crontify.com — SDK on npm as @crontify/sdk.

