
April 1, 2026 · 6 min read

Why your cron job is getting slower (and how to detect it automatically)

A job that used to take 2 minutes now takes 45. Nothing errored. Here's how duration anomaly detection works and why it's one of the most underrated signals in scheduled job monitoring.


Performance degradation in cron jobs is slow and invisible. A job that processes a table in 2 minutes in January takes 8 minutes in March as the table grows, and 45 minutes by June. At some point it starts overlapping with the next scheduled run. By then you have a production incident that feels sudden but has been building for months.

No alert fired. Nothing errored. The job just kept getting slower.

This is the problem that duration anomaly detection solves: automatically flagging when a job runs significantly longer than its recent baseline, even when it completes successfully.


Why duration matters as a monitoring signal

Most monitoring focuses on binary outcomes: did the job run, did it finish, did it succeed or fail. Duration is treated as an informational metric rather than a signal worth alerting on.

That's a mistake. Duration tells you things that binary outcomes don't:

Upstream API degradation. A job that makes 200 API calls normally completes in 90 seconds. When the upstream starts throttling or slowing down, the same job takes 12 minutes. Exit code 0 both times. Duration is the only signal.

Table growth hitting an unindexed query. A query that runs against 100,000 rows takes 200ms. Against 10 million rows with the same schema, it takes 40 seconds. The job still completes successfully. Duration is the only signal.

Resource contention. A job that normally runs on a quiet server is now sharing CPU and I/O with a new process introduced in a recent deploy. The job still completes, but takes 3× as long. Duration is the only signal.

Data volume anomalies. A job that processes significantly more or fewer records than usual takes proportionally longer or shorter. An unexpectedly short duration can indicate that data is missing upstream — a form of silent failure detectable through time rather than output.
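Both directions of the check can be sketched as one small pure function. This is a hypothetical illustration (the function name and thresholds are not from any SDK), assuming a baseline duration has already been computed:

```typescript
// Two-sided duration check: flag runs that are much longer than baseline
// (degradation) or much shorter (possible missing upstream data).
function durationStatus(
  currentMs: number,
  baselineMs: number,
  slowMultiplier = 3, // flag runs slower than 3x baseline
  fastFraction = 0.25 // flag runs faster than 1/4 of baseline
): 'slow' | 'fast' | 'normal' {
  if (currentMs > baselineMs * slowMultiplier) return 'slow';
  if (currentMs < baselineMs * fastFraction) return 'fast';
  return 'normal';
}
```

The `fast` case is easy to forget: a nightly export that finishes in 10 seconds instead of 10 minutes usually means the input was empty, not that something got faster.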


How duration anomaly detection works

The basic approach: calculate a baseline duration from recent successful runs and alert when the current run exceeds a multiple of that baseline.

A simple implementation using the last N completed runs:

// `db` is assumed to be an initialized Prisma client with a `run` model
async function checkDurationAnomaly(
  monitorId: string,
  currentDurationMs: number,
  multiplier: number = 3,
  minSamples: number = 5
): Promise<boolean> {
  const recentRuns = await db.run.findMany({
    where: {
      monitorId,
      status: 'success',
      durationMs: { not: null },
    },
    orderBy: { startedAt: 'desc' },
    take: minSamples * 2, // take more than needed to have a stable sample
  });

  if (recentRuns.length < minSamples) {
    // Not enough history to establish a baseline
    return false;
  }

  const durations = recentRuns.map(r => r.durationMs!);
  const avgDuration = durations.reduce((a, b) => a + b, 0) / durations.length;

  return currentDurationMs > avgDuration * multiplier;
}

A few details worth considering in production:

Use the median, not the mean. A single abnormally long run can inflate the average significantly, which raises the baseline and makes anomaly detection less sensitive. The median is more stable.

Require a minimum sample count. Don't flag anomalies on the first few runs — there's no baseline yet. Wait until you have at least 5–10 data points.

Exclude outliers from the baseline. If you're using previous anomalous runs to calculate the baseline, you'll drift over time. Consider excluding runs that were themselves flagged as anomalies from the baseline calculation.

Consider time-of-day effects. A job that runs at 3am on a quiet server and also at 2pm under load will have natural duration variation. Calculating separate baselines per time window reduces false positives.
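The first two refinements above (median instead of mean, excluding prior anomalies) can be folded into a pure baseline function. A minimal sketch, operating on an in-memory array of durations rather than the database query shown earlier:

```typescript
// Median of a list of durations (ms).
function median(values: number[]): number {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 0
    ? (sorted[mid - 1] + sorted[mid]) / 2
    : sorted[mid];
}

// Baseline = median of recent durations, after dropping runs that were
// themselves anomalous relative to the raw median. This keeps one slow
// run from drifting the baseline upward over time.
function medianBaseline(durationsMs: number[], multiplier = 3): number {
  const rawMedian = median(durationsMs);
  const filtered = durationsMs.filter(d => d <= rawMedian * multiplier);
  return median(filtered);
}

function isAnomalous(
  currentMs: number,
  historyMs: number[],
  multiplier = 3
): boolean {
  return currentMs > medianBaseline(historyMs, multiplier) * multiplier;
}
```

With a history of `[100, 110, 105, 5000, 95, 100]`, the mean would be skewed past 900ms by the single 5000ms outlier; the filtered median stays at 100ms, so a 400ms run is still flagged.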


Interpreting duration anomalies

A duration anomaly is a signal that warrants investigation, not necessarily a production incident. The goal is to surface it early, before it becomes one.

When a job takes 3× longer than its baseline, the right response is to look at:

Query performance. Run EXPLAIN ANALYZE on the queries in the job. Check whether table statistics are stale and need a fresh ANALYZE. Look for sequential scans on large tables.

External API latency. If the job makes HTTP calls, check whether response times from the upstream have increased. A simple timing log around each external call costs nothing and is invaluable for diagnosis.

Data volume. Did the job process significantly more records than usual? If so, is the growth expected, or is it a sign that something is producing records faster than it should?

Concurrency. Is the job running at the same time as something new? Recent deploys that introduced a background process, a new cron job, or a database migration can create resource contention that only manifests under combined load.
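The timing log suggested for external calls can be a single generic wrapper. `timed` here is a hypothetical helper, not part of any SDK:

```typescript
// Hypothetical helper: wraps any async call and logs how long it took.
// Putting this around each external API call makes it easy to trace a
// duration anomaly to a specific dependency.
async function timed<T>(label: string, fn: () => Promise<T>): Promise<T> {
  const start = Date.now();
  try {
    return await fn();
  } finally {
    // Logged even if fn throws, so slow failures are visible too.
    console.log(`${label}: ${Date.now() - start}ms`);
  }
}

// Example usage (api.getUsers is a placeholder for your own call):
// const users = await timed('fetch-users', () => api.getUsers());
```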


Automatic detection in practice

Crontify runs duration anomaly detection automatically for every monitored job. After a minimum of 5 successful runs establishes a baseline, any run that takes more than 3× the average duration is flagged in the run history and can trigger an alert.

The detection is entirely passive — no code changes are needed beyond the standard start/success/fail pings that already record timing. You get duration trend charts per monitor, showing how job duration has changed over time, alongside an anomaly flag on individual runs.

import { CrontifyMonitor } from '@crontify/sdk';

const monitor = new CrontifyMonitor({
  apiKey: process.env.CRONTIFY_API_KEY!,
  monitorId: 'your-monitor-id',
});

// wrap() records start and end time automatically
await monitor.wrap(async () => {
  await processRecords();
});

Duration anomaly detection requires no configuration beyond the ping instrumentation you're already doing. The baseline builds automatically from your run history.

Crontify is free for up to 5 monitors — no credit card required.


Start monitoring your scheduled jobs

Free plan includes 5 monitors. No credit card required. Up and running in under 5 minutes.

Get started free →