
Five ways cron jobs fail (and how to catch each one)

March 2026

Your cron jobs will fail. Not might. Will. The question is whether you find out from your monitoring or from a customer asking why their report didn't arrive.

Most teams only monitor one failure mode: the job crashes. That's the easy one. Here are the five that actually happen, ranked by how much damage they do before anyone notices.

1. The job stops running entirely

Someone reboots the server and the crontab doesn't survive. Someone edits the crontab and accidentally comments out the wrong line. Someone migrates to a new container image and forgets to copy the cron configuration. The server runs out of disk and crond quietly stops scheduling.

This is the most common failure and the hardest to catch with traditional monitoring, because there's nothing to alert on. No error. No log line. No exit code. The job just... doesn't run. And nobody notices until the downstream thing it was doing stops being done.

How to catch it: Heartbeat monitoring. Instead of watching for failure, watch for the absence of success. Your job pings an endpoint every time it runs. If the ping stops, you get alerted. This is the only approach that catches "the job didn't run at all."

# Instead of just running the job:
0 * * * * /usr/local/bin/backup.sh

# Run it and ping when it succeeds:
0 * * * * /usr/local/bin/backup.sh && curl -s https://ping.trebben.dk/p/your-token

2. The job runs but silently fails

Your backup script exits 0 even though mysqldump isn't installed on the new server: without set -e, bash keeps going past the failed command and reports the exit code of whatever ran last. Your ETL pipeline runs successfully — on an empty dataset, because the source API changed its authentication and the fetch returned an empty response instead of an error.

The job ran. It "succeeded." The result is garbage.

How to catch it: Two things. First, use set -euo pipefail at the top of every bash script so failures actually propagate. Second, validate outputs, not just exit codes. Check that the backup file is larger than zero bytes. Check that the row count is within expected range.

#!/bin/bash
set -euo pipefail

mysqldump mydb > /tmp/backup.sql

# Validate: is the backup actually there and non-empty?
if [ ! -s /tmp/backup.sql ]; then
  echo "Backup file is empty" >&2
  exit 1
fi

# Only ping on verified success
curl -s https://ping.trebben.dk/p/your-token
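The same validate-then-ping pattern extends past file size. Here's a sketch of the row-count check mentioned above, as a reusable function; the file path, threshold, and function name are illustrative, not fixed conventions:

```shell
#!/bin/bash
set -euo pipefail

# Fail unless the CSV has at least $2 data rows (header line excluded).
validate_rows() {
  local file=$1 min=$2
  local count
  count=$(( $(wc -l < "$file") - 1 ))
  if [ "$count" -lt "$min" ]; then
    echo "Export has only $count rows (expected >= $min)" >&2
    return 1
  fi
}

# In the cron script, gate the ping on the check:
# validate_rows /tmp/export.csv 1000 && curl -s https://ping.trebben.dk/p/your-token
```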

3. The job runs twice at the same time

Your hourly job takes 90 minutes. Cron doesn't know or care. It launches a second instance. Now you have two processes fighting over the same files, double-inserting into the same database, or sending duplicate emails. And it only happens when load is high — which is exactly when you're least likely to notice.

How to catch it: Use flock to prevent overlapping runs. This is the simplest fix in all of systems administration and almost nobody does it.

# flock prevents concurrent execution
0 * * * * flock -n /tmp/etl.lock /usr/local/bin/etl.sh && curl -s https://ping.trebben.dk/p/your-token

The -n flag means "don't wait, just fail if locked." If the previous run is still going, this invocation exits immediately. No overlap, no corruption.
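Skip-on-contention isn't the only option. If a delayed run is better than a dropped one, flock's -w flag waits for the lock with a bounded timeout; a sketch, assuming a 10-minute wait is acceptable for this particular job:

```shell
# Wait up to 600 seconds for the lock instead of failing immediately.
# If the previous run still holds it after that, flock gives up with a
# nonzero exit, so the && ping is skipped and the missed heartbeat alerts you.
0 * * * * flock -w 600 /tmp/etl.lock /usr/local/bin/etl.sh && curl -s https://ping.trebben.dk/p/your-token
```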

4. The job runs at the wrong time

Your server is in UTC. Your crontab says 0 9 * * * because you want the report at 9am. But 9am where? Cron fires on the server's clock, so when daylight saving time shifts your local offset, the report lands an hour off. After a migration to a server configured with a different timezone, it can be off by several.

Cron expression syntax is also harder than it looks. */15 9-17 * * 1-5 seems clear until you have to verify how your system numbers the day-of-week field: in most crons Sunday is 0, many also accept 7, and implementations differ on the details, so the same expression can mean different things on different systems.

How to catch it: Always use UTC in your crontab. Add CRON_TZ=UTC (supported by cronie and some other cron implementations, but not all; check yours) or set the system timezone explicitly. Use a cron expression tool to verify your schedule does what you think it does. And monitor the actual run time: if a job that should run at 09:00 starts pinging at 12:00, something changed.
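In crons that honor it, CRON_TZ pins the schedule's timezone regardless of the host clock. A sketch of a crontab using it (report.sh and the token are placeholders):

```shell
# CRON_TZ is supported by cronie and some other crons; verify yours before relying on it.
CRON_TZ=UTC

# 09:00 UTC, regardless of the host's local timezone or DST rules.
0 9 * * * /usr/local/bin/report.sh && curl -s https://ping.trebben.dk/p/your-token
```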

5. The job runs but can't reach what it needs

The database credentials rotated. The API token expired. The NFS mount went stale. The DNS resolver is timing out. The security group changed.

These fail differently from a code bug because the job itself is fine — the environment around it shifted. And they're intermittent, which is worse. The job works Tuesday, fails Wednesday because of a network blip, works Thursday. Your logs show one failure between two successes, so nobody investigates.

How to catch it: This is where heartbeat monitoring with grace periods helps. If your hourly job misses a single ping, maybe that's a transient failure. If it misses two in a row, something is wrong. Set your alert threshold to match your tolerance.
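The heartbeat itself can also be hardened against the same transient blips, so one failed DNS lookup doesn't register as a missed run. A sketch using curl's built-in --retry and --max-time flags:

```shell
# Retry the ping up to 3 times on transient errors, capping each attempt at
# 10 seconds so a hung connection can't stall the job after its real work is done.
0 * * * * /usr/local/bin/etl.sh && curl -sf --retry 3 --max-time 10 https://ping.trebben.dk/p/your-token
```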

The pattern

Notice that only one of these (#2, partially) is caught by "check the exit code." Error-based monitoring catches crashes. It misses everything else.

The approach that catches all five is heartbeat monitoring — expecting a positive signal on a schedule. If the signal stops, something is wrong. You don't need to enumerate every failure mode. You just need to notice when success stops happening.

This is why I built CronPulse. Add && curl to your crontab. If the curl stops arriving, you get an email. No agents, no configuration files, no complexity. One curl, one endpoint, one alert.
