March 2026
Your cron jobs will fail. Not might. Will. The question is whether you find out from your monitoring or from a customer asking why their report didn't arrive.
Most teams only monitor one failure mode: the job crashes. That's the easy one. Here are the five that actually happen, ranked by how much damage they do before anyone notices.
1. The job never runs at all
Someone reboots the server and the crontab doesn't survive. Someone edits the crontab and accidentally comments out the wrong line. Someone migrates to a new container image and forgets to copy the cron configuration. The server runs out of disk and crond quietly stops scheduling.
This is the most common failure and the hardest to catch with traditional monitoring, because there's nothing to alert on. No error. No log line. No exit code. The job just... doesn't run. And nobody notices until the downstream thing it was doing stops being done.
How to catch it: Heartbeat monitoring. Instead of watching for failure, watch for the absence of success. Your job pings an endpoint every time it runs. If the ping stops, you get alerted. This is the only approach that catches "the job didn't run at all."
# Instead of just running the job:
0 * * * * /usr/local/bin/backup.sh
# Run it and ping when it succeeds:
0 * * * * /usr/local/bin/backup.sh && curl -s https://ping.trebben.dk/p/your-token
2. The job runs, but the result is garbage
Your backup script exits 0 even though mysqldump isn't installed on the new server, because bash logged the error and kept going. Your ETL pipeline runs successfully — on an empty dataset, because the source API changed its authentication and the fetch returned an empty response instead of an error.
The job ran. It "succeeded." The result is garbage.
How to catch it: Two things. First, use set -euo pipefail at the top of every bash script so failures actually propagate. Second, validate outputs, not just exit codes. Check that the backup file is larger than zero bytes. Check that the row count is within expected range.
#!/bin/bash
set -euo pipefail
mysqldump mydb > /tmp/backup.sql
# Validate: is the backup actually there and non-empty?
if [ ! -s /tmp/backup.sql ]; then
echo "Backup file is empty" >&2
exit 1
fi
# Only ping on verified success
curl -s https://ping.trebben.dk/p/your-token
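The same idea extends to the row-count check mentioned above. A minimal sketch — the export path, the sample data, and the minimum threshold are all illustrative, not from a real pipeline:

```shell
#!/bin/bash
set -euo pipefail

# Illustrative only: path, contents, and threshold are assumptions.
export_file=/tmp/export.csv
printf 'a\nb\nc\n' > "$export_file"   # stand-in for real ETL output

rows=$(wc -l < "$export_file")
min_rows=2
if [ "$rows" -lt "$min_rows" ]; then
    echo "Row count $rows below expected minimum $min_rows" >&2
    exit 1
fi
echo "row count ok: $rows"
```

In a real job, the final line would be the success ping instead of an echo — the point is that the ping fires only after the output passes a sanity check.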
3. Overlapping runs

Your hourly job takes 90 minutes. Cron doesn't know or care. It launches a second instance. Now you have two processes fighting over the same files, double-inserting into the same database, or sending duplicate emails. And it only happens when load is high — which is exactly when you're least likely to notice.
How to catch it: Use flock to prevent overlapping runs. This is the simplest fix in all of systems administration, and almost nobody does it.
# flock prevents concurrent execution
0 * * * * flock -n /tmp/etl.lock /usr/local/bin/etl.sh && curl -s https://ping.trebben.dk/p/your-token
The -n flag means "don't wait, just fail if locked." If the previous run is still going, this invocation exits immediately. No overlap, no corruption.
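If you'd rather not lengthen the crontab line, the lock can be taken inside the script itself. A sketch of the standard file-descriptor pattern, reusing the /tmp/etl.lock path from the example above (the job body is a placeholder):

```shell
#!/bin/bash
set -euo pipefail

# Hold the lock for the lifetime of the script via file descriptor 200.
exec 200>/tmp/etl.lock
if ! flock -n 200; then
    # A previous run still holds the lock; bail out cleanly.
    echo "previous run still in progress, skipping" >&2
    exit 0
fi
echo "lock acquired, running job"
# ... actual job body goes here ...
```

The kernel closes the descriptor when the process exits, even on a crash, so the lock is released automatically — no stale lockfile cleanup needed.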
4. Timezone and schedule confusion

Your server is in UTC. Your crontab says 0 9 * * * because you want the report at 9am. But 9am where? After a DST transition, your local 9am is an hour away from the fixed UTC schedule. After a server migration to a region with a different system timezone, it can be off by three.
Cron expression syntax is also harder than it looks. */15 9-17 * * 1-5 seems clear until you have to verify how the day-of-week field is numbered on your system: Sunday is 0 in most implementations, but many also accept 7, so both values mean the same day.
How to catch it: Always use UTC in your crontab. Add CRON_TZ=UTC or set the system timezone explicitly. Use a cron expression tool to verify your schedule does what you think it does. And monitor the actual run time — if a job that should run at 09:00 starts pinging at 12:00, something changed.
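Putting that advice into crontab form — the report script name is a placeholder, and CRON_TZ is supported by cronie and other modern Vixie-derived crons, so check your daemon's documentation:

```shell
# Pin the schedule to UTC regardless of the system locale
CRON_TZ=UTC

# 09:00 UTC, Monday through Friday
0 9 * * 1-5 /usr/local/bin/report.sh && curl -s https://ping.trebben.dk/p/your-token
```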
5. Environment drift

The database credentials rotated. The API token expired. The NFS mount went stale. The DNS resolver is timing out. The security group changed.
These fail differently from a code bug because the job itself is fine — the environment around it shifted. And they're intermittent, which is worse. The job works Tuesday, fails Wednesday because of a network blip, works Thursday. Your logs show one failure between two successes, so nobody investigates.
How to catch it: This is where heartbeat monitoring with grace periods helps. If your hourly job misses a single ping, maybe that's a transient failure. If it misses two in a row, something is wrong. Set your alert threshold to match your tolerance.
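A client-side complement to a server-side grace period: let the ping itself retry, so a transient network blip between the job and the monitoring endpoint doesn't register as a missed heartbeat. These are standard curl flags; the script name is a placeholder and the endpoint is the same one used throughout:

```shell
# -f: treat HTTP errors as failures; retry up to 3 times, 10s apart
0 * * * * /usr/local/bin/etl.sh && curl -fsS --retry 3 --retry-delay 10 https://ping.trebben.dk/p/your-token
```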
Notice that only one of these (#2, partially) is caught by "check the exit code." Error-based monitoring catches crashes. It misses everything else.
The approach that catches all five is heartbeat monitoring — expecting a positive signal on a schedule. If the signal stops, something is wrong. You don't need to enumerate every failure mode. You just need to notice when success stops happening.
This is why I built CronPulse.
Add && curl to your crontab. If the curl stops arriving, you get an email. No agents, no configuration files, no complexity. One curl, one endpoint, one alert.