March 2026 · Practical guide
Kubernetes CronJobs fail in ways that traditional cron never did.
A cron job either runs or it doesn't. A Kubernetes CronJob can fail
to schedule, fail to pull an image, get OOMKilled mid-execution,
exceed its backoff limit, or get silently skipped by a concurrency
policy — and kubectl get cronjobs will still
show a last-schedule time as if everything worked.
If you're coming from traditional cron monitoring, the mental model needs to shift. The job running is no longer the hard part. The hard part is knowing whether it completed successfully through all the layers between your CronJob spec and the actual work getting done.
Before picking a monitoring approach, understand the failure modes. Each one requires different detection:
- **Never scheduled.** The controller never creates a Job object, so
  `kubectl get jobs` shows nothing for that time window.
- **Stuck Pending.** Node affinity rules, insufficient resources, or
  taints prevent scheduling. The job appears "active" indefinitely.
- **ImagePullBackOff.** Registry credentials expired, the image tag was
  deleted, or network issues. Common after deployments that update the
  CronJob image tag.
- **Application failures.** Pods exit non-zero, are retried (up to
  `backoffLimit`), and fail repeatedly until the limit is reached.
- **Concurrency skips.** With `concurrencyPolicy: Forbid`, if a previous
  run is still active when the next schedule arrives, the run is
  silently skipped. No event, no error, no log. The most invisible
  failure mode.
- **Missed starting deadline.** If `startingDeadlineSeconds` passes
  before the job starts (controller downtime, resource pressure), the
  run is abandoned. Without a deadline, the controller keeps counting
  every run missed since the last schedule, and past 100 missed runs it
  stops scheduling the job entirely.
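The last two failure modes leave nothing behind to inspect: no failed Job, no event. The only signal is a stale last-schedule time. A minimal check along those lines might look like this (a sketch: `missed_run`, the two-interval grace period, and the GNU `date -d` dependency are choices made here, not a kubectl feature):

```shell
#!/bin/sh
# Return success (0) when the gap since the last schedule exceeds twice
# the expected interval, i.e. at least one run was silently missed.
# Assumes GNU date (`date -d`); BSD/macOS date needs `-j -f` instead.
missed_run() {
  last="$1"        # RFC 3339 timestamp, e.g. from .status.lastScheduleTime
  interval_s="$2"  # expected seconds between runs
  now="${3:-$(date -u +%s)}"
  last_s=$(date -u -d "$last" +%s) || return 2
  [ $(( now - last_s )) -gt $(( interval_s * 2 )) ]
}

# Feed it the CronJob's status, for example:
#   missed_run "$(kubectl get cronjob nightly-backup \
#     -o jsonpath='{.status.lastScheduleTime}')" 3600 && echo "run missed"
```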
Approach 1: Poll job status with kubectl
The simplest approach. Poll job status from outside the cluster (or inside, via a monitoring pod).
# List recent jobs for a CronJob, sorted by start time:
kubectl get jobs -l app=nightly-backup \
  --sort-by=.status.startTime

# Check for jobs with no successful completions:
kubectl get jobs --field-selector=status.successful=0 \
  --sort-by=.status.startTime
# Get pod status for a specific job:
kubectl get pods --selector=job-name=nightly-backup-28517280
# Check CronJob last schedule time:
kubectl get cronjob nightly-backup \
-o jsonpath='{.status.lastScheduleTime}'
You can wrap this in a script that runs on a timer and sends alerts. Crude but effective for small clusters.
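Such a wrapper might look like the sketch below. The `format_alerts` name, the label, and the webhook pipeline are placeholders for whatever alerting you already run; separating the formatting from the kubectl call keeps it testable without a cluster.

```shell
#!/bin/sh
# Turn a list of job names (as printed by `kubectl get ... -o name`)
# into one alert line each.
format_alerts() {
  while IFS= read -r job; do
    [ -n "$job" ] && printf 'ALERT: %s has no successful completions\n' "$job"
  done
}

# On a timer (systemd timer, external cron, or a CronJob of its own):
#   kubectl get jobs -l app=nightly-backup \
#     --field-selector=status.successful=0 -o name \
#     | format_alerts \
#     | while IFS= read -r msg; do
#         curl -sf --max-time 10 -d "$msg" "$WEBHOOK_URL"
#       done
```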
| Catches | Misses |
|---|---|
| Failed jobs, stuck pods, missing runs | Requires external scheduler (another cron problem), polling delay |
Approach 2: Prometheus and kube-state-metrics
If you already run Prometheus, kube-state-metrics exposes the metrics you need. No extra agents.
# Alert when a job fails:
- alert: KubeCronJobFailed
  expr: kube_job_status_failed > 0
  for: 1m
  labels:
    severity: warning
  annotations:
    summary: "Job {{ $labels.job_name }} failed"

# Alert when a CronJob misses its schedule. kube-state-metrics exposes
# no schedule-interval metric, so compare the current time against the
# next scheduled run plus a grace period:
- alert: KubeCronJobMissed
  expr: |
    time() - kube_cronjob_next_schedule_time > 300
  for: 10m
  labels:
    severity: critical
  annotations:
    summary: "CronJob {{ $labels.cronjob }} missed its schedule"
This is the right approach if you have Prometheus already. Setting it up just for CronJob monitoring is overkill.
| Catches | Misses |
|---|---|
| Failed jobs, missed schedules, stuck jobs | Doesn't verify job output is correct, complex to set up from scratch |
Approach 3: Watch Kubernetes events
Kubernetes emits events for CronJob failures. You can watch for specific event reasons:
# Watch for a specific failure reason. Field selectors can't OR
# multiple values for the same field, so query one reason per call:
kubectl get events --field-selector reason=BackoffLimitExceeded \
  --sort-by='.lastTimestamp'

# Or filter several reasons at once with jq:
kubectl get events -o json | jq '.items[] |
  select(.reason == "BackoffLimitExceeded" or
         .reason == "FailedCreate" or
         .reason == "DeadlineExceeded") |
  {reason: .reason, name: .involvedObject.name,
   time: .lastTimestamp, message: .message}'
Tools like kubewatch or a Kubernetes event exporter can pipe these to Slack or webhooks. This reacts faster than polling because events are pushed rather than pulled.
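In a script, the same jq filter can be collapsed to an exit code. A sketch (the `events_failed` name is made up here, and it assumes `jq` is installed):

```shell
#!/bin/sh
# Exit 0 when the event stream contains any of the failure reasons,
# non-zero otherwise. stdin: output of `kubectl get events -o json`.
events_failed() {
  jq -e '[.items[]
          | select(.reason == "BackoffLimitExceeded"
                or .reason == "FailedCreate"
                or .reason == "DeadlineExceeded")]
         | length > 0' >/dev/null
}

# Usage:
#   kubectl get events -o json | events_failed && echo "failure events present"
```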
| Catches | Misses |
|---|---|
| Explicit failures with events (BackoffLimitExceeded, DeadlineExceeded, FailedCreate) | ConcurrencyPolicy skips (no event emitted), events expire after 1 hour by default |
Approach 4: Heartbeat monitoring
The job itself reports success by pinging an external endpoint. If the ping doesn't arrive within the expected window, something went wrong — anywhere in the chain from scheduling to completion.
# A full CronJob manifest with the heartbeat ping in the container command:
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-backup
spec:
  schedule: "0 2 * * *"
  concurrencyPolicy: Forbid
  startingDeadlineSeconds: 300
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: backup
            image: your-backup-image:latest
            command:
            - /bin/sh
            - -c
            - |
              /usr/local/bin/backup.sh && \
              curl -sf https://ping.trebben.dk/p/YOUR_SLUG
          restartPolicy: Never
      backoffLimit: 2
This is the only approach that catches every failure mode, including ConcurrencyPolicy skips. If the job didn't complete successfully, the ping doesn't arrive. The monitoring endpoint doesn't care why — scheduling failure, OOMKilled, image pull error, or application bug all look the same: a missing ping.
| Catches | Misses |
|---|---|
| All failure modes including silent skips, scheduling failures, partial execution | Requires outbound HTTP from pods (most clusters allow this), no detail on why it failed |
They're not mutually exclusive. The best setup combines two layers:
For a small cluster with a handful of CronJobs, heartbeat monitoring
alone is sufficient. You'll see the alert, SSH in, and run
kubectl describe job to find the cause.
For larger clusters, Prometheus gives you dashboards and historical data, but it can't catch the scheduling-level failures that never produce a pod. Heartbeat monitoring covers that gap.
Before monitoring, get the spec right. These three fields prevent the most common surprises:
spec:
  # Don't let runs overlap. Skipped runs are caught by heartbeat monitoring:
  concurrencyPolicy: Forbid
  # If the controller can't start the job within 5 minutes, skip it:
  startingDeadlineSeconds: 300
  # Keep completed/failed jobs for debugging (defaults are 3 and 1):
  successfulJobsHistoryLimit: 5
  failedJobsHistoryLimit: 5
  jobTemplate:
    spec:
      # Retry failed runs (but not forever):
      backoffLimit: 2
      # Kill jobs that run too long:
      activeDeadlineSeconds: 3600
startingDeadlineSeconds is the most important and most overlooked.
Without it, the controller keeps counting every schedule missed while
it was down; once more than 100 pile up, it refuses to schedule the
job at all and only logs an error. With it, stale runs are simply
dropped and scheduling resumes cleanly.
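The arithmetic makes the point. Illustrative numbers only (a `*/5` schedule and an hour of downtime):

```shell
#!/bin/sh
# An hour of controller downtime on a five-minute schedule:
downtime_s=3600
interval_s=300
echo "$(( downtime_s / interval_s )) missed runs to account for"
# prints: 12 missed runs to account for
```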
Add a curl to the end of your container command,
gated by && so it only fires on success:
# Simple — ping after success:
command: ["/bin/sh", "-c", "/app/run.sh && curl -sf https://ping.trebben.dk/p/YOUR_SLUG"]
# With timeout — don't let curl hang if the endpoint is slow:
command: ["/bin/sh", "-c", "/app/run.sh && curl -sf --max-time 10 https://ping.trebben.dk/p/YOUR_SLUG"]
# If your job uses init containers for setup, nothing changes:
# only the final (main) container command needs the ping.
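If the image has a shell, a small retry wrapper keeps a transient network blip from turning a successful job into a missing heartbeat. A sketch (the `ping_success` name, retry count, and backoff are choices made here, not part of any API):

```shell
#!/bin/sh
# Retry the success ping a few times before giving up, so one dropped
# packet doesn't register as a failed job.
ping_success() {
  url="$1"
  for attempt in 1 2 3; do
    curl -sf --max-time 10 "$url" && return 0
    sleep "$attempt"   # brief backoff between attempts
  done
  return 1
}

# In the container command:
#   /app/run.sh && ping_success https://ping.trebben.dk/p/YOUR_SLUG
```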
CronPulse does exactly this.
20 monitors free. No agents, no containers, no config files.
Add && curl to your Kubernetes CronJob command — get alerted within minutes when a job stops.
The -f flag makes curl return a non-zero exit
code on HTTP errors, so a rejected ping (expired slug, server error)
surfaces as a failed container instead of passing silently.
Related: Why your cron job isn't running · Monitoring systemd timers · How to get alerted when a cron job fails · How to monitor cron jobs