
How to monitor Kubernetes CronJobs

March 2026 · Practical guide

Kubernetes CronJobs fail in ways that traditional cron never did. A cron job either runs or it doesn't. A Kubernetes CronJob can fail to schedule, fail to pull an image, get OOMKilled mid-execution, exceed its backoff limit, or get silently skipped by a concurrency policy — and kubectl get cronjobs will still show a last-schedule time as if everything worked.

If you're coming from traditional cron monitoring, the mental model needs to shift. The job running is no longer the hard part. The hard part is knowing whether it completed successfully through all the layers between your CronJob spec and the actual work getting done.

The six ways Kubernetes CronJobs fail

Before picking a monitoring approach, understand the failure modes. Each one requires different detection:

  1. Scheduling failures. The CronJob controller can't create a Job object. Happens when ResourceQuotas are exhausted or the controller is down. No pod is ever created. kubectl get jobs shows nothing for that time window.
  2. Pod scheduling failures. The Job exists but the pod is stuck in Pending. Node affinity rules, insufficient resources, or taints preventing scheduling. The job appears "active" indefinitely.
  3. Image pull failures. Pod is created but stuck in ImagePullBackOff. Registry credentials expired, image tag deleted, or network issues. Common after deployments that update the CronJob image tag.
  4. OOMKilled / resource limits. Pod starts, runs partway through, gets killed. Exit code 137. The job may retry (depending on backoffLimit) and fail repeatedly until the limit is reached.
  5. ConcurrencyPolicy skipping. With concurrencyPolicy: Forbid, if a previous run is still active when the next schedule arrives, the run is silently skipped. No event, no error, no log. The most invisible failure mode.
  6. Deadline exceeded. If startingDeadlineSeconds passes before the job starts (controller downtime, resource pressure), the run is abandoned. With the default of no deadline, missed runs pile up and may all execute at once when pressure clears.
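These modes can often be told apart from the command line. A sketch (the pod name below is a hypothetical example — list the job's pods first to find yours):

```shell
# Why did the last container die? "OOMKilled" indicates failure mode 4:
kubectl get pod nightly-backup-28517280-abcde \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'

# A growing gap between these two timestamps points at modes 2-6 —
# runs are being scheduled but not finishing successfully:
kubectl get cronjob nightly-backup \
  -o jsonpath='{.status.lastScheduleTime}{"\n"}{.status.lastSuccessfulTime}{"\n"}'
```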

Four monitoring approaches

Approach 1: kubectl checks

The simplest approach. Poll job status from outside the cluster (or inside, via a monitoring pod).

# List recent jobs for a CronJob, sorted by start time:
kubectl get jobs -l app=nightly-backup \
  --sort-by=.status.startTime

# Check for jobs that haven't completed successfully:
kubectl get jobs --field-selector=status.successful=0 \
  --sort-by=.status.startTime

# Get pod status for a specific job:
kubectl get pods --selector=job-name=nightly-backup-28517280

# Check CronJob last schedule time:
kubectl get cronjob nightly-backup \
  -o jsonpath='{.status.lastScheduleTime}'

You can wrap this in a script that runs on a timer and sends alerts. Crude but effective for small clusters.
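A minimal sketch of such a script, assuming kubectl and jq are on PATH and a generic JSON webhook for alerts (the URL is a placeholder):

```shell
#!/bin/sh
# check-cronjobs.sh — poll for failed jobs and post an alert.
# WEBHOOK_URL is a placeholder; point it at your alerting endpoint.
WEBHOOK_URL="https://hooks.example.com/your-webhook"

# Names of Jobs with at least one failed pod:
failures=$(kubectl get jobs -o json | jq -r '
  .items[] | select((.status.failed // 0) > 0) | .metadata.name')

if [ -n "$failures" ]; then
  curl -sf -X POST -H "Content-Type: application/json" \
    -d "{\"text\": \"Failed jobs: $(echo "$failures" | tr '\n' ' ')\"}" \
    "$WEBHOOK_URL"
fi
```

Run it from a systemd timer or an external scheduler — which is, as the table below notes, the catch: you're monitoring cron with more cron.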

Catches: Failed jobs, stuck pods, missing runs
Misses: Requires external scheduler (another cron problem), polling delay

Approach 2: Prometheus + kube-state-metrics

If you already run Prometheus, kube-state-metrics exposes the metrics you need. No extra agents.

# Alert when a job fails:
- alert: KubeCronJobFailed
  expr: kube_job_status_failed > 0
  for: 1m
  labels:
    severity: warning
  annotations:
    summary: "CronJob {{ $labels.job_name }} failed"

# Alert when a CronJob hasn't run on schedule. kube-state-metrics doesn't
# expose the schedule interval, but it does expose the next expected run
# time — alert when that time is well in the past:
- alert: KubeCronJobMissed
  expr: time() - kube_cronjob_next_schedule_time > 300
  for: 10m
  labels:
    severity: critical
  annotations:
    summary: "CronJob {{ $labels.cronjob }} missed schedule"

This is the right approach if you have Prometheus already. Setting it up just for CronJob monitoring is overkill.

Catches: Failed jobs, missed schedules, stuck jobs
Misses: Doesn't verify job output is correct, complex to set up from scratch

Approach 3: Kubernetes Events + alerting

Kubernetes emits events for CronJob failures. You can watch for specific event reasons:

# Watch for CronJob-related events:
# Watch for CronJob-related events. Field selector terms are ANDed, so
# query one failure reason at a time (or use the jq filter below for OR):
kubectl get events --field-selector reason=BackoffLimitExceeded \
  --sort-by='.lastTimestamp'

# In a monitoring script:
kubectl get events -o json | jq '.items[] |
  select(.reason == "BackoffLimitExceeded" or
         .reason == "FailedCreate" or
         .reason == "DeadlineExceeded") |
  {reason: .reason, name: .involvedObject.name,
   time: .lastTimestamp, message: .message}'

Tools like kubewatch or Kubernetes event exporters can pipe these to Slack or webhooks. Reacts faster than polling because events are pushed.

Catches: Explicit failures with events (BackoffLimitExceeded, DeadlineExceeded, FailedCreate)
Misses: ConcurrencyPolicy skips (no event emitted), events expire after 1 hour by default

Approach 4: Heartbeat monitoring

The job itself reports success by pinging an external endpoint. If the ping doesn't arrive within the expected window, something went wrong — anywhere in the chain from scheduling to completion.

# A full CronJob manifest with the ping appended to the container command:
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-backup
spec:
  schedule: "0 2 * * *"
  concurrencyPolicy: Forbid
  startingDeadlineSeconds: 300
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: backup
            image: your-backup-image:latest
            command:
            - /bin/sh
            - -c
            - |
              /usr/local/bin/backup.sh && \
              curl -sf https://ping.trebben.dk/p/YOUR_SLUG
          restartPolicy: Never
      backoffLimit: 2

This is the only approach that catches every failure mode, including ConcurrencyPolicy skips. If the job didn't complete successfully, the ping doesn't arrive. The monitoring endpoint doesn't care why — scheduling failure, OOMKilled, image pull error, or application bug all look the same: a missing ping.

Catches: All failure modes including silent skips, scheduling failures, partial execution
Misses: Requires outbound HTTP from pods (most clusters allow this), no detail on why it failed

Which approach to use

They're not mutually exclusive. The best setup combines two layers:

For a small cluster with a handful of CronJobs, heartbeat monitoring alone is sufficient. You'll see the alert, connect to the cluster, and run kubectl describe job to figure out the cause.

For larger clusters, Prometheus gives you dashboards and historical data, but it can't catch the scheduling-level failures that never produce a pod. Heartbeat monitoring covers that gap.

CronJob spec hardening

Before monitoring, get the spec right. These three fields prevent the most common surprises:

spec:
  # Don't let runs overlap. Skipped runs are caught by heartbeat monitoring:
  concurrencyPolicy: Forbid

  # If the controller can't start the job within 5 minutes, skip it:
  startingDeadlineSeconds: 300

  # Keep completed/failed jobs for debugging (default is 3/1):
  successfulJobsHistoryLimit: 5
  failedJobsHistoryLimit: 5

  jobTemplate:
    spec:
      # Retry failed runs (but not forever):
      backoffLimit: 2

      # Kill jobs that run too long:
      activeDeadlineSeconds: 3600

startingDeadlineSeconds is the most important and most overlooked. Without it, if the CronJob controller is down for an hour, every missed run queues up and fires simultaneously when the controller recovers. With it, stale runs are dropped.
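A quick way to audit this across a cluster — list every CronJob with no startingDeadlineSeconds set (a sketch assuming jq is available):

```shell
# CronJobs that will replay every missed run after controller downtime:
kubectl get cronjobs -A -o json | jq -r '.items[]
  | select(.spec.startingDeadlineSeconds == null)
  | "\(.metadata.namespace)/\(.metadata.name)"'
```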

Setting up heartbeat monitoring

Add a curl to the end of your container command, gated by && so it only fires on success:

# Simple — ping after success:
command: ["/bin/sh", "-c", "/app/run.sh && curl -sf https://ping.trebben.dk/p/YOUR_SLUG"]

# With timeout — don't let curl hang if the endpoint is slow:
command: ["/bin/sh", "-c", "/app/run.sh && curl -sf --max-time 10 https://ping.trebben.dk/p/YOUR_SLUG"]

# If your job uses an init container for setup:
command: ["/bin/sh", "-c", "/app/run.sh && curl -sf https://ping.trebben.dk/p/YOUR_SLUG"]
# (Only the final container command needs the ping)
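One more hardening worth considering: a transient network blip on the ping would otherwise make a successful run exit non-zero. curl's built-in retries cover that (a variation on the commands above):

# With retries — transient ping failures are retried before giving up:
command: ["/bin/sh", "-c", "/app/run.sh && curl -sf --max-time 10 --retry 3 https://ping.trebben.dk/p/YOUR_SLUG"]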

CronPulse does exactly this.

20 monitors free. No agents, no containers, no config files.
Add && curl to your Kubernetes CronJob command — get alerted within minutes when a job stops.

Start monitoring →

The -f flag makes curl return a non-zero exit code on HTTP errors, so a rejected ping surfaces as a job failure instead of being silently swallowed.
