Cron monitoring best practices.

A failed cron job that no one notices is worse than one that loudly crashes. By default, cron is silent — output is emailed to a mailbox no one reads. This guide covers five patterns for ensuring you actually know when jobs run, succeed, fail, or never fire at all.

1. Log to a file with rotation

The simplest first step is to capture stdout and stderr in a log file so you can see what your job did:

0 2 * * * /path/to/backup.sh >> /var/log/backup.log 2>&1

The 2>&1 sends stderr to the same place as stdout — critical for capturing error messages.
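
The order of the two redirections matters, and it is easy to get wrong. The following sketch shows the difference; emit is a hypothetical stand-in for a job that writes to both streams, and temporary files take the place of the real log:

```shell
#!/bin/sh
# Hypothetical stand-in for a job that writes to both streams.
emit() {
  echo "to stdout"
  echo "to stderr" >&2
}

right=$(mktemp)
wrong=$(mktemp)

# stdout is redirected to the file first, then stderr is pointed at the same
# place: both lines end up in the file.
emit >> "$right" 2>&1

# Reversed order: stderr is pointed at the original stdout (discarded here by
# the outer redirect) before stdout moves to the file, so only the stdout line
# is logged.
{ emit 2>&1 >> "$wrong"; } > /dev/null
```

Under cron, the "original stdout" in the second form is whatever cron does with job output (usually mail), so error messages would bypass the log file.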

To keep the log from growing unbounded, add log rotation. Create /etc/logrotate.d/backup:

/var/log/backup.log {
  weekly
  rotate 4
  compress
  missingok
  notifempty
}

This keeps four weeks of rotated logs, compressed.

Improvement: include a timestamp prefix

Without timestamps, you can't tell when each line was written. Pipe through ts (from moreutils) or use awk:

0 2 * * * /path/to/backup.sh 2>&1 | /usr/bin/ts >> /var/log/backup.log

Now each output line is prefixed with the local timestamp.
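
If ts is not installed, a small shell loop gives a similar result. This is a minimal sketch rather than a drop-in replacement for ts: prefix_ts is a hypothetical helper name, and the timestamp format is an arbitrary choice. It is best kept in a script instead of inline in the crontab, because cron treats an unescaped % in a command line specially.

```shell
#!/bin/sh
# prefix_ts (hypothetical helper): read lines from stdin and print each one
# prefixed with the local time at which it was read.
prefix_ts() {
  while IFS= read -r line || [ -n "$line" ]; do
    printf '%s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$line"
  done
}

# Example: /path/to/backup.sh 2>&1 | prefix_ts >> /var/log/backup.log
```

The loop runs date once per line, so it suits jobs with modest output. As with any pipeline run by a plain sh, the exit status the shell reports is that of the last stage, not of the job itself.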

2. Capture exit codes

A successful job exits 0. Anything else is an error. Cron sees the exit code, but in a default setup it does not report it anywhere you will notice, so record it yourself:

0 2 * * * /path/to/backup.sh; echo "$(date) exit=$?" >> /var/log/backup-runs.log

Or better, write a wrapper script that records start/end/exit code:

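#!/bin/bash
# Usage: run-job <job-name> <command> [args...]
JOB=$1
shift  # drop the job name so "$@" holds only the command and its arguments
START=$(date +%s)
echo "[$(date)] $JOB started" >> /var/log/jobs.log
# Run the command, appending its stdout and stderr to the shared log.
"$@" >> /var/log/jobs.log 2>&1
EXIT=$?
DURATION=$(( $(date +%s) - START ))
echo "[$(date)] $JOB finished exit=$EXIT duration=${DURATION}s" >> /var/log/jobs.log
# Pass the job's exit code through so callers still see failures.
exit $EXIT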

Then schedule each job in the crontab as /usr/local/bin/run-job backup /path/to/backup.sh. Every run of every job is now visible in /var/log/jobs.log.
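
Because the wrapper writes one "finished" line per run, failures can be pulled out of the log with grep. A minimal sketch, assuming the "finished exit=N duration=..." line format written by the wrapper; failed_runs is a hypothetical helper name:

```shell
#!/bin/sh
# failed_runs (hypothetical helper): print the "finished" lines whose exit
# code is not 0. Relies on the "finished exit=N duration=..." line format.
failed_runs() {
  grep ' finished exit=' "$1" | grep -v ' finished exit=0 '
}

# Example: failed_runs /var/log/jobs.log
```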

3. Heartbeat / dead-man-switch monitoring

The trick: cron's silence is the bug. A job that doesn't fire produces no logs, no errors, no alerts. You need an external watcher that expects regular check-ins.

How it works

A heartbeat URL is provided by a third-party monitoring service. Your cron job hits it after each successful run:

0 2 * * * /path/to/backup.sh && curl -fsS --max-time 10 https://your-monitoring-service.example/ping/your-id

The && means the ping only fires if the job exited successfully. The service watches for these pings; if one is missing, it alerts you.
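
The same one-liner can be written as a small wrapper, which makes it easier to add retries and to keep a failed ping from masking the job's own result. The sketch below is one possible shape, not any particular service's client: run_and_ping is a hypothetical name, and the URL is whatever check URL your service gives you.

```shell
#!/bin/sh
# run_and_ping URL COMMAND [ARGS...] (hypothetical helper)
# Runs the command and pings the heartbeat URL only if the command succeeded.
run_and_ping() {
  url=$1
  shift
  # On failure, return the job's exit code and skip the ping.
  "$@" || return $?
  # A failed ping is reported on stderr but does not change the job's result.
  curl -fsS --max-time 10 --retry 3 -o /dev/null "$url" \
    || echo "heartbeat ping failed for $url" >&2
  return 0
}

# To use this as a script, end the file with: run_and_ping "$@"
```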

What to look for in a service

Several heartbeat-monitoring services exist — both commercial SaaS and self-hostable open-source options. When choosing one, evaluate against:

  • Expected schedule: Can you describe your job's cadence (cron expression or interval) so the service knows when to expect pings?
  • Grace period: How long after a missed ping does it alert? You want this configurable per check.
  • Alert channels: Email at minimum; ideally also Slack, PagerDuty, Discord, or webhook for integration with your existing on-call setup.
  • Free tier: Most offer one. Confirm it covers the number of jobs you'll be monitoring.
  • Self-hostable: If you're privacy-sensitive or already run your own infra, some popular options have open-source versions you can deploy yourself.
  • Pricing model: Per-check vs flat-rate vs usage-based — pick what matches your scale.

If you'd rather not depend on a third party, you can roll your own: any webhook endpoint that records each ping, plus a separate cron job that checks "did we get a ping in the last N minutes?" and alerts if not. It is less polished, but you keep full control.
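
The watcher half of such a setup can be sketched in a few lines. To stay self-contained, this example assumes each check-in is recorded by updating the modification time of a stamp file (for instance, the webhook handler or the job itself runs touch on it). The name heartbeat_is_stale and the paths in the comments are hypothetical.

```shell
#!/bin/sh
# heartbeat_is_stale STAMP_FILE MAX_MINUTES (hypothetical helper)
# Succeeds (exit 0) when the stamp file is missing or was last modified more
# than MAX_MINUTES minutes ago, i.e. when an alert should be raised.
heartbeat_is_stale() {
  stamp=$1
  max_minutes=$2
  if [ ! -e "$stamp" ]; then
    return 0
  fi
  # find prints the path only if its mtime is older than the threshold.
  [ -n "$(find "$stamp" -mmin "+$max_minutes")" ]
}

# Example watcher command for a daily job (paths are placeholders):
#   heartbeat_is_stale /var/run/backup.stamp 1500 && echo "backup heartbeat missing"
```

The watcher only helps if it runs somewhere that survives the failure it is watching for. On the same host, a stopped cron daemon silences both the job and the watcher.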

Why this matters more than file logs

File logs tell you what happened. Heartbeats tell you what didn't happen. If your nightly backup is supposed to run at 2 AM and the heartbeat service doesn't see a ping by 2:30 AM, it alerts you. With file logs alone, you'd find out the next time you needed the backup — possibly weeks later.

4. Alerting on failure

For jobs you care about deeply, you want immediate notification on failure, not just the absence of a heartbeat.

Email on failure

Set MAILTO in your crontab and ensure the host can actually send mail:

MAILTO="oncall@example.com"
0 2 * * * /path/to/backup.sh

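Cron emails any output (stdout or stderr) to this address. A successful, silent job sends nothing, and a job that fails noisily sends the error. A job that fails without printing anything also sends nothing, so MAILTO alone does not catch every failure.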

Note: if your job is expected to produce stdout (a report, for example), MAILTO floods the inbox with successful runs. Either suppress stdout (> /dev/null, keep stderr) or use a wrapper that only emails on failure.
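
One way to build such a wrapper is to buffer the job's output and release it only when the job fails, so cron has something to mail only in that case. The sketch below assumes nothing beyond a POSIX shell and mktemp; quiet_unless_failed is a hypothetical name (the chronic tool from moreutils follows the same idea).

```shell
#!/bin/sh
# quiet_unless_failed COMMAND [ARGS...] (hypothetical helper)
# Runs the command, buffers its output, and prints that output only if the
# command exits non-zero. The command's exit code is passed through.
quiet_unless_failed() {
  buffer=$(mktemp) || return 1
  "$@" > "$buffer" 2>&1
  status=$?
  if [ "$status" -ne 0 ]; then
    echo "Command failed with exit $status: $*"
    cat "$buffer"
  fi
  rm -f "$buffer"
  return "$status"
}

# To use this as a script, end the file with: quiet_unless_failed "$@"
```

Because the failure branch always prints at least one line, a job that fails silently still produces output for cron to mail.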

Slack/Discord webhook on failure

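#!/bin/bash
# Usage: notify-on-failure <command> [args...]
"$@"
EXIT=$?
if [ $EXIT -ne 0 ]; then
  # $1 is the command that was run; this assumes it contains no quotes or backslashes.
  # The "text" field is Slack's payload shape; Discord webhooks expect "content" instead.
  curl -fsS --max-time 10 -X POST -H 'Content-Type: application/json' \
    --data "{\"text\": \"❌ Cron job $1 failed with exit $EXIT\"}" \
    https://hooks.slack.com/services/YOUR/WEBHOOK
fi
# Pass the job's exit code through so other tooling still sees the failure.
exit $EXIT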

5. Full observability

For production-critical jobs, treat them like services: emit structured logs, metrics, and, where it is warranted, traces.

Structured JSON logs

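#!/bin/bash
# Usage: run-json <command> [args...]; the job's own output passes through unchanged.
START=$(date -Iseconds)
"$@" 2>&1
EXIT=$?
END=$(date -Iseconds)
# -c writes one compact JSON object per line, which log shippers can parse line by line.
jq -nc --arg job "$1" --arg start "$START" --arg end "$END" --argjson exit $EXIT \
  '{job: $job, start: $start, end: $end, exit_code: $exit, ok: ($exit == 0)}' \
  >> /var/log/jobs.json
exit $EXIT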

Now you can ship jobs.json to your log aggregator (Loki, ELK, Datadog, CloudWatch) and build dashboards: "all jobs in the last 24h", "jobs by failure rate", etc.
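
Before wiring up an aggregator, the same file can be queried directly with jq. A minimal sketch, assuming the field names written by the wrapper above; failed_jobs is a hypothetical helper name:

```shell
#!/bin/sh
# failed_jobs (hypothetical helper): print start time, job, and exit code
# (tab-separated) for every record whose "ok" field is false.
failed_jobs() {
  jq -r 'select(.ok == false) | [.start, .job, (.exit_code | tostring)] | @tsv' "$1"
}

# Example: failed_jobs /var/log/jobs.json
```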

Metrics

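If you run a Prometheus Pushgateway, push the values from each run to it. The Pushgateway stores the last value pushed for each metric and does not add to the previous one, so push gauges such as duration and exit code instead of trying to increment a counter. It also records push_time_seconds for each group, which tells you when the job last reported. Send the metrics in one request and use --data-binary, because the text format requires every line to end with a newline:

cat <<EOF | curl -fsS --data-binary @- http://pushgateway:9091/metrics/job/backup
backup_duration_seconds $DURATION
backup_exit_code $EXIT
EOF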

From Prometheus you can alert on "backup hasn't run in 26 hours" or "backup duration grew by 50%."

Tracing

For complex jobs, instrument with OpenTelemetry so you get spans for each phase of the job. This is overkill for most cron use cases but powerful for ETL pipelines that pull from multiple sources.

A pragmatic checklist

For any new cron job, ask yourself:

  • ☐ Is stdout/stderr captured to a log file?
  • ☐ Is the log rotated?
  • ☐ Is there a heartbeat ping that confirms the job ran?
  • ☐ Is the heartbeat configured with an alert recipient?
  • ☐ Are failures (non-zero exit) loudly alerted, not just logged?
  • ☐ Can someone debug a past failure without root access? (i.e., are logs readable by them)

If you tick all six, you have a production-grade cron job. Most teams stop after the first one and get bitten when a critical job silently breaks.
