1. Log to a file with rotation
The simplest first step: capture stdout and stderr to a log file so you can see what your job did:
0 2 * * * /path/to/backup.sh >> /var/log/backup.log 2>&1
The 2>&1 sends stderr to the same place as stdout — critical for capturing error messages.
To keep the log from growing unbounded, add log rotation. Create /etc/logrotate.d/backup:
/var/log/backup.log {
weekly
rotate 4
compress
missingok
notifempty
}
This keeps four weeks of rotated logs, compressed.
Improvement: include a timestamp prefix
Without timestamps, you can't tell when each line was written. Pipe through ts (from moreutils) or use awk:
0 2 * * * /path/to/backup.sh 2>&1 | /usr/bin/ts >> /var/log/backup.log
Now each output line is prefixed with the local timestamp.
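Where ts is not installed, a small shell filter can play the same role. The following is a minimal sketch rather than a drop-in replacement: the function name prefix_ts is made up for illustration, and the timestamp format is just one reasonable choice.

```shell
#!/bin/sh
# prefix_ts: hypothetical stand-in for ts(1). Reads stdin line by line and
# writes each line back out prefixed with the current local time.
prefix_ts() {
    # The "|| [ -n "$line" ]" part keeps a final line that has no trailing newline.
    while IFS= read -r line || [ -n "$line" ]; do
        printf '%s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$line"
    done
}
```

Saved as an executable script that ends by calling prefix_ts, it can take the place of /usr/bin/ts in the crontab line. It forks date once per line, which is fine for a nightly job but slow for very chatty output.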
2. Capture exit codes
A successful job exits 0. Anything else is an error. Cron captures the exit code internally but doesn't tell you unless you ask. Wrap your job:
0 2 * * * /path/to/backup.sh; echo "$(date) exit=$?" >> /var/log/backup-runs.log
Or better, write a wrapper script that records start/end/exit code:
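#!/bin/bash
# Usage: run-job <job-name> <command> [args...]
JOB=$1
shift  # drop the job name so "$@" holds only the command and its arguments
START=$(date +%s)
echo "[$(date)] $JOB started" >> /var/log/jobs.log
# Run the command, appending its stdout and stderr to the shared log
"$@" >> /var/log/jobs.log 2>&1
EXIT=$?
DURATION=$(( $(date +%s) - START ))
echo "[$(date)] $JOB finished exit=$EXIT duration=${DURATION}s" >> /var/log/jobs.log
# Propagate the job's exit code so callers still see the failure
exit $EXIT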
Then schedule each job through the wrapper, for example /usr/local/bin/run-job backup /path/to/backup.sh. Every job's run is now visible in /var/log/jobs.log.
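Once runs are recorded in one file, pulling out the failures is a short filter. This sketch assumes the "finished exit=N duration=Ns" line format written by the wrapper; the function name failed_runs is hypothetical.

```shell
#!/bin/sh
# failed_runs: hypothetical helper that prints the "finished" lines whose exit
# code is not 0. $1 is the log path, defaulting to the wrapper's log file.
failed_runs() {
    # First grep keeps only the wrapper's summary lines (job output is
    # interleaved in the same file); second grep drops the successful ones.
    grep -E ' finished exit=[0-9]+ duration=' "${1:-/var/log/jobs.log}" \
        | grep -v ' exit=0 '
}
```

Note that the function itself returns non-zero when there are no failures to print, because that is how grep reports "no matching lines".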
3. Heartbeat / dead-man-switch monitoring
The trick: cron's silence is the bug. A job that doesn't fire produces no logs, no errors, no alerts. You need an external watcher that expects regular check-ins.
How it works
A heartbeat URL is provided by a third-party monitoring service. Your cron job hits it after each successful run:
0 2 * * * /path/to/backup.sh && curl -fsS --max-time 10 https://your-monitoring-service.example/ping/your-id
The && means the ping only fires if the job exited successfully. The service watches for these pings; if one is missing, it alerts you.
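If the one-liner grows, the same logic can live in a small wrapper that also keeps the job's own exit code when the ping itself fails. This is a sketch under assumptions: the function name run_and_ping is invented here, and the URL is whatever your monitoring service gives you.

```shell
#!/bin/sh
# run_and_ping: hypothetical wrapper. $1 is the heartbeat URL, the remaining
# arguments are the command to run. The ping fires only on exit code 0.
run_and_ping() {
    url=$1
    shift
    "$@"
    status=$?
    if [ "$status" -eq 0 ]; then
        # --retry re-attempts transient errors; a failed ping is reported on
        # stderr but does not change the job's exit code.
        curl -fsS --max-time 10 --retry 3 -o /dev/null "$url" \
            || echo "heartbeat ping to $url failed" >&2
    fi
    return "$status"
}
```

Called as run_and_ping https://your-monitoring-service.example/ping/your-id /path/to/backup.sh, it behaves like the && form but returns the backup's exit code even if the network is down.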
What to look for in a service
Several heartbeat-monitoring services exist — both commercial SaaS and self-hostable open-source options. When choosing one, evaluate against:
- Expected schedule: Can you describe your job's cadence (cron expression or interval) so the service knows when to expect pings?
- Grace period: How long after a missed ping does it alert? You want this configurable per check.
- Alert channels: Email at minimum; ideally also Slack, PagerDuty, Discord, or webhook for integration with your existing on-call setup.
- Free tier: Most offer one. Confirm it covers the number of jobs you'll be monitoring.
- Self-hostable: If you're privacy-sensitive or already run your own infra, some popular options have open-source versions you can deploy yourself.
- Pricing model: Per-check vs flat-rate vs usage-based — pick what matches your scale.
If you'd rather not depend on a third party, you can self-roll: any webhook endpoint plus a separate cron job that checks "did we get a ping in the last N minutes?" and alerts if not. Less polished but full control.
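The self-rolled watcher can be as small as a file-age check. The sketch below assumes your webhook endpoint touches a file on every ping (the path /var/run/heartbeats/backup in the usage note is hypothetical) and uses find's -mmin test to ask whether that file is older than N minutes.

```shell
#!/bin/sh
# heartbeat_is_stale: hypothetical helper. Succeeds (exit 0) when the
# heartbeat file $1 is missing or was last modified more than $2 minutes ago.
heartbeat_is_stale() {
    file=$1
    max_minutes=$2
    # No file means no ping was ever recorded: treat that as stale.
    [ -e "$file" ] || return 0
    # find prints the path only if its mtime is more than max_minutes old.
    [ -n "$(find "$file" -mmin +"$max_minutes")" ]
}
```

A watcher script run from a separate cron entry could then do heartbeat_is_stale /var/run/heartbeats/backup 90 && echo "backup heartbeat missing", sending the message through whatever alert channel you already have. Ideally the watcher runs on a different host than the job, so one machine going down does not silence both.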
Why this matters more than file logs
File logs tell you what happened. Heartbeats tell you what didn't happen. If your nightly backup is supposed to run at 2 AM and the heartbeat service doesn't see a ping by 2:30 AM, it alerts you. With file logs alone, you'd find out the next time you needed the backup — possibly weeks later.
4. Alerting on failure
For jobs you care about deeply, you want immediate notification on failure, not just the absence of a heartbeat.
Email on failure
Set MAILTO in your crontab and ensure the host can actually send mail:
MAILTO="oncall@example.com"
0 2 * * * /path/to/backup.sh
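Cron emails any output (stdout or stderr) to this address. So a successful, silent job sends nothing; a job that fails noisily sends the error. A job that fails without printing anything also sends nothing, so MAILTO on its own is not a reliable failure detector.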
Note: if your job is expected to produce stdout (a report, for example), MAILTO floods the inbox with successful runs. Either suppress stdout (> /dev/null, keep stderr) or use a wrapper that only emails on failure.
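A wrapper that only emails on failure can simply hold the job's output back and print it when the exit code is non-zero, since cron mails whatever the wrapper prints. The chronic tool from moreutils works along these lines; the sketch below is a hand-rolled equivalent with a made-up function name, quiet_unless_failed.

```shell
#!/bin/sh
# quiet_unless_failed: hypothetical wrapper. Runs the given command with its
# stdout and stderr captured, and prints them only if the command fails.
quiet_unless_failed() {
    out=$(mktemp) || return 1
    "$@" >"$out" 2>&1
    status=$?
    if [ "$status" -ne 0 ]; then
        echo "command failed with exit $status: $*"
        cat "$out"
    fi
    rm -f "$out"
    return "$status"
}
```

With MAILTO set, a crontab entry that runs the job through this wrapper produces mail only for failed runs. The trade-off is that output from successful runs is discarded, so a job whose stdout is itself the report should write that report to a file instead.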
Slack/Discord webhook on failure
#!/bin/bash
"$@"
EXIT=$?
if [ $EXIT -ne 0 ]; then
curl -X POST -H 'Content-Type: application/json' \
--data "{\"text\": \"❌ Cron job $1 failed with exit $EXIT\"}" \
https://hooks.slack.com/services/YOUR/WEBHOOK
fi
exit $EXIT
5. Full observability
For production-critical jobs, treat them like services and emit structured logs:
Structured JSON logs
#!/bin/bash
START=$(date -Iseconds)
"$@" 2>&1
EXIT=$?
END=$(date -Iseconds)
jq -n --arg job "$1" --arg start "$START" --arg end "$END" --argjson exit $EXIT \
'{job: $job, start: $start, end: $end, exit_code: $exit, ok: ($exit == 0)}' \
>> /var/log/jobs.json
exit $EXIT
Now you can ship jobs.json to your log aggregator (Loki, ELK, Datadog, CloudWatch) and build dashboards: "all jobs in the last 24h", "jobs by failure rate", etc.
Metrics
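If you run a Prometheus Pushgateway, push each run's metrics to it. The Pushgateway keeps the last value pushed for each metric rather than adding pushes together, and every line of the request body must end with a newline, so pipe the values in with --data-binary:
printf 'backup_runs_total 1\n' | curl -fsS --data-binary @- 'http://pushgateway:9091/metrics/job/backup'
printf 'backup_duration_seconds %s\n' "$DURATION" | curl -fsS --data-binary @- 'http://pushgateway:9091/metrics/job/backup'
printf 'backup_exit_code %s\n' "$EXIT" | curl -fsS --data-binary @- 'http://pushgateway:9091/metrics/job/backup'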
From Prometheus you can alert on "backup hasn't run in 26 hours" or "backup duration grew by 50%."
Tracing
For complex jobs, instrument with OpenTelemetry so you get spans for each phase of the job. This is overkill for most cron use cases but powerful for ETL pipelines that pull from multiple sources.
A pragmatic checklist
For any new cron job, ask yourself:
- ☐ Is stdout/stderr captured to a log file?
- ☐ Is the log rotated?
- ☐ Is there a heartbeat ping that confirms the job ran?
- ☐ Is the heartbeat configured with an alert recipient?
- ☐ Are failures (non-zero exit) loudly alerted, not just logged?
- ☐ Can someone debug a past failure without root access? (i.e., are logs readable by them)
If you tick all six, you have a production-grade cron job. Most teams stop after the first one and get bitten when a critical job silently breaks.