Make every job idempotent
An idempotent job produces the same result whether it runs once or five times. This is the single most important property of any scheduled job — without it, every problem (DST duplication, retry storms, manual reruns) becomes a data-corruption incident.
Idempotent patterns
- Database upserts: INSERT … ON CONFLICT DO NOTHING or ON DUPLICATE KEY UPDATE
- File-based sentinels: at the end of the job, write a "done" marker; check for it at the start
- Versioned outputs: write to output-2024-01-15-T02-30.json, never overwrite
- Atomic moves: write to file.tmp, rename to file only on success, so readers never see partial output
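Here is a minimal shell sketch that combines the sentinel and atomic-move patterns. The function name run_once, the OUT_DIR variable, and the generate_report command in the usage note are hypothetical stand-ins, so treat this as a starting point rather than a finished tool.

```shell
# Hypothetical output directory; override via the environment.
OUT_DIR="${OUT_DIR:-/var/tmp/myjob}"

# run_once KEY CMD [ARGS...]
# Runs CMD and stores its stdout in $OUT_DIR/KEY.json, at most once per KEY.
run_once() {
  key="$1"; shift
  out="$OUT_DIR/$key.json"
  done_marker="$OUT_DIR/$key.done"

  mkdir -p "$OUT_DIR" || return 1

  # Sentinel check: a previous run already finished this key.
  [ -e "$done_marker" ] && return 0

  # Write to a temp file in the same directory, so the final rename stays
  # on one filesystem and readers never see a half-written file.
  tmp="$out.tmp.$$"
  if "$@" > "$tmp"; then
    mv -f "$tmp" "$out" && : > "$done_marker"
  else
    rc=$?
    rm -f "$tmp"
    return "$rc"
  fi
}
```

A job script might call it as run_once "output-$(date +%F)" generate_report. Running that line twice on the same day leaves the first output untouched.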
Non-idempotent patterns to avoid
- "Send 100 emails to people who haven't received today's report" → if you run it twice, you send 200 emails. Fix: mark each person as sent, only select unsent.
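- "Increment a counter by 1" → if the job is interrupted and rerun, you can't tell whether the increment already happened, so the counter ends up wrong. Fix: store the resulting state (for example, the ID of the last run that was applied), not a delta.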
- "Append a line to a log file" → for a single instance this is fine, but if multiple instances run concurrently their output can interleave and corrupt the file. Fix: log per-PID, or use a write-once-per-key scheme.
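The "mark each person as sent, only select unsent" fix can be sketched in shell with a per-day ledger file. The names send_once, LEDGER_DIR, and the send_report command in the usage note are assumptions made for illustration; a real system would more likely keep this state in a database.

```shell
# Hypothetical directory holding one ledger file per day.
LEDGER_DIR="${LEDGER_DIR:-/var/tmp/report-ledger}"

# send_once RECIPIENT CMD [ARGS...]
# Runs "CMD ARGS... RECIPIENT" unless today's ledger already lists RECIPIENT.
send_once() {
  recipient="$1"; shift
  ledger="$LEDGER_DIR/sent-$(date +%F).txt"
  mkdir -p "$LEDGER_DIR" && touch "$ledger" || return 1

  # Already sent today: skip. -x matches the whole line, -F disables regex.
  if grep -qxF -- "$recipient" "$ledger"; then
    return 0
  fi

  # Record the recipient only after the send succeeds, so a failed send
  # is attempted again on the next run.
  "$@" "$recipient" && printf '%s\n' "$recipient" >> "$ledger"
}
```

For example, send_once alice@example.com send_report can be run any number of times per day and sends at most once. A crash between the send and the ledger write can still produce one duplicate, so this narrows the problem rather than eliminating it.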
Prevent overlapping runs
A long-running job that takes 65 minutes, scheduled hourly, will eventually have two instances running at once. They'll race, corrupt each other's output, or both fail because of database locks.
flock — the simplest solution
0 * * * * /usr/bin/flock -n /tmp/myjob.lock /usr/local/bin/myjob.sh
flock -n exits immediately if the lock is held — so if a previous instance is still running, the new run silently exits. This is usually what you want for periodic jobs.
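If you would rather keep the lock inside the script than in the crontab line, flock also accepts an open file descriptor. The sketch below assumes util-linux flock is installed; the function name try_lock and the choice of descriptor 9 are arbitrary.

```shell
# try_lock FILE
# Takes an exclusive, non-blocking lock on FILE using file descriptor 9.
# Returns 0 if the lock was acquired, non-zero if another process holds it.
# The lock is released automatically when this shell exits.
try_lock() {
  exec 9>"$1" || return 1
  flock -n 9
}

# Usage at the top of a job script:
#   try_lock /tmp/myjob.lock || exit 0
```

Because the lock is tied to the open descriptor, there is no stale lock file to clean up after a crash.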
Process check
# $$ is this script's own PID; ignore it, or the check always matches itself
pgrep -f myjob.sh | grep -qvx "$$" && exit 0
# ... rest of script ...
Database-backed lock (for distributed jobs)
If multiple servers might run the same job, file-based locks aren't enough. Use a centralized lock — Redis SETNX, PostgreSQL advisory locks, or a "scheduler_lock" row.
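As a rough sketch of the Redis variant, modern Redis expresses SETNX-with-expiry as SET with the NX and EX options. The function name acquire_lock, the REDIS_CLI variable, and the key name in the usage note are assumptions; the check relies on redis-cli printing OK when the key was created and nothing when it already existed.

```shell
# Command used to talk to Redis; override to add -h/-p options or for testing.
REDIS_CLI="${REDIS_CLI:-redis-cli}"

# acquire_lock KEY TTL_SECONDS
# Returns 0 only if this caller created KEY. NX means "set only if the key
# does not exist"; EX gives the lock an expiry so a crashed holder cannot
# block the job forever. The value records who holds the lock.
acquire_lock() {
  reply="$($REDIS_CLI SET "$1" "$(uname -n):$$" NX EX "$2")" || return 2
  [ "$reply" = "OK" ]
}

# Usage at the top of a job script:
#   acquire_lock "lock:nightly-etl" 3600 || exit 0
```

Pick a TTL comfortably longer than the job's worst-case run time. Letting the key expire is the simplest release strategy; deleting it early is only safe if you first confirm the stored value is your own.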
Add jitter to avoid stampedes
If 100 jobs all scheduled at 0 * * * * fire simultaneously, they may overwhelm shared resources (databases, APIs, the load balancer). Jenkins solves this with its H (hash) operator; for plain cron, you have two options.
Schedule at off-peak minutes
Instead of every job firing at :00, distribute them:
3 * * * * /path/to/job-a.sh   # :03 past every hour
7 * * * * /path/to/job-b.sh   # :07
13 * * * * /path/to/job-c.sh  # :13
Add sleep at the top of the script
sleep $((RANDOM % 60))  # Random delay 0-59 seconds
# ... real work ...
Each invocation starts at a different time within the minute. Useful when you have 1,000 servers all running the same cron at the same time and you want to spread their load on a shared backend.
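If you want the delay to be stable per machine instead of random per run (similar in spirit to the Jenkins H operator, which hashes the job name), you can derive it from the hostname. This is a sketch; host_offset is a hypothetical helper, and cksum is used only as a convenient, widely available hash.

```shell
# host_offset MAX [NAME]
# Prints a stable number in 0..MAX-1 derived from NAME (default: this
# host's name), so each machine always picks the same delay.
host_offset() {
  name="${2:-$(uname -n)}"
  sum="$(printf '%s' "$name" | cksum | cut -d ' ' -f 1)"
  echo $(( sum % $1 ))
}

# Usage at the top of a job script:
#   sleep "$(host_offset 60)"
```

A stable offset makes run times predictable when you read logs, at the cost of two hosts occasionally hashing to the same second.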
Pick the right time window
Off-hours batch jobs
For nightly backups, ETL runs, and other batch work, schedule between 2 AM and 4 AM local time. This avoids:
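- The DST transition window, which in US time zones covers 1 AM to 3 AM local time on changeover days (01:00-01:59 repeats in the fall, 02:00-02:59 is skipped in the spring), so prefer the later part of this range or run in UTC (see DST guide)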
- The midnight/end-of-day spike when other jobs cluster at 00:00
- Business hours when users would notice slowness
Don't use round numbers
Everyone schedules at 0 0 * * * or 0 * * * *. The top of every hour is the busiest minute on the internet. Schedule at :03, :13, :23, etc.: the difference in run time is invisible but the load-spreading is real.
Consider downstream load
If your job calls a third-party API, check their rate limits and recommended off-peak hours. GitHub Actions cron is famously delayed at 00:00 UTC because of how many people pick midnight.
Retry strategy
Not every failure should retry. The right strategy depends on the type of failure:
| Failure type | Strategy |
|---|---|
| Transient (network, 5xx, timeout) | Retry with exponential backoff |
| Rate-limited (429) | Retry with backoff that respects the Retry-After header |
| Authentication (401, 403) | Do not retry — alert immediately, credentials need fixing |
| Bad input (400, ValidationError) | Do not retry — the input won't change |
| Unknown | Retry 2-3 times, then alert |
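The retry and do-not-retry rows of the table can be sketched as a small shell wrapper. The function name retry_backoff is made up, and treating exit status 2 as "permanent failure" is purely a convention of this sketch: your job has to map its own failure types onto exit codes. Honoring a Retry-After header is not covered here.

```shell
# retry_backoff MAX_ATTEMPTS BASE_DELAY_SECONDS CMD [ARGS...]
# Re-runs CMD until it succeeds or MAX_ATTEMPTS is reached, doubling the
# wait after each failure. Exit status 2 from CMD means "permanent, do not
# retry" (an assumption of this sketch, not a standard).
retry_backoff() {
  max="$1"; delay="$2"; shift 2
  attempt=1
  while :; do
    "$@" && return 0
    rc=$?
    [ "$rc" -eq 2 ] && return "$rc"            # permanent failure
    [ "$attempt" -ge "$max" ] && return "$rc"  # out of attempts
    sleep "$delay"
    delay=$(( delay * 2 ))
    attempt=$(( attempt + 1 ))
  done
}

# Usage: up to 4 attempts, waiting 5s, 10s, then 20s between them.
#   retry_backoff 4 5 /path/to/script.sh
```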
Idempotency makes retries safe
If you've followed the first principle, retries are free. If you haven't, retries make corruption worse.
Cap retry duration
If a job is supposed to take 10 minutes and is now 6 hours in, it's not retrying — it's stuck. Use timeout as a safety belt:
0 2 * * * timeout 30m /path/to/script.sh || alert-on-failure
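To tell a timeout apart from an ordinary failure, you can check the exit status: GNU coreutils timeout exits with 124 when it had to stop the command. The wrapper name run_with_timeout is hypothetical, and the sketch assumes GNU timeout is the one on your PATH.

```shell
# run_with_timeout LIMIT CMD [ARGS...]
# Runs CMD under timeout and reports on stderr whether a failure was a
# timeout (status 124) or came from CMD itself. Returns CMD's status.
run_with_timeout() {
  limit="$1"; shift
  timeout "$limit" "$@" && return 0
  rc=$?
  if [ "$rc" -eq 124 ]; then
    echo "timed out after $limit: $*" >&2
  else
    echo "failed with status $rc: $*" >&2
  fi
  return "$rc"
}

# Usage:
#   run_with_timeout 30m /path/to/script.sh
```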
Bake in observability
(See our monitoring guide for depth.)
Minimum viable observability for any production cron job:
- Stdout/stderr to a rotated log file
- Exit code recorded
- Heartbeat ping on success
- Loud alert on failure
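The four items above fit in one small wrapper. Everything named here is an assumption for illustration: the log path, the heartbeat URL (heartbeat.example.com is a placeholder), and the use of logger as the alert channel. Log rotation is left to whatever you already use, such as logrotate.

```shell
# Hypothetical settings; adjust for your environment.
LOG_FILE="${LOG_FILE:-/var/log/myjob.log}"
HEARTBEAT_CMD="${HEARTBEAT_CMD:-curl -fsS -m 10 https://heartbeat.example.com/ping/myjob}"
ALERT_CMD="${ALERT_CMD:-logger -p user.err -t myjob}"

# run_monitored CMD [ARGS...]
# Appends CMD's stdout/stderr and exit code to LOG_FILE, pings the
# heartbeat on success, and raises an alert on failure.
run_monitored() {
  "$@" >> "$LOG_FILE" 2>&1 && rc=0 || rc=$?
  echo "$(date '+%Y-%m-%dT%H:%M:%S%z') exit=$rc cmd=$*" >> "$LOG_FILE"
  if [ "$rc" -eq 0 ]; then
    # Ping only on success; a missing ping is what the monitor alerts on.
    $HEARTBEAT_CMD > /dev/null 2>&1 || echo "heartbeat ping failed" >> "$LOG_FILE"
  else
    $ALERT_CMD "job failed with exit $rc: $*"
  fi
  return "$rc"
}

# Usage:
#   run_monitored /path/to/script.sh
```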
Environment hygiene
1. Set PATH explicitly
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
0 9 * * * /path/to/script.sh
2. Set the working directory
Don't depend on relative paths. Either cd at the top of the script or use absolute paths.
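One common way to do the cd is to resolve the script's own directory from $0. This sketch assumes the script is executed directly (not sourced) and is not reached through a symlink; SCRIPT_DIR is just a variable name chosen here.

```shell
# Resolve the directory containing this script and make it the working
# directory, so relative paths behave the same under cron and by hand.
SCRIPT_DIR="$(cd -- "$(dirname -- "$0")" && pwd)" || exit 1
cd -- "$SCRIPT_DIR" || exit 1
```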
3. Pin the shell
SHELL=/bin/bash
0 9 * * * /path/to/script.sh
4. Document the schedule in the script itself
#!/bin/bash
# Schedule: 0 2 * * * (every day at 2 AM UTC)
# Owner: data-platform
# Runbook: https://wiki/runbooks/nightly-backup
When someone finds this script 18 months later, they should be able to figure out everything they need to know without git-log archaeology.
The 10-second summary
- Idempotent: safe to run twice
- Locked: won't run twice concurrently
- Off-peak: scheduled outside DST windows and round numbers
- Monitored: stdout to log, heartbeat ping on success, alert on failure
- Documented: the script itself explains what, who, and how
Do these five things on every job. Almost every "cron disaster" you'll read about traces back to a job that skipped one or more of them.