Cron Scheduling Best Practices

In this guide

Make every job idempotent
Prevent overlapping runs
Add jitter to avoid stampedes
Pick the right time window
Retry strategy
Bake in observability
Environment hygiene

Make every job idempotent

An idempotent job produces the same result whether it runs once or five times. This is the single most important property of any scheduled job — without it, every problem (DST duplication, retry storms, manual reruns) becomes a data-corruption incident.

Idempotent patterns

Database upserts: INSERT … ON CONFLICT DO NOTHING (PostgreSQL, SQLite) or INSERT … ON DUPLICATE KEY UPDATE (MySQL)
File-based sentinels: at the end of the job, write a "done" marker; at the start, check for it and exit if it exists
Versioned outputs: write to output-2024-01-15-T02-30.json, never overwrite
Atomic moves: write to file.tmp, then rename it to file only on success, so readers never see partial output (a rename is atomic only within the same filesystem)
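
The sentinel and atomic-move patterns combine naturally. Below is a minimal bash sketch, assuming a hypothetical output directory (OUT_DIR), a run identifier (RUN_ID, here one per UTC day), and a placeholder produce_report function that stands in for the real work:

```bash
# Hypothetical defaults; override OUT_DIR and RUN_ID from the environment.
OUT_DIR="${OUT_DIR:-/var/lib/myjob}"
RUN_ID="${RUN_ID:-$(date -u +%Y-%m-%d)}"   # one logical run per UTC day

# Placeholder for the job's real work.
produce_report() { echo "report for $RUN_ID"; }

run_once() {
  local sentinel="$OUT_DIR/.done-$RUN_ID" tmp
  # Sentinel check: this run already completed, so rerunning is a no-op.
  [ -e "$sentinel" ] && return 0
  mkdir -p "$OUT_DIR" || return 1
  # Temp file in the same directory, so the mv below is a same-filesystem rename.
  tmp="$(mktemp "$OUT_DIR/report-$RUN_ID.XXXXXX")" || return 1
  produce_report >"$tmp" || { rm -f "$tmp"; return 1; }
  # Atomic replace: readers see the old file or the complete new one, never a partial write.
  mv -f "$tmp" "$OUT_DIR/report-$RUN_ID.txt" || return 1
  # Written last, only after the output is safely in place.
  touch "$sentinel"
}
```

Calling run_once a second time for the same RUN_ID changes nothing, which is exactly the property a retry or a DST-duplicated run needs.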

Non-idempotent patterns to avoid

"Send 100 emails to people who haven't received today's report" → if you run it twice, you send 200 emails. Fix: mark each person as sent, and select only the unsent ones.
"Increment a counter by 1" → if the job is interrupted and rerun, or simply runs twice, the counter is incremented more than once. Fix: record state (for example, the absolute value or the IDs of runs already counted), not a delta.
"Append a line to a log file" → fine for a single instance, but concurrent instances can interleave and garble each other's lines. Fix: log per-PID, or use a write-once-per-key scheme.

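The "mark each person as sent" fix can be sketched in bash with a plain-text state file. Here send_report is a hypothetical stand-in for the real mailer, and both files are assumed to hold one address per line:

```bash
# Hypothetical mailer; replace with the real send command.
send_report() { echo "sending to $1"; }

send_pending() {
  local recipients_file="$1" sent_log="$2" person
  touch "$sent_log" || return 1
  while IFS= read -r person; do
    [ -n "$person" ] || continue
    # Already marked as sent for this report: skip, so reruns don't resend.
    grep -Fxq -- "$person" "$sent_log" && continue
    # Record the address only after a successful send.
    if send_report "$person"; then
      printf '%s\n' "$person" >>"$sent_log"
    fi
  done <"$recipients_file"
}
```

Because each address is recorded right after it is sent, a rerun only picks up people who were missed. A crash between sending and recording can still produce one duplicate; a database table with a unique (recipient, report date) key is the sturdier version of the same idea.
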
Prevent overlapping runs

An hourly job that occasionally takes 65 minutes will have two instances running at once every time it overruns. They'll race, corrupt each other's output, or both fail because of database locks.

flock — the simplest solution

0 * * * * /usr/bin/flock -n /tmp/myjob.lock /usr/local/bin/myjob.sh

flock -n exits immediately if the lock is held — so if a previous instance is still running, the new run silently exits. This is usually what you want for periodic jobs.
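
If you'd rather keep the lock inside the script than in the crontab line, flock(1) also accepts an open file descriptor. A minimal bash sketch, assuming fd 9 is otherwise unused (the number is arbitrary) and reusing the /tmp/myjob.lock path from above; acquire_lock is a hypothetical helper name:

```bash
# Open the lock file on file descriptor 9 and try to lock it without waiting.
# The lock stays held until fd 9 is closed, i.e. when the script and any
# children that inherited fd 9 have exited.
acquire_lock() {
  local lockfile="$1"
  exec 9>"$lockfile" || return 1   # open (creating if needed) the lock file on fd 9
  flock -n 9                        # non-blocking: fail at once if another process holds it
}

# Usage at the top of myjob.sh: exit quietly if a previous run is still going.
# acquire_lock /tmp/myjob.lock || exit 0
```

The kernel releases the lock when the descriptor closes, so a crashed run never leaves a stale lock behind. One caveat: a background child that outlives the script keeps fd 9, and with it the lock.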

Process check

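# pgrep -f also matches this script's own process, so filter out our PID ($$).
# The check-then-run is racy (two copies started together can both pass); flock is safer.
pgrep -f myjob.sh | grep -qvx "$$" && exit 0
# ... rest of script ...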

Database-backed lock (for distributed jobs)

If multiple servers might run the same job, file-based locks aren't enough. Use a centralized lock: Redis SET with the NX and EX options (a SETNX-style lock that also expires), PostgreSQL advisory locks (pg_try_advisory_lock), or a "scheduler_lock" row.
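
For the Redis option, a minimal bash sketch might look like the following. It assumes redis-cli is installed, that REDIS_HOST (a hypothetical variable) points at a Redis server every job host can reach, and that the key name lock:myjob is your own choice. It uses SET with NX and EX rather than bare SETNX so the lock expires if its holder crashes:

```bash
# Hypothetical settings: a key name of your choosing and a TTL longer than the job's worst case.
LOCK_KEY="lock:myjob"
LOCK_TTL=3600
# Identify this holder, so we only ever release our own lock (bash sets $HOSTNAME).
LOCK_TOKEN="${HOSTNAME:-host}-$$"

acquire_redis_lock() {
  local reply
  # NX: set only if the key doesn't exist. EX: expire after LOCK_TTL seconds if we crash.
  reply="$(redis-cli -h "${REDIS_HOST:-localhost}" SET "$LOCK_KEY" "$LOCK_TOKEN" NX EX "$LOCK_TTL")"
  [ "$reply" = "OK" ]
}

release_redis_lock() {
  # Check-and-delete atomically in Lua, so a lock that expired and was retaken isn't deleted by mistake.
  redis-cli -h "${REDIS_HOST:-localhost}" EVAL \
    "if redis.call('get', KEYS[1]) == ARGV[1] then return redis.call('del', KEYS[1]) else return 0 end" \
    1 "$LOCK_KEY" "$LOCK_TOKEN" >/dev/null
}

# Usage at the top of the job:
# acquire_redis_lock || exit 0
# trap release_redis_lock EXIT
```

If the key expires while the job is still running, a second host can take the lock, so size LOCK_TTL generously. Note too that this sketch treats an unreachable Redis the same as "lock held", so the job quietly skips; pair it with failure alerting.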

Add jitter to avoid stampedes

If 100 jobs all scheduled at 0 * * * * fire simultaneously, they may overwhelm shared resources (databases, APIs, the load balancer). Jenkins solves this with its H (hash) operator; for plain cron, you have two options.

Schedule at off-peak minutes

Instead of every job firing at :00, distribute them:

3 * * * *  /path/to/job-a.sh    # :03 past every hour
7 * * * *  /path/to/job-b.sh    # :07
13 * * * * /path/to/job-c.sh    # :13

Add sleep at the top of the script

sleep $((RANDOM % 60))         # Random delay 0-59 seconds; $RANDOM is bash-only, not /bin/sh
# ... real work ...

Each invocation starts at a random point within the first minute. This is useful when you have 1,000 servers all running the same cron entry at the same time and want to spread their load on a shared backend.
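
A variation is to derive the delay from the hostname instead of $RANDOM, so each server always waits the same amount while different servers still spread out, similar in spirit to Jenkins's H operator. A minimal bash sketch; the stable_jitter name and the 300-second window are illustrative choices:

```bash
# Map a key (for example the hostname) to a stable offset in [0, max) seconds.
# cksum prints "<CRC> <byte count>"; the CRC serves as a cheap, deterministic hash.
stable_jitter() {
  local key="$1" max="$2" crc
  crc="$(printf '%s' "$key" | cksum | cut -d ' ' -f 1)"
  echo $(( crc % max ))
}

# Usage at the top of a job script (bash sets $HOSTNAME):
# sleep "$(stable_jitter "$HOSTNAME" 300)"
```

A fixed per-host offset also makes run times predictable, which helps when reading logs across a fleet.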

Pick the right time window

Off-hours batch jobs

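For nightly backups, ETL runs, and other batch work, schedule in the early-morning hours but outside your time zone's DST changeover window, or run the scheduler in UTC, which has no DST. This avoids:

The DST transition window: in US time zones, 2:00-2:59 AM does not exist on the spring-forward day and 1:00-1:59 AM occurs twice on the fall-back day; other regions switch at different hours (see DST guide)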
The midnight/end-of-day spike when other jobs cluster at 00:00
Business hours when users would notice slowness

Don't use round numbers

Many jobs are scheduled at 0 0 * * * or 0 * * * *, so the top of the hour (:00) is a common load spike on shared services. Schedule at :03, :13, :23 and so on instead; the difference in run time is invisible, but the load-spreading is real.

Consider downstream load

If your job calls a third-party API, check its rate limits and recommended off-peak hours. GitHub Actions is a well-known example: GitHub's documentation warns that scheduled workflows can be delayed under high load, which includes the start of every hour.

Retry strategy

Not every failure should be retried. The right strategy depends on the type of failure:

Failure type	Strategy
Transient (network, 5xx, timeout)	Retry with exponential backoff
Rate-limited (429)	Retry with backoff that respects the Retry-After header
Authentication/authorization (401, 403)	Do not retry; alert immediately, credentials need fixing
Bad input (400, ValidationError)	Do not retry — the input won't change
Unknown	Retry 2-3 times, then alert
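
The table above could be turned into a small bash helper like the sketch below. It assumes the job talks HTTP via curl, whose -w '%{http_code}' output is 000 when no response arrived; the function names, MAX_ATTEMPTS, and BASE_DELAY are illustrative. It does not parse Retry-After, so a real 429 handler would need to read that header and wait accordingly:

```bash
# Return 0 if an HTTP status is worth retrying, 1 if the job should stop and alert.
is_retryable_status() {
  case "$1" in
    000|429|5??) return 0 ;;   # network error/timeout, rate limit, server error
    401|403)     return 1 ;;   # credentials need fixing; retrying won't help
    4??)         return 1 ;;   # bad input; the request won't change
    *)           return 0 ;;   # unknown: let the bounded retry loop decide
  esac
}

# Run a command, retrying with exponential backoff: BASE_DELAY, 2x, 4x, ... seconds.
retry_with_backoff() {
  local max="${MAX_ATTEMPTS:-3}" delay="${BASE_DELAY:-5}" attempt=1
  until "$@"; do
    if [ "$attempt" -ge "$max" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    sleep "$delay"
    delay=$(( delay * 2 ))
    attempt=$(( attempt + 1 ))
  done
}
```

Capping attempts keeps the "Unknown" row honest: a few tries, then a non-zero exit that your alerting can catch.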

Idempotency makes retries safe

If you've followed the first principle, retries are safe. If you haven't, retries make corruption worse.

Cap retry duration

If a job is supposed to take 10 minutes and is now 6 hours in, it's not retrying — it's stuck. Use timeout as a safety belt:

0 2 * * * timeout 30m /path/to/script.sh || alert-on-failure
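
When the limit is hit, GNU timeout exits with status 124, which lets a wrapper report a hung job differently from one that failed normally. A minimal bash sketch; run_with_limit is a hypothetical helper name, and alert-on-failure remains a placeholder for your alerting command:

```bash
# Run a command under a hard time limit and label timeouts separately from failures.
# -k 1m sends SIGKILL if the command is still alive one minute after the initial SIGTERM.
run_with_limit() {
  local limit="$1" status=0
  shift
  timeout -k 1m "$limit" "$@" || status=$?
  if [ "$status" -eq 124 ]; then
    echo "TIMEOUT after $limit: $*" >&2
  elif [ "$status" -ne 0 ]; then
    echo "FAILED (exit $status): $*" >&2
  fi
  return "$status"
}

# Usage inside a wrapper script:
# run_with_limit 30m /path/to/script.sh || alert-on-failure
```

If -k has to send SIGKILL, timeout exits with 137 (128 + 9) rather than 124, so a stricter wrapper may want to treat that as a timeout too.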

Bake in observability

(See our monitoring guide for details.)

Minimum viable observability for any production cron job:

Stdout/stderr to a rotated log file
Exit code recorded
Heartbeat ping on success
Loud alert on failure
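
Those four items fit in a small bash wrapper. This is a sketch under assumptions: run_job is a hypothetical name, HEARTBEAT_URL stands for whatever check-in URL your monitoring service gives you, log rotation is left to a tool such as logrotate, and the "loud alert" is a message on stderr, which cron mails to the crontab owner when mail delivery is configured:

```bash
# Run a command, append its output and exit code to a log, then report the outcome.
run_job() {
  local log="$1" started status=0
  shift
  started="$(date -u +%Y-%m-%dT%H:%M:%SZ)"
  # Capture stdout and stderr in the (externally rotated) log file.
  "$@" >>"$log" 2>&1 || status=$?
  # Record the exit code alongside a UTC start timestamp.
  echo "$started exit=$status cmd=$*" >>"$log"
  if [ "$status" -eq 0 ]; then
    # Heartbeat on success; skipped when HEARTBEAT_URL is unset.
    if [ -n "${HEARTBEAT_URL:-}" ]; then
      curl -fsS -m 10 --retry 3 "$HEARTBEAT_URL" >/dev/null || echo "heartbeat ping failed" >&2
    fi
  else
    # Any output makes cron send mail; a real setup might page someone instead.
    echo "ALERT: '$*' failed with exit $status (see $log)" >&2
  fi
  return "$status"
}

# Usage: run_job /var/log/myjob.log /path/to/script.sh
```

The heartbeat covers the failure mode alerts miss: a job that never ran at all produces no error, only a missing ping.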

Environment hygiene

1. Set PATH explicitly

PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
0 9 * * * /path/to/script.sh

2. Set the working directory

Don't depend on relative paths: cron typically starts jobs in the user's home directory, not the script's directory. Either cd at the top of the script or use absolute paths.
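
For the cd option, a common bash idiom resolves the script's own directory first. A minimal sketch; SCRIPT_DIR is just a conventional variable name:

```bash
# Resolve the directory containing this script (bash-specific: BASH_SOURCE).
SCRIPT_DIR="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" && pwd)"
# Make relative paths resolve next to the script rather than in $HOME.
cd -- "$SCRIPT_DIR" || exit 1
```

With this at the top, a relative path such as ./config.ini (a hypothetical file) resolves next to the script no matter which directory cron starts in.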

3. Pin the shell

SHELL=/bin/bash
0 9 * * * /path/to/script.sh

4. Document the schedule in the script itself

#!/bin/bash
# Schedule: 0 2 * * *  (every day at 2 AM UTC)
# Owner: data-platform
# Runbook: https://wiki/runbooks/nightly-backup

When someone finds this script 18 months later, they should be able to figure out everything they need to know without git-log archaeology.

The 10-second summary

Idempotent: safe to run twice
Locked: won't run twice concurrently
Off-peak: scheduled outside DST windows and round numbers
Monitored: stdout to log, heartbeat ping on success, alert on failure
Documented: the script itself explains what, who, and how

Do these five things on every job. Almost every "cron disaster" you'll read about traces back to a job that skipped one or more of them.

Written by Murugan Vellaichamy, software engineer.
