Scheduling best practices.

Writing a cron job is easy. Writing a cron job that you can confidently run unattended for years is harder. This guide collects the patterns that experienced operators apply to every scheduled job, regardless of platform.

Make every job idempotent

An idempotent job produces the same result whether it runs once or five times. This is the single most important property of any scheduled job — without it, every problem (DST duplication, retry storms, manual reruns) becomes a data-corruption incident.

Idempotent patterns

  • Database upserts: INSERT … ON CONFLICT DO NOTHING (PostgreSQL, SQLite) or INSERT … ON DUPLICATE KEY UPDATE (MySQL)
  • File-based sentinels: at the end of the job, write a "done" marker; check for it at the start
  • Versioned outputs: write to output-2024-01-15-T02-30.json, never overwrite
  • Atomic moves: write to file.tmp, rename to file only on success — readers never see partial output
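
The sentinel and atomic-move patterns above can be combined in a few lines of shell. This is a minimal sketch rather than a definitive implementation: the function names already_done and publish_atomically, and the ".done" marker naming, are assumptions made for illustration.

```shell
#!/bin/bash
# Sketch: sentinel check plus atomic publish. All names here are hypothetical.

# Succeeds if a previous run already left a "done" marker for this output.
already_done() {
  [ -e "$1.done" ]
}

# Runs a command with stdout going to a temp file next to the target, then
# renames the temp file into place only if the command succeeded. A rename
# within one filesystem is atomic, so readers never see partial output.
# Note: mktemp creates the file with mode 0600; chmod it before the mv if
# other users need to read the result.
publish_atomically() {
  local target="$1" tmp
  shift
  tmp="$(mktemp "${target}.tmp.XXXXXX")" || return 1
  if "$@" > "$tmp" && mv -f "$tmp" "$target"; then
    touch "${target}.done"   # sentinel is written last, only on success
  else
    rm -f "$tmp"
    return 1
  fi
}
```

A job might then start with something like `already_done out.json || publish_atomically out.json generate_report`, where generate_report stands in for the real work.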

Non-idempotent patterns to avoid

  • "Send 100 emails to people who haven't received today's report" → if you run it twice, you send 200 emails. Fix: mark each person as sent, only select unsent.
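  • "Increment a counter by 1" → if the job is retried or rerun after an interruption, the counter is over-counted. Fix: store the resulting state (for example, an absolute value computed from source data), not a delta.
  • "Append a line to a log file" → fine for a single instance, but if multiple instances run concurrently their lines can interleave, and every rerun appends duplicates. Fix: log per-PID, or use a write-once-per-key scheme.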
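
The "mark each person as sent, only select unsent" fix can be sketched with a plain-text ledger. Everything named here (LEDGER, send_report, send_once) is hypothetical, and a real job would more likely keep this state in its database.

```shell
#!/bin/bash
# Sketch: "mark as sent, only select unsent" with a plain-text ledger.
# LEDGER and send_report are hypothetical names for this example.
LEDGER="${LEDGER:-/var/tmp/report-sent-$(date +%F).txt}"

send_report() {            # stand-in for the real delivery command
  echo "sending report to $1"
}

send_once() {
  local recipient="$1"
  # Skip anyone already recorded for today (exact, whole-line match).
  if [ -f "$LEDGER" ] && grep -qxF -- "$recipient" "$LEDGER"; then
    return 0
  fi
  # Record only after a successful send, so a failed send is retried next run.
  send_report "$recipient" && printf '%s\n' "$recipient" >> "$LEDGER"
}
```

Running the job twice now sends each report once. A crash between the send and the ledger write can still produce one duplicate, and the ledger is not safe against concurrent runs, so this pairs with a lock.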

Prevent overlapping runs

A job that takes 65 minutes but is scheduled hourly will have two instances running at once as soon as the second run starts. They'll race, corrupt each other's output, or both fail because of database locks.

flock — the simplest solution

0 * * * * /usr/bin/flock -n /tmp/myjob.lock /usr/local/bin/myjob.sh

flock -n fails immediately (exit status 1) if the lock is held, so if a previous instance is still running, the new run exits without doing any work. This is usually what you want for periodic jobs.
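
The same lock can also be taken inside the script, which keeps the crontab line short and protects manual runs too. A minimal sketch, assuming the util-linux flock command is installed; LOCKFILE and the function name are hypothetical.

```shell
#!/bin/bash
# Sketch: take the lock inside the script instead of in the crontab line.
# LOCKFILE's path and the function name are assumptions for illustration.
LOCKFILE="${LOCKFILE:-/tmp/myjob.lock}"

acquire_lock_or_skip() {
  # Open the lock file on file descriptor 9. The descriptor stays open, and
  # the lock stays held, until the script exits, even if it crashes.
  exec 9>"$LOCKFILE" || return 1
  # -n: do not wait. A non-zero status means another instance holds the lock.
  flock -n 9
}

# Typical use at the top of the job:
#   acquire_lock_or_skip || { echo "previous run still active, skipping" >&2; exit 0; }
```

Because the kernel releases the lock when the process exits, there is no stale lock file to clean up after a crash.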

Process check

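# pgrep -f also matches this script's own process, so exclude our PID ($$).
# Still racy and fooled by wrapper shells or editors; prefer flock where available.
pgrep -f myjob.sh | grep -qvx "$$" && exit 0
# ... rest of script ...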

Database-backed lock (for distributed jobs)

If multiple servers might run the same job, file-based locks aren't enough. Use a centralized lock — Redis SETNX, PostgreSQL advisory locks, or a "scheduler_lock" row.
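
As one illustration of a centralized lock, the sketch below uses Redis SET with the NX and EX options rather than bare SETNX, so that the lock expires on its own if the holder dies. It assumes redis-cli is installed and pointed at a shared Redis server; REDIS_CLI, the key name, and the TTL are hypothetical.

```shell
#!/bin/bash
# Sketch: a cross-server lock using Redis "SET key value NX EX seconds".
# REDIS_CLI, the key name, and the TTL are assumptions for illustration.
REDIS_CLI="${REDIS_CLI:-redis-cli}"

acquire_distributed_lock() {
  local key="$1" owner="$2" ttl_seconds="$3" reply
  # NX: set only if the key does not exist. EX: expire automatically so a
  # crashed holder cannot block the job forever.
  reply="$($REDIS_CLI SET "$key" "$owner" NX EX "$ttl_seconds")" || return 1
  [ "$reply" = "OK" ]
}

# Typical use:
#   acquire_distributed_lock "lock:nightly-etl" "$(hostname):$$" 3600 || exit 0
```

The TTL should comfortably exceed the job's worst-case runtime, otherwise a second server can acquire the lock while the first is still working.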

Add jitter to avoid stampedes

If 100 jobs all scheduled at 0 * * * * fire simultaneously, they may overwhelm shared resources (databases, APIs, the load balancer). Jenkins solves this with its H (hash) operator; for plain cron, you have two options.

Schedule at off-peak minutes

Instead of every job firing at :00, distribute them:

3 * * * *  /path/to/job-a.sh    # :03 past every hour
7 * * * *  /path/to/job-b.sh    # :07
13 * * * * /path/to/job-c.sh    # :13

Add sleep at the top of the script

sleep $((RANDOM % 60))         # Random delay 0-59 seconds (RANDOM needs bash, not plain sh)
# ... real work ...

Each invocation starts at a different time within the minute. Useful when you have 1,000 servers all running the same cron at the same time and you want to spread their load on a shared backend.
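
If you would rather have each server keep the same offset on every run, which makes logs easier to correlate, the delay can be derived from a stable key such as the hostname. This is similar in spirit to a hash-based schedule but is only a sketch; jitter_seconds is a hypothetical helper.

```shell
#!/bin/bash
# Sketch: per-host deterministic jitter. The function name is hypothetical.

# Prints a stable offset in [0, max) derived from a key such as the hostname.
jitter_seconds() {
  local key="$1" max="$2" crc
  # cksum prints "<crc> <byte count>"; keep only the CRC.
  crc="$(printf '%s' "$key" | cksum | cut -d' ' -f1)"
  echo $(( crc % max ))
}

# Typical use at the top of a job:
#   sleep "$(jitter_seconds "$(hostname)" 60)"
```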

Pick the right time window

Off-hours batch jobs

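For nightly backups, ETL runs, and other batch work, schedule between 2 AM and 4 AM. On servers that follow a DST-observing local time zone, first check which hour is skipped or repeated (in the US it is 1:00 to 2:59 AM) and stay clear of it, or run the schedule in UTC. Chosen that way, this window avoids:

  • The DST transition window (see DST guide)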
  • The midnight/end-of-day spike when other jobs cluster at 00:00
  • Business hours when users would notice slowness

Don't use round numbers

Everyone schedules at 0 * * * * or 0 0 * * *. The top of every hour is the busiest minute for scheduled work. Schedule at :03, :13, :23 etc.; the difference in run time is invisible but the load-spreading is real.

Consider downstream load

If your job calls a third-party API, check their rate limits and recommended off-peak hours. GitHub Actions cron is famously delayed at 00:00 UTC because of how many people pick midnight.

Retry strategy

Not every failure should retry. The right strategy depends on the type of failure:

Failure type                          Strategy
Transient (network, 5xx, timeout)     Retry with exponential backoff
Rate-limited (429)                    Retry with backoff that respects the Retry-After header
Authentication (401, 403)             Do not retry; alert immediately, credentials need fixing
Bad input (400, ValidationError)      Do not retry; the input won't change
Unknown                               Retry 2-3 times, then alert
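
The table can be sketched in shell as a status classifier plus a generic backoff loop. Both function names and the attempt and delay parameters are hypothetical, and the sketch does not parse the Retry-After header, which a real client should honor for 429 responses.

```shell
#!/bin/bash
# Sketch: map the failure table to shell. Names and defaults are hypothetical.

# Succeeds (0) if a failed HTTP status is worth retrying, per the table above.
is_retryable_status() {
  case "$1" in
    429|5??)     return 0 ;;  # rate-limited or server error
    400|401|403) return 1 ;;  # bad input or credentials: retrying won't help
    *)           return 0 ;;  # unknown: retry a few times, then alert
  esac
}

# Runs a command up to max_attempts times, doubling the delay after each failure.
retry_with_backoff() {
  local max_attempts="$1" delay="$2" attempt=1
  shift 2
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      return 1                # give up; the caller should alert
    fi
    sleep "$delay"
    delay=$(( delay * 2 ))
    attempt=$(( attempt + 1 ))
  done
}
```

For example, `retry_with_backoff 3 5 /path/to/fetch.sh || alert-on-failure` would try at most three times, waiting 5 and then 10 seconds between attempts.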

Idempotency makes retries safe

If you've followed the first principle, retries are free. If you haven't, retries make corruption worse.

Cap retry duration

If a job is supposed to take 10 minutes and is now 6 hours in, it's not retrying — it's stuck. Use timeout as a safety belt:

0 2 * * * timeout 30m /path/to/script.sh || alert-on-failure
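
When the alert should say whether the job failed or was killed for running too long, the exit status tells them apart: GNU timeout exits with 124 when the limit is reached. A minimal sketch, assuming GNU coreutils; run_with_deadline is a hypothetical name.

```shell
#!/bin/bash
# Sketch: tell "timed out" apart from "failed". The function name is hypothetical.

run_with_deadline() {
  local limit="$1" status
  shift
  timeout "$limit" "$@"
  status=$?
  if [ "$status" -eq 124 ]; then
    echo "timed out after $limit: $*" >&2   # GNU timeout exits 124 on expiry
  elif [ "$status" -ne 0 ]; then
    echo "failed with status $status: $*" >&2
  fi
  return "$status"
}
```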

Bake in observability

(See our monitoring guide for depth.)

Minimum viable observability for any production cron job:

  • Stdout/stderr to a rotated log file
  • Exit code recorded
  • Heartbeat ping on success
  • Loud alert on failure
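
The four items above fit in a small wrapper. This is a sketch under stated assumptions: LOG_FILE, HEARTBEAT_CMD, and ALERT_CMD are hypothetical settings to be pointed at your own tooling, and log rotation is left to a tool such as logrotate.

```shell
#!/bin/bash
# Sketch: minimal wrapper covering the four items above.
# LOG_FILE, HEARTBEAT_CMD and ALERT_CMD are hypothetical; point them at
# whatever logging, heartbeat and alerting tools you actually use.
LOG_FILE="${LOG_FILE:-/var/log/myjob.log}"
HEARTBEAT_CMD="${HEARTBEAT_CMD:-true}"   # e.g. a curl call to your monitor
ALERT_CMD="${ALERT_CMD:-true}"           # e.g. a pager or chat notification

run_monitored() {
  local status
  echo "$(date -u +%FT%TZ) start: $*" >> "$LOG_FILE"
  "$@" >> "$LOG_FILE" 2>&1               # stdout and stderr to the log
  status=$?
  echo "$(date -u +%FT%TZ) exit=$status" >> "$LOG_FILE"   # exit code recorded
  if [ "$status" -eq 0 ]; then
    $HEARTBEAT_CMD                       # heartbeat only on success
  else
    $ALERT_CMD "myjob failed with status $status"
  fi
  return "$status"
}
```

Pinging the heartbeat only on success means a job that never starts is caught too, because the monitor notices the missing ping.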

Environment hygiene

1. Set PATH explicitly

PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
0 9 * * * /path/to/script.sh

2. Set the working directory

Don't depend on relative paths. Either cd at the top of the script or use absolute paths.
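
One common way to do the cd is to anchor the working directory to the script's own location. A minimal sketch; enter_dir_of is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch: anchor the working directory to the script's own location.
# The function name is hypothetical.

enter_dir_of() {
  # "--" guards against paths that start with a dash.
  cd -- "$(dirname -- "$1")" || return 1
}

# Typical use at the top of a job; abort rather than run in the wrong place:
#   enter_dir_of "${BASH_SOURCE[0]}" || exit 1
```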

3. Pin the shell

SHELL=/bin/bash
0 9 * * * /path/to/script.sh

4. Document the schedule in the script itself

#!/bin/bash
# Schedule: 0 2 * * *  (every day at 2 AM UTC)
# Owner: data-platform
# Runbook: https://wiki/runbooks/nightly-backup

When someone finds this script 18 months later, they should be able to figure out everything they need to know without git-log archaeology.

The 10-second summary

  1. Idempotent: safe to run twice
  2. Locked: won't run twice concurrently
  3. Off-peak: scheduled outside DST windows and away from round-number minutes
  4. Monitored: stdout to log, heartbeat ping on success, alert on failure
  5. Documented: the script itself explains what, who, and how

Do these five things on every job. Almost every "cron disaster" you'll read about traces back to a job that skipped one or more of them.
