Reconciliation
Ring's scheduler is a loop that compares desired state (what you wrote in YAML) to actual state (what the runtime reports) and issues commands to close the gap. Same model as Terraform or Kubernetes: there is no event-driven imperative pipeline; everything converges from periodic ticks.
The tick
every <interval>: for each deployment in DB where status != deleted: observed = runtime.list_instances(deployment) desired = deployment.replicas if len(observed) < desired: create the missing instances if len(observed) > desired: remove the surplus for each instance: run health checks, record results, fire on_failure for each deployment in DB where status == deleted: runtime.remove_all_instances(deployment); purge
Default interval: 10 seconds. Override with RING_SCHEDULER_INTERVAL=<seconds> or [scheduler] interval = <seconds> in config.toml. Faster ticks mean faster recovery and faster health checks at the cost of more CPU.
The whole runtime.apply() for one deployment is wrapped in tokio::time::timeout(RING_APPLY_TIMEOUT) (default 300s). It bounds one deployment's work inside one tick, not the whole cycle and not the ring apply client call.
Workers vs jobs
The reconciler treats kind: worker and kind: job differently.
Worker (default)
A long-running service. The reconciler keeps exactly replicas instances alive. If a container crashes or you delete it manually, the next tick recreates it. Updating the manifest triggers a rolling update (if health checks are declared) or an immediate replacement.
A worker reaches running as soon as its container is up, unless it declares a readiness: true check, in which case it stays creating until that check is green (see Health checks: the readiness gate). A readiness check that never turns green fails the deployment after RING_ROLLOUT_DEADLINE (default 600s).
For the full set of statuses a deployment can hold, what moves it between them, and which are terminal, see Deployment status lifecycle.
Job
A one-shot task. The reconciler boots one instance (replicas is ignored), waits for it to exit, and records the result:
| Exit | Final status |
|---|---|
| Container exits with code 0 | completed |
| Container exits with non-zero code | failed |
| Container is killed by OOM / signal | failed |
| Job times out (host-side) | failed |
On Cloud Hypervisor, the host can't see the guest's exit code, so any clean VM shutdown is treated as completed. Use a worker for anything that needs precise exit-code semantics on CH.
Rolling updates
A deployment that declares at least one health check gets a rolling update on ring apply:
- Ring finds an active deployment with the same
name+namespace. - A child deployment is created (with
parent_idpointing at the old one) using the new manifest. - The reconciler boots the child's instances. Old containers keep serving traffic.
- Once the child's readiness gate opens (see Health checks), Ring removes one old instance.
- When the parent has zero instances, it's marked
deleted.
If the child never becomes healthy, the parent stays running and the child is marked failed. No traffic is dropped, no operator action needed to roll back: just inspect and ring apply a fix.
Rolling updates are skipped (immediate replacement, brief downtime) when:
- The deployment declares no health checks
ring apply --forceis set- Multiple active deployments share the same
name+namespace(unusual; fix the duplicates first)
Each skip emits a ForceReplace event with the precise reason.
Crash detection
How fast Ring notices a dead container depends on the runtime:
| Runtime | Detection path | Latency |
|---|---|---|
| Docker | Live Docker event stream (die, oom, kill, start) plus tick-based reconciliation | Sub-second for crashes; tick-bound for slower failures |
| Cloud Hypervisor | Tick-based scan of .sock files in socket_dir; no event stream from CH | Bounded by [scheduler] interval |
In both cases, the missing instance is recreated automatically on the next tick.
Restart policy
Ring tracks restart attempts per deployment. Past MAX_RESTART_COUNT (currently 5) failed boots, the deployment lands in crash_loop_back_off and the reconciler stops trying. The counter is cumulative for the lifetime of the deployment, not a sliding window; fix the underlying issue and re-apply the manifest to reset.
This protects the host from a tight crash loop pegging Docker / Cloud Hypervisor.
What survives a ring server restart
Every input to the loop lives in SQLite:
- Deployments and their desired state
- Instance records (which container/VM corresponds to which deployment)
- Health check history (last 7 days, cap 50 per deployment)
- Events
Two things don't survive:
- Health-check failure counters: they live in memory. After a restart, each
(deployment, instance, check)triple starts back at zero. A flapping service won't triggeron_failureimmediately after a server restart. - In-flight runtime operations: if
ring servercrashes mid-apply, the partial state is detected on the next tick and the reconciler converges.
See also
- Health checks design: how probes feed the loop
- Architecture: where the reconciler sits in the process
- Runtimes: what the runtime adapter actually does