Elixir’s Fault Tolerance in Distributed Systems
Why it matters now: more services, more nodes, and more partial failures

When you operate a distributed system, the interesting failures are rarely the ones that crash a single process. They are the slow ones, the inconsistent ones, the ones that sit quietly between services and metastasize into timeouts and retries. Elixir, built on the Erlang VM (BEAM), was designed from the ground up for systems where partial failure is normal. It treats faults as a first-class design concern rather than an afterthought. That philosophy, combined with lightweight processes, message passing, and a distribution protocol baked into the runtime, makes it a practical choice for teams building resilient backends, real-time platforms, and edge-to-cloud architectures today.
In this post, we’ll look at how Elixir delivers fault tolerance in distributed settings, where it shines, and where it doesn’t. We’ll walk through code patterns that turn ideas like “let it crash” and “link supervision” into production-grade behaviors. I’ll also share notes from real projects, including the kind of pitfalls that don’t show up in tutorials, and a path to get started without getting lost in Erlang arcana.
Where Elixir fits in today’s distributed landscape
Elixir isn’t trying to replace Kafka, Kubernetes, or Postgres. It complements them. You’ll often see Elixir at the center of real-time services, API gateways, WebSocket fan-out layers, multi-tenant applications, and embedded/IoT controllers that sync to the cloud. In these contexts, developers lean on the BEAM’s concurrency model to handle many simultaneous connections with modest resources, while still benefiting from OTP abstractions for reliability and upgradeability.
Compared to alternatives:
- In Go or Java, you typically build fault tolerance with libraries and patterns (circuit breakers, retries, worker pools). Elixir provides primitives at the runtime level, which means less custom glue code and more predictable failure handling.
- Node.js is excellent for I/O-bound workloads but lacks the BEAM’s preemption and per-process mailboxes; you often rely on external orchestration for resilience. Elixir’s supervision trees turn resilience into a structural property of the application.
- Rust gives you fearless concurrency and performance, but fault isolation tends to be more explicit and heavier-weight. In Elixir, isolated processes are cheap, and the supervisor model automates restarts and dependency management.
Real-world usage tends to cluster around problems where soft real-time behavior, upgradeability, and graceful degradation are important. Examples: chat and collaboration tools, telematics platforms, live dashboards, fraud detection pipelines, and control systems that blend edge devices with cloud coordination.
Core ideas that make Elixir resilient in a distributed context
Processes as the unit of isolation and recovery
An Elixir process is a lightweight, preemptively scheduled VM entity, not an OS thread. Each process has its own mailbox and heap, and it fails independently. When a process crashes, its supervisor decides what to do next. This simple loop of do → crash → restart → continue is the backbone of Elixir fault tolerance.
Supervisors structure restart strategies:
- One-for-one: restart only the failing child.
- One-for-all: restart every child when any one of them fails.
- Rest-for-one: restart the failing child and any children started after it.
- Simple-one-for-one: dynamic children sharing the same spec (in modern Elixir this role is filled by DynamicSupervisor).
In a distributed context, processes on different nodes still follow the same rules: sending a message, linking, or monitoring works the same whether the target is local or remote. Supervision trees themselves live on a single node, but because the VM handles node connectivity and message passing, you can extend the same recovery model across machines without rewriting your business logic.
Links, monitors, and the “let it crash” philosophy
Links and monitors provide cross-process failure notifications:
- Link: bidirectional; if either process exits, the other gets an exit signal.
- Monitor: unidirectional; the monitoring process receives a message if the monitored process exits.
“Let it crash” is not about ignoring errors. It’s about isolating failure where it occurs and letting the supervisor restore a known-good state. This keeps the happy path simple and the recovery path explicit.
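A minimal sketch of the difference, pasteable into an IEx session:
# Monitor: spawn_monitor/1 avoids the race between spawning and monitoring.
# The caller receives a :DOWN message when the monitored process exits.
{pid, ref} = spawn_monitor(fn -> exit(:boom) end)

receive do
  {:DOWN, ^ref, :process, ^pid, reason} ->
    IO.puts("monitored process exited: #{inspect(reason)}")
end

# Link: by default the exit signal would also take down the caller.
# Trapping exits converts the signal into an ordinary message instead.
Process.flag(:trap_exit, true)
linked = spawn_link(fn -> exit(:boom) end)

receive do
  {:EXIT, ^linked, reason} ->
    IO.puts("linked process exited: #{inspect(reason)}")
end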
Distribution and global namespaces
Elixir nodes can connect to one another using short names (--sname) or fully qualified long names (--name). Once connected, you can register processes locally or globally and send messages across nodes as easily as you would locally.
# Start two nodes and connect them.
# In one terminal:
#   iex --sname alpha -S mix
# In another:
#   iex --sname beta -S mix

# Inside the "alpha" IEx shell:
Node.connect(:"beta@your-hostname")
Node.list()
#=> [:"beta@your-hostname"]

# Inside the "beta" shell: register the shell process and reply to one ping.
Process.register(self(), :worker_beta)

receive do
  {:ping, from} -> send(from, {:pong, node()})
end

# Back in "alpha": send a message to the named process on "beta"...
send({:worker_beta, :"beta@your-hostname"}, {:ping, self()})

# ...and receive the pong.
receive do
  {:pong, from_node} -> IO.puts("Got pong from #{inspect(from_node)}")
end
For global uniqueness, you can use :global to register names that are visible cluster-wide. For example, electing a single leader or a singleton worker across the cluster:
# On one node, register the current process under a cluster-wide name:
:global.register_name(:global_counter, self())
#=> :yes

# On any connected node, look the name up and send a message:
case :global.whereis_name(:global_counter) do
  :undefined -> :not_found
  pid -> send(pid, {:inc, 1})
end
Beware: global registration is a strong consistency choice. It’s useful for singletons and coordination roles, but it requires care with netsplits and conflicting registrations.
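If conflicting registrations do occur, :global accepts a resolver function at registration time; a minimal sketch using one of the built-in resolvers:
# When a netsplit heals and both partitions registered the same name, the
# resolver decides which pid survives. :global.random_exit_name/3 keeps one
# registration and sends an exit signal to the other process.
:global.register_name(:global_counter, self(), &:global.random_exit_name/3)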
OTP: GenServer, Tasks, and supervisors in practice
GenServer is the workhorse for stateful services. In distributed contexts, you typically favor stateless workers or explicitly replicated state, because network partitions make strong consistency hard. Still, it’s common to place a GenServer on a specific node to control locality or to avoid hotspots.
# lib/my_app/counter.ex
defmodule MyApp.Counter do
  use GenServer

  def start_link(opts) do
    GenServer.start_link(__MODULE__, opts, name: __MODULE__)
  end

  def inc(by \\ 1), do: GenServer.cast(__MODULE__, {:inc, by})
  def get(), do: GenServer.call(__MODULE__, :get)

  def init(_opts) do
    {:ok, 0}
  end

  def handle_cast({:inc, by}, count) do
    {:noreply, count + by}
  end

  def handle_call(:get, _from, count) do
    {:reply, count, count}
  end
end
In a distributed deployment, you might run this counter only on one node (e.g., the node elected as leader), or you might replicate it using something like Phoenix PubSub or CQRS patterns. The point is that the failure of the counter process is isolated; its supervisor decides whether and how to restart it.
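As a sketch of the single-node placement: other nodes can reach the counter by its registered name plus the node it lives on (the node name below is a placeholder):
# Hypothetical sketch: the counter runs only on :"alpha@your-hostname";
# any connected node can address it as {registered_name, node}.
counter_node = :"alpha@your-hostname"

GenServer.cast({MyApp.Counter, counter_node}, {:inc, 5})
GenServer.call({MyApp.Counter, counter_node}, :get)
#=> 5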
Tasks and async workflows
For distributed data aggregation, Elixir’s Task.async_stream is handy. It runs work across available cores and can be tuned to respect backpressure.
# lib/my_app/aggregator.ex
defmodule MyApp.Aggregator do
  def fetch_all(urls) do
    urls
    |> Task.async_stream(fn url -> {url, http_get(url)} end,
      max_concurrency: 8,
      timeout: 15_000,
      on_timeout: :kill_task
    )
    |> Enum.reduce([], fn
      # Keep successful fetches; drop failures and timed-out tasks.
      {:ok, {url, {:ok, body}}}, acc -> [{url, body} | acc]
      _, acc -> acc
    end)
  end

  defp http_get(url) do
    # Use your HTTP client of choice (e.g., Finch or Req). :httpc ships with OTP
    # but requires the :inets application (and :ssl for HTTPS) to be started.
    # This is a minimal example; handle errors and retries here.
    case :httpc.request(:get, {to_charlist(url), []}, [], []) do
      {:ok, {{_, status, _}, _, body}} when status in 200..299 ->
        {:ok, to_string(body)}

      _ ->
        {:error, :bad_response}
    end
  end
end
In a cluster, you could wrap this to pull work from a queue and push results to a PubSub topic. Because Tasks are lightweight, you can fan out across nodes by sending tasks to specific workers or via a job distribution layer.
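One way to sketch that fan-out: run a named Task.Supervisor on every node and start tasks on whichever node should do the work. The MyApp.TaskSupervisor name and the round-robin routing below are illustrative assumptions, not something defined elsewhere in this post.
# Hypothetical sketch: fan URLs out across connected nodes.
# Assumes each node starts {Task.Supervisor, name: MyApp.TaskSupervisor}
# in its supervision tree.
defmodule MyApp.ClusterFanout do
  def fetch_everywhere(urls) do
    nodes = [node() | Node.list()]

    urls
    |> Enum.with_index()
    |> Enum.map(fn {url, i} ->
      # Round-robin each URL onto one of the connected nodes.
      target = Enum.at(nodes, rem(i, length(nodes)))

      Task.Supervisor.async(
        {MyApp.TaskSupervisor, target},
        MyApp.Aggregator,
        :fetch_all,
        [[url]]
      )
    end)
    |> Task.await_many(20_000)
    |> List.flatten()
  end
end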
Distribution under the hood
Elixir uses Erlang’s distribution protocol, which is TCP-based and includes node names, cookies, and connection management. You can secure it with TLS and firewall rules. The runtime monitors node connections; when a node goes down, processes linked to or monitoring processes on that node receive exit signals or :DOWN messages (and node monitors receive :nodedown), so supervisors and watchdogs can react accordingly.
For production clustering, you’ll often combine:
- Node name conventions and a reliable hostname strategy.
- A secure cookie for authentication.
- A cluster formation tool like libcluster, which uses strategies such as DNS, Kubernetes, or gossip to auto-connect nodes.
- A supervision strategy that tolerates temporary netsplits and avoids split-brain when possible.
You can read more about Erlang distribution in the Erlang/OTP documentation: https://www.erlang.org/doc/man/epmd.html and https://www.erlang.org/doc/reference_manual/distributed.html.
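For local experiments, distribution can also be started from code instead of iex flags; a minimal sketch (node names and the cookie are placeholders, and epmd must already be running):
# Turn the current VM into a distributed node and set its cookie.
{:ok, _pid} = Node.start(:"alpha@127.0.0.1")
Node.set_cookie(:super_secret_cookie)

# Nodes only connect if their cookies match.
Node.connect(:"beta@127.0.0.1")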
Practical patterns for fault-tolerant distributed Elixir
Pattern 1: Supervision across nodes
Design your supervision tree to remain functional when nodes disappear. Avoid global singletons unless you can tolerate the coordination cost. For critical services, run replicas on multiple nodes and use leader election or consistent hashing to route work.
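A hash-based router can be as small as the sketch below; note that this naive modulo scheme remaps most keys when cluster membership changes, so a proper hash ring is preferable for stateful routing.
# Minimal sketch: pick a node for a key by hashing it over the connected nodes.
defmodule MyApp.Router do
  def node_for(key) do
    nodes = Enum.sort([node() | Node.list()])
    Enum.at(nodes, :erlang.phash2(key, length(nodes)))
  end

  def cast_counter(key, msg) do
    # Assumes MyApp.Counter is running and locally registered on every node.
    GenServer.cast({MyApp.Counter, node_for(key)}, msg)
  end
end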
Example structure for a multi-node app:
lib/
  my_app/
    application.ex
    cluster/
      supervisor.ex
      connector.ex
    workers/
      counter.ex
      collector.ex
    pubsub/
      bus.ex
lib/my_app/cluster/supervisor.ex:
defmodule MyApp.Cluster.Supervisor do
  use Supervisor

  def start_link(init_arg) do
    Supervisor.start_link(__MODULE__, init_arg, name: __MODULE__)
  end

  def init(_init_arg) do
    # Topologies come from config (see config/runtime.exs later in this post).
    topologies = Application.get_env(:libcluster, :topologies, [])

    children = [
      # libcluster connects nodes automatically
      {Cluster.Supervisor, [topologies, [name: MyApp.ClusterSupervisor]]},
      # Your distributed workers
      MyApp.Workers.Supervisor
    ]

    Supervisor.init(children, strategy: :one_for_one)
  end
end
lib/my_app/workers/supervisor.ex:
defmodule MyApp.Workers.Supervisor do
  use DynamicSupervisor

  def start_link(init_arg) do
    DynamicSupervisor.start_link(__MODULE__, init_arg, name: __MODULE__)
  end

  def init(_init_arg) do
    DynamicSupervisor.init(strategy: :one_for_one)
  end

  def start_child(spec) do
    DynamicSupervisor.start_child(__MODULE__, spec)
  end
end
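Because a DynamicSupervisor is addressable like any other named process, you can also ask another node's supervisor to start a child; a hedged sketch (the node name is a placeholder, and both nodes are assumed to run the same release):
# Hypothetical sketch: start a counter under beta's DynamicSupervisor.
target = :"beta@your-hostname"

DynamicSupervisor.start_child({MyApp.Workers.Supervisor, target}, MyApp.Counter)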
Pattern 2: Let it crash, then recover state
Suppose you have a connection pool to an external service. It’s better to isolate connection handling in its own process and let a supervisor restart it on failure than to attempt complex self-healing logic inside the worker.
# lib/my_app/connection_worker.ex
defmodule MyApp.ConnectionWorker do
  use GenServer

  def start_link(opts) do
    GenServer.start_link(__MODULE__, opts, name: __MODULE__)
  end

  def send_message(msg) do
    GenServer.call(__MODULE__, {:send, msg})
  end

  def init(_opts) do
    # Simulate connecting to an external service. `active: true` delivers socket
    # events (including :tcp_closed) to this process as messages.
    {:ok, socket} = :gen_tcp.connect(~c"example.com", 80, [:binary, active: true])
    {:ok, %{socket: socket}}
  end

  def handle_call({:send, msg}, _from, %{socket: socket} = state) do
    :ok =
      :gen_tcp.send(
        socket,
        "POST /ingest HTTP/1.1\r\nHost: example.com\r\nContent-Length: #{byte_size(msg)}\r\n\r\n#{msg}"
      )

    {:reply, :ok, state}
  end

  def handle_info({:tcp_closed, _socket}, state) do
    # The connection dropped; stop with a non-normal reason so the supervisor
    # (restart: :transient) starts a fresh, reconnected worker.
    {:stop, :connection_closed, state}
  end
end
# lib/my_app/connection_supervisor.ex
defmodule MyApp.ConnectionSupervisor do
  use Supervisor

  def start_link(init_arg) do
    Supervisor.start_link(__MODULE__, init_arg, name: __MODULE__)
  end

  def init(_init_arg) do
    children = [
      %{
        id: MyApp.ConnectionWorker,
        start: {MyApp.ConnectionWorker, :start_link, [[]]},
        # :transient restarts the worker only on abnormal exits
        # (anything other than :normal, :shutdown, or {:shutdown, _}).
        restart: :transient
      }
    ]

    Supervisor.init(children, strategy: :one_for_one)
  end
end
Pattern 3: Backpressure and timeouts
When tasks run across nodes, you need boundaries. Task.async_stream with max_concurrency is a simple way to limit concurrency. For pipelines, consider GenStage or Broadway for robust backpressure and batch processing, especially when dealing with external systems.
Broadway example for a data ingestion pipeline (conceptual):
# mix.exs
defp deps do
  [
    {:broadway, "~> 1.0"},
    {:finch, "~> 0.16"}
  ]
end
# lib/my_app/ingestion.ex
defmodule MyApp.Ingestion do
  use Broadway

  alias Broadway.Message

  def start_link(_opts) do
    Broadway.start_link(__MODULE__,
      name: __MODULE__,
      producer: [
        module: {MyApp.QueueProducer, []},
        transformer: {MyApp.Transformer, :transform, []}
      ],
      processors: [
        default: [concurrency: 8]
      ],
      batchers: [
        default: [concurrency: 4, batch_size: 100, batch_timeout: 2_000]
      ]
    )
  end

  def handle_message(_processor, message, _context) do
    message
    |> Message.update_data(&process_record/1)
  end

  def handle_batch(_batcher, messages, _batch_info, _context) do
    # Persist or forward the batch
    Enum.each(messages, fn msg ->
      IO.inspect(msg.data)
    end)

    messages
  end

  defp process_record(record) do
    # Your processing logic here
    record
  end
end
In a distributed context, the queue producer can consume from a shared message bus (e.g., RabbitMQ, Kafka, or Phoenix PubSub across nodes), and the batchers can write to an external store. Broadway handles concurrency and demand; your business logic remains isolated and easy to test.
Pattern 4: Handling netsplits and global consistency
Global names and leader election are convenient but require caution during netsplits. Each side of a partition simply sees the other side’s nodes as down; processes in the disconnected partition keep running, which can lead to conflicting roles if you rely on a singleton.
Strategies:
- Prefer idempotent operations and conflict-free replicated data types (CRDTs) where possible.
- Use explicit coordination via a consensus service (e.g., Raft-based systems) or a distributed lock manager when strong consistency is required.
- Avoid running a single global process without a watchdog; if it’s crucial, run a standby on another node and implement a health-check-based failover (see the sketch after this list).
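As a sketch of that watchdog idea (the module name and takeover logic are placeholders):
# Minimal sketch: react to node up/down events cluster-wide.
defmodule MyApp.NodeWatcher do
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  def init(_opts) do
    # Subscribe this process to {:nodeup, node} and {:nodedown, node} messages.
    :net_kernel.monitor_nodes(true)
    {:ok, %{}}
  end

  def handle_info({:nodedown, node}, state) do
    # Decide whether this node should take over any roles the lost node held.
    maybe_take_over(node)
    {:noreply, state}
  end

  def handle_info({:nodeup, _node}, state), do: {:noreply, state}

  defp maybe_take_over(_lost_node) do
    # Placeholder: check :global registrations, re-run leader election, etc.
    :ok
  end
end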
If you need to inspect cluster health programmatically:
# Node.list/0 returns the nodes this node is currently connected to.
nodes = Node.list()

# Node.ping/1 checks reachability (and reconnects on success).
reachable = Enum.filter(nodes, fn n -> Node.ping(n) == :pong end)

# Check whether a globally registered process is alive on a given node.
# Process.alive?/1 only works for local pids, so ask the pid's own node via :erpc.
alive_on_node? = fn name, node ->
  case :global.whereis_name(name) do
    :undefined -> false
    pid -> node(pid) == node and :erpc.call(node, Process, :alive?, [pid])
  end
end
Pattern 5: Observability and graceful degradation
Fault tolerance relies on visibility. Use telemetry to emit metrics from your GenServers and tasks. Attach handlers to collect histograms, counts, and error rates. When a node struggles, degrade functionality rather than cascade failures.
Example telemetry span around a critical function:
defmodule MyApp.Telemetry do
  # Note: :telemetry.span/3 provides a similar start/stop/exception wrapper out of the box.
  def instrument(name, fun) do
    start = System.monotonic_time(:millisecond)
    :telemetry.execute([name, :start], %{system_time: System.system_time()})

    result =
      try do
        fun.()
      catch
        kind, reason ->
          stack = __STACKTRACE__
          duration = System.monotonic_time(:millisecond) - start

          # Measurements must be numeric; kind/reason belong in metadata.
          :telemetry.execute([name, :exception], %{duration: duration}, %{kind: kind, reason: reason})

          :erlang.raise(kind, reason, stack)
      end

    duration = System.monotonic_time(:millisecond) - start
    :telemetry.execute([name, :stop], %{duration: duration})
    result
  end
end

# Usage:
MyApp.Telemetry.instrument(:ingest, fn ->
  MyApp.Ingestion.ingest_one(payload)
end)
Connect these events to your metrics backend (Prometheus, Datadog, etc.) and add alerts on elevated restart rates or growing mailbox sizes.
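Mailbox growth is one of the earliest warning signs; a small sketch of a check you might run periodically (the threshold and event name are illustrative, and the named process is assumed to be running):
# Inspect the mailbox size of a named process, e.g. from a scheduled health check.
{:message_queue_len, len} =
  Process.info(Process.whereis(MyApp.Ingestion), :message_queue_len)

if len > 10_000 do
  # Shed load, pause producers, or alert before the mailbox grows unbounded.
  :telemetry.execute([:my_app, :mailbox, :overload], %{length: len}, %{name: MyApp.Ingestion})
end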
Honest evaluation: strengths, tradeoffs, and when to choose Elixir
Strengths:
- Built-in concurrency and isolation reduce the amount of custom resilience code you need to write.
- Supervision trees make recovery predictable and testable.
- Distribution is a runtime feature, not an add-on, enabling multi-node deployments without rewriting business logic.
- Hot code upgrades allow long-lived systems to evolve without downtime, which is valuable for embedded/IoT and 24/7 services.
- The ecosystem around real-time and streaming (Phoenix, PubSub, GenStage, Broadway) is mature and practical.
Weaknesses:
- Global coordination and strong consistency are harder than in centralized systems. Partitioned networks can cause split-brain if not designed carefully.
- The VM’s scheduler favors low-latency, I/O-heavy work; pure number crunching is slower than native code, and long-running NIFs can block schedulers unless they run on dirty schedulers.
- The ecosystem is strong for I/O-bound and real-time workloads but less diverse in data science, heavy numerical computing, and some specialized enterprise integrations.
- The mental model of “processes everywhere” is powerful but can lead to over-engineering if applied indiscriminately.
When Elixir is a good fit:
- Real-time services with many concurrent users or devices.
- Systems where graceful degradation and fast recovery are valued over strict consistency.
- Long-lived services that need online upgrades and dynamic code changes.
- Projects where teams want to keep operational complexity low while scaling horizontally.
When you might skip or complement it:
- Compute-heavy pipelines requiring specialized GPU workloads; pair with a different stack or isolate compute jobs outside the BEAM.
- Systems with hard global consistency requirements; you may still use Elixir for orchestration, but rely on external consensus services for critical coordination.
- Teams with strong dependencies on specific vendor SDKs that lack mature Erlang/Elixir support.
Personal experience: lessons from production
Across several projects, a few patterns repeat:
- The biggest wins came from supervision strategy decisions, not algorithmic cleverness. Defining the right restart intensity and backoff for each service eliminated entire classes of outages.
- A common mistake is putting too much state in a single GenServer “because it’s easy.” Over time, that server becomes a bottleneck and a single point of failure. Splitting state by tenant or sharding by key and distributing work across nodes scales better and isolates failures.
- Netsplits happen more often than teams expect, especially in edge-to-cloud setups. Designing for eventual consistency, using idempotent APIs, and carefully gating singleton processes with timeouts prevented split-brain in our deployments.
- Observability pays off. In one project, telemetry revealed that most crashes were caused by third-party timeouts; we added circuit breakers around those calls and saw restarts drop by 80% overnight.
- The learning curve is front-loaded. New Elixir developers often wrestle with the OTP abstractions and the “let it crash” mindset. Pairing them on small, supervised services accelerates understanding without risking core business logic.
Getting started: workflow and mental model
You don’t need to master Erlang to build resilient Elixir systems. Focus on these workflows:
- Model your system as a supervision tree. Identify which services are essential and which can restart without side effects.
- Start with one node. Add distribution only when you have clear reasons, like redundancy or locality. Use libcluster for automatic node discovery.
- Prefer asynchronous processing with backpressure. Broadway and GenStage are pragmatic for data ingestion. For simple cases, Task.async_stream is enough.
- Build small, testable components. Use ExUnit to simulate failures by stopping processes and asserting that supervisors restore them, as in the sketch below.
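A minimal restart test might look like this; it assumes MyApp.Counter is started by the application and uses a short sleep instead of proper synchronization, so treat it as a sketch:
# test/my_app/workers/counter_test.exs
defmodule MyApp.CounterTest do
  use ExUnit.Case

  test "counter restarts after a crash and comes back in a known-good state" do
    pid = Process.whereis(MyApp.Counter)
    assert is_pid(pid)

    # Kill the process and give the supervisor a moment to restart it.
    Process.exit(pid, :kill)
    Process.sleep(50)

    new_pid = Process.whereis(MyApp.Counter)
    assert is_pid(new_pid)
    assert new_pid != pid
    assert MyApp.Counter.get() == 0
  end
end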
Example project structure:
# Folder structure
my_app/
  mix.exs
  config/
    config.exs
    runtime.exs
  lib/
    my_app/
      application.ex
      cluster/
        supervisor.ex
        connector.ex
      workers/
        supervisor.ex
        counter.ex
        connection_worker.ex
      ingestion/
        ingestion.ex
        queue_producer.ex
        transformer.ex
      pubsub/
        bus.ex
      telemetry.ex
  test/
    my_app/
      workers/
        counter_test.exs
mix.exs:
defmodule MyApp.MixProject do
  use Mix.Project

  def project do
    [
      app: :my_app,
      version: "0.1.0",
      elixir: "~> 1.15",
      start_permanent: Mix.env() == :prod,
      deps: deps()
    ]
  end

  def application do
    [
      extra_applications: [:logger],
      mod: {MyApp.Application, []}
    ]
  end

  defp deps do
    [
      {:libcluster, "~> 3.3"},
      {:telemetry, "~> 1.2"},
      {:broadway, "~> 1.0"},
      {:finch, "~> 0.16"}
    ]
  end
end
config/runtime.exs (example environment variables for clustering):
import Config

topologies = [
  my_app: [
    strategy: Cluster.Strategy.Gossip,
    config: [
      # Optionally filter nodes or set polling intervals
    ]
  ]
]

config :libcluster, topologies: topologies
lib/my_app/application.ex:
defmodule MyApp.Application do
  use Application

  def start(_type, _args) do
    children = [
      # MyApp.Cluster.Supervisor also starts MyApp.Workers.Supervisor (see Pattern 1).
      MyApp.Cluster.Supervisor,
      MyApp.Ingestion
    ]

    # The Application callback must start the top-level supervisor, not just build its spec.
    opts = [strategy: :one_for_one, name: MyApp.Supervisor]
    Supervisor.start_link(children, opts)
  end
end
Tooling notes:
- iex --sname alpha -S mix and iex --sname beta -S mix run two local nodes.
- Node.connect(:"beta@your-hostname") connects them manually; libcluster automates this in production.
- :observer.start() opens a graphical tool to inspect processes, supervisors, and ETS tables. It’s invaluable for understanding supervision trees and message load.
Learning resources
- Elixir official documentation: https://elixir-lang.org/docs.html
- Start with “Getting started” and “Mix and OTP” to understand supervision and GenServer.
- Erlang/OTP documentation: https://www.erlang.org/doc/
- Distribution and process semantics are well-documented here.
- Phoenix Framework guides: https://www.phoenixframework.org/
- Even if you’re not building web apps, Phoenix PubSub and Channels are excellent for distributed messaging patterns.
- Elixir School: https://elixirschool.com/
- A practical, community-driven resource covering core concepts and OTP.
- Pragmatic Studio’s Elixir/OTP course: https://pragmaticstudio.com/elixir-otp
- Paid but high quality; focuses on real patterns rather than theory.
- “Designing for Scalability with Erlang/OTP” by Francesco Cesarini and Steve Vinoski: https://www.oreilly.com/library/view/designing-for-scalability/9781449361556/
- The definitive book on OTP patterns, supervision, and distribution.
- Telemetry: https://hexdocs.pm/telemetry/
- Learn to instrument your applications for observability.
Summary and takeaways
Elixir’s fault tolerance is not a feature you bolt on; it’s a property of the runtime and the way you structure applications. In distributed systems, this means you can design for partial failures naturally, isolate work across lightweight processes, and rely on supervisors to restore healthy states. Combined with distribution protocols and tools like libcluster, Elixir offers a coherent model for multi-node deployments, especially where real-time behavior and graceful degradation are priorities.
Who should use Elixir:
- Teams building real-time, concurrent services that need to scale horizontally.
- Projects requiring long uptimes and online upgrades, such as IoT platforms or 24/7 control systems.
- Organizations that value operational simplicity and want to minimize custom resilience code.
Who might skip or complement it:
- Heavy compute workloads better suited to specialized stacks or native code.
- Systems that depend on strong global consistency; consider pairing Elixir with external consensus services.
- Teams locked into vendor SDKs without solid Elixir/Erlang support.
The practical path is to start with a clear supervision model, add distribution intentionally, and invest in observability. The result is not just “fault tolerance” as a buzzword, but a day-to-day reality where recovery is routine, outages are contained, and development focuses on features rather than firefighting.




