Learn about Update Domains and Fault Domains on Azure


Update Domains and Fault Domains are fundamental concepts in Azure’s high availability architecture, particularly when using Availability Sets.

Both are designed to help ensure that your applications remain available and resilient during hardware failures, maintenance events, and other disruptions.

Here's a detailed review of both concepts and how they work together.

Fault Domains (FD)

Definition

A Fault Domain (FD) is a logical grouping of physical hardware in an Azure data center.

Fault domains provide isolation at the hardware level to protect against infrastructure failures.

These failures can include things like power supply issues, hardware failures, network failures, or disk failures.

When you deploy Virtual Machines (VMs) in an Availability Set, Azure automatically spreads the VMs across multiple fault domains to reduce the impact of a failure.

Key Characteristics

Isolation of hardware failures

Fault domains are designed to protect against hardware failure in a specific server rack or hardware component.

For example, if the hardware in one fault domain fails, the VMs in other fault domains will continue to function.

Physical separation

Each fault domain typically represents a physical server rack with its own power supply, network, and cooling resources.

How Fault Domains Work

When you deploy VMs in an Availability Set, Azure distributes them across the fault domains configured for that set.

If one fault domain experiences an issue (e.g., network failure, hardware failure), only the VMs in that fault domain will be affected.

Default Fault Domains:

An Availability Set typically defaults to 2 fault domains. You can raise this to 3 where the region supports it; the maximum depends on the region, not on the size of your deployment.

Example

You deploy three VMs in an Availability Set configured with 3 fault domains.

Azure places each VM in a different fault domain, so the VMs are spread across different racks of physical servers.

If one rack fails, the other VMs will continue to operate on other racks.
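The example above can be sketched as a small simulation. This is only an illustration: the round-robin placement, the VM names, and the function names are assumptions made for the sketch, and Azure's real placement logic is not exposed.

```python
from collections import defaultdict


def place_round_robin(vm_names, fault_domain_count):
    """Assign VMs to fault domain indexes in round-robin order (assumed policy)."""
    placement = defaultdict(list)
    for index, name in enumerate(vm_names):
        placement[index % fault_domain_count].append(name)
    return dict(placement)


def survivors(placement, failed_domain):
    """Return the VMs that keep running when one fault domain (rack) fails."""
    return sorted(
        name
        for domain, names in placement.items()
        if domain != failed_domain
        for name in names
    )


placement = place_round_robin(["vm-0", "vm-1", "vm-2"], fault_domain_count=3)
print(placement)                              # {0: ['vm-0'], 1: ['vm-1'], 2: ['vm-2']}
print(survivors(placement, failed_domain=1))  # ['vm-0', 'vm-2']
```

With three VMs and three fault domains, losing any one rack leaves two VMs running.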

Limitations

You can configure up to 3 fault domains, but the maximum depends on the region: some regions support only 2.

The maximum fault domains available may be fewer for certain specialized VM sizes (like large memory or GPU-based VMs).

Use Case

If your app has high availability requirements and you want to ensure resilience against hardware failures, deploying your VMs across multiple fault domains is crucial.

Update Domains (UD)

Definition

An Update Domain (UD) is a logical grouping of VMs that can be updated or rebooted together during planned maintenance or platform updates by Azure.

The goal of update domains is to ensure that not all VMs are affected by planned maintenance at the same time.

When Azure performs system maintenance, such as patching or updates to the underlying platform (hypervisor, storage, etc.), it will apply updates to one update domain at a time.

This ensures that the other update domains remain operational, thus minimizing downtime for your application.

Key Characteristics

Isolation of planned maintenance

Update domains help avoid the scenario where all VMs in an Availability Set are taken down at once for maintenance.

Only the VMs in the current update domain are impacted by updates.

Logical grouping

Unlike fault domains, update domains don't map to specific physical infrastructure like server racks, but rather to software-controlled logical groups.

How Update Domains Work

When you create an Availability Set, Azure assigns VMs to update domains based on the number of update domains you specify.

For example, if you specify 5 update domains, Azure will distribute your VMs across 5 logical groups.

During maintenance, Azure updates (for example, patches or reboots) only one update domain at a time, so VMs in the other update domains are not affected and continue to serve traffic.
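The domain counts are set when the Availability Set is created. As a rough sketch, the helper below builds an ARM-style resource definition as a Python dictionary and rejects counts outside the limits discussed in this article. The helper name, the example set name "web-avset", and the "eastus" location are hypothetical, and the apiVersion field is deliberately left out, so treat this as a fragment rather than a complete template.

```python
import json

MAX_FAULT_DOMAINS = 3    # upper bound discussed in this article; the real limit varies by region
MAX_UPDATE_DOMAINS = 20  # upper bound discussed in this article


def availability_set_resource(name, location, fault_domains=2, update_domains=5):
    """Build an ARM-style availability set definition (hypothetical helper)."""
    if not 1 <= fault_domains <= MAX_FAULT_DOMAINS:
        raise ValueError(f"fault_domains must be between 1 and {MAX_FAULT_DOMAINS}")
    if not 1 <= update_domains <= MAX_UPDATE_DOMAINS:
        raise ValueError(f"update_domains must be between 1 and {MAX_UPDATE_DOMAINS}")
    return {
        "type": "Microsoft.Compute/availabilitySets",
        "name": name,
        "location": location,
        "sku": {"name": "Aligned"},  # SKU used for sets whose VMs have managed disks
        "properties": {
            "platformFaultDomainCount": fault_domains,
            "platformUpdateDomainCount": update_domains,
        },
    }


resource = availability_set_resource("web-avset", "eastus", fault_domains=2, update_domains=5)
print(json.dumps(resource, indent=2))
```

Validating the counts locally only catches obvious mistakes; the region can still reject a fault domain count it does not support.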

Example

You deploy 5 VMs in an Availability Set and specify 3 update domains.

Azure will distribute your VMs across the 3 update domains, meaning that during planned maintenance, only VMs in one update domain will be updated at a time.
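To make the 5-VM, 3-update-domain example concrete, the sketch below walks through one maintenance cycle. The round-robin assignment and the function names are assumptions for illustration; the point is only that each step touches a single update domain.

```python
def assign_update_domains(vm_count, update_domain_count):
    """Map VM index -> update domain index, round-robin (assumed policy)."""
    return {vm: vm % update_domain_count for vm in range(vm_count)}


def maintenance_schedule(assignment):
    """Yield (update_domain, vms_being_updated, vms_still_serving) per step."""
    for domain in sorted(set(assignment.values())):
        down = [vm for vm, ud in assignment.items() if ud == domain]
        up = [vm for vm, ud in assignment.items() if ud != domain]
        yield domain, down, up


assignment = assign_update_domains(vm_count=5, update_domain_count=3)
for domain, down, up in maintenance_schedule(assignment):
    print(f"UD {domain}: updating VMs {down}, still serving {up}")
# UD 0: updating VMs [0, 3], still serving [1, 2, 4]
# UD 1: updating VMs [1, 4], still serving [0, 2, 3]
# UD 2: updating VMs [2], still serving [0, 1, 3, 4]
```

Under this assignment, at least three of the five VMs keep serving traffic at every step of the maintenance cycle.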

Limitations

You can configure up to 20 update domains in an Availability Set; the default is 5.

Update domains only stagger maintenance; they do not route traffic. For large deployments, such as those using VMSS (Virtual Machine Scale Sets), pair them with Azure Load Balancer or Application Gateway so that traffic is distributed across the VMs that remain available.

Use Case

If your app requires minimal downtime during planned maintenance (e.g., patching, updates), using update domains ensures that your application remains available, even if one update domain is undergoing maintenance.

Comparison: Fault Domains vs Update Domains

| Aspect | Fault Domain | Update Domain |
| --- | --- | --- |
| Purpose | Protects against hardware failures | Protects against planned maintenance (reboots, patches) |
| Scope | At the physical infrastructure level (e.g., racks, servers, power supply) | At the software level (logical grouping for maintenance) |
| Isolation Type | Physical isolation of hardware resources (rack, power, network) | Logical isolation for software updates and maintenance |
| Example Failure Scenario | Hardware failure in one rack or server | Maintenance reboot or patching affecting only a subset of VMs |
| Default Setup in Availability Set | Azure distributes VMs across multiple fault domains, usually 2 or 3. | Azure distributes VMs across multiple update domains: 5 by default, up to 20. |
| Impact of Failure | A failure in one fault domain affects only VMs in that domain. | Only VMs in the update domain undergoing maintenance are affected. |
| Fault Tolerance | Protects VMs against physical hardware issues. | Protects VMs from simultaneous maintenance downtime. |
| Maximum Allowed | Up to 3 fault domains, depending on the region. | Up to 20 update domains. |
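Because every VM belongs to one fault domain and one update domain at the same time, it can help to look at both assignments together. The sketch below assumes a simple modulo assignment for both (an illustrative assumption, not a documented Azure algorithm) and reports the largest number of VMs lost to a single rack failure or a single maintenance step.

```python
from collections import Counter


def assign_domains(vm_count, fault_domains, update_domains):
    """Return (vm, fault_domain, update_domain) tuples using modulo assignment."""
    return [(vm, vm % fault_domains, vm % update_domains) for vm in range(vm_count)]


def worst_case_loss(assignments):
    """Largest number of VMs sharing one fault domain, and one update domain."""
    fd_sizes = Counter(fd for _, fd, _ in assignments)
    ud_sizes = Counter(ud for _, _, ud in assignments)
    return max(fd_sizes.values()), max(ud_sizes.values())


assignments = assign_domains(vm_count=6, fault_domains=3, update_domains=5)
for vm, fd, ud in assignments:
    print(f"vm-{vm}: FD {fd}, UD {ud}")

fd_loss, ud_loss = worst_case_loss(assignments)
print(f"worst single rack failure takes down {fd_loss} VMs")     # 2
print(f"worst single maintenance step takes down {ud_loss} VMs")  # 2
```

With six VMs, three fault domains, and five update domains, no single event in this model removes more than two VMs.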

Best Practices for Configuring Fault and Update Domains

Ensure Sufficient Redundancy

Always deploy at least two VMs in separate fault domains to ensure high availability in case of hardware failures.

Configure at least 2 update domains so that planned maintenance never touches all of your VMs at once.

Balance VMs Across Fault and Update Domains

Ensure that your VMs are evenly distributed across fault and update domains for optimal redundancy and resilience.

Consider SLA Requirements

For a 99.95% uptime SLA on VMs, ensure that you have at least two VMs deployed in an Availability Set, distributed across multiple fault domains and update domains.
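To put the 99.95% figure in perspective, the short calculation below converts an uptime percentage into a downtime budget. The 30-day month is an assumption chosen for the arithmetic, and the function name is illustrative.

```python
def allowed_downtime_minutes(sla_percent, days=30):
    """Minutes of downtime permitted by an uptime SLA over the given period."""
    total_minutes = days * 24 * 60
    return total_minutes * (100 - sla_percent) / 100


budget = allowed_downtime_minutes(99.95)
print(f"{budget:.1f} minutes per 30-day month")  # 21.6 minutes per 30-day month
```

In other words, a 99.95% target leaves a little over 21 minutes of downtime in a 30-day month.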

Test Failovers and Maintenance

Regularly test how your application behaves during failover scenarios or planned maintenance.

Ensure that services remain available when VMs are rebooted or undergo hardware failures.

Use Load Balancers

Integrate Azure Load Balancer or Application Gateway to distribute traffic among VMs deployed across different fault and update domains.

This ensures that traffic is always routed to healthy VMs.
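The idea of routing only to healthy VMs can be sketched as follows. This is a toy model: a real load balancer detects health with probes rather than knowing which domain is affected, and the Backend class, VM names, and round-robin routing are all assumptions made for the illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backend:
    name: str
    fault_domain: int
    update_domain: int


def healthy(backends, failed_fd=None, updating_ud=None):
    """Keep backends outside the failed fault domain and the updating update domain."""
    return [
        b for b in backends
        if b.fault_domain != failed_fd and b.update_domain != updating_ud
    ]


def route(requests, pool):
    """Spread requests over the healthy pool in round-robin order."""
    if not pool:
        raise RuntimeError("no healthy backends available")
    return {req: pool[i % len(pool)].name for i, req in enumerate(requests)}


backends = [
    Backend("vm-0", fault_domain=0, update_domain=0),
    Backend("vm-1", fault_domain=1, update_domain=1),
    Backend("vm-2", fault_domain=0, update_domain=2),
    Backend("vm-3", fault_domain=1, update_domain=0),
]

pool = healthy(backends, updating_ud=0)  # update domain 0 is under maintenance
print(route(["r1", "r2", "r3"], pool))   # {'r1': 'vm-1', 'r2': 'vm-2', 'r3': 'vm-1'}
```

While update domain 0 is being patched, requests go only to vm-1 and vm-2.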

Monitor Resource Health

Use Azure Monitor and Azure Service Health to keep track of the health of your VMs and understand when maintenance or failures might occur.

Advanced Considerations

Fault Domain + Availability Zones

Availability Sets provide fault domain and update domain isolation within a single data center. For protection against the failure of an entire data center, consider using Availability Zones (AZs).

Availability Zones are physically separate locations within a region, each made up of one or more data centers, which makes them more resilient than Availability Sets alone. They do not protect against the loss of a whole region.

VMSS (Virtual Machine Scale Sets)

When deploying VMSS, VMs are automatically distributed across fault and update domains.

You can configure auto-scaling to dynamically adjust the number of VMs based on demand, all while maintaining high availability.

Storage Considerations

Ensure that your storage accounts and disks are configured for high availability.

Azure Managed Disks and Azure Storage can be configured for ZRS (Zone-Redundant Storage) to further protect against failures.

Summary

Fault Domains (FD) provide protection against hardware failures by isolating VMs across different physical infrastructure, such as server racks.

Update Domains (UD) help mitigate planned maintenance by distributing VMs into logical groups that are updated one at a time during system maintenance or patching.

Both domains are integral to building highly available and resilient applications on Azure.

Availability Sets use these domains so that your application can ride through hardware failures and planned maintenance.

For even higher availability, Availability Zones offer cross-data-center redundancy within a region, providing protection against data center-level outages.

 

Rajnish, MCT
