TIL #6: Why a 'Stable' CloudFormation Stack Rollback is Anything but Stable

Mon Sep 01 2025

A recent deployment issue taught me a lesson about the hidden complexities of cloud infrastructure and the true meaning of the Shared Responsibility Model. A failed CloudFormation deployment should, in theory, roll back to a clean, stable state, but I learned firsthand that this isn't always the case. The result was a cascade of AlreadyExists errors on every subsequent attempt, revealing a deeper philosophical tension at the heart of our cloud provider's design.


The AWS Philosophy of Total Flexibility

AWS is built on a foundation of immense flexibility. The Shared Responsibility Model dictates that while AWS handles the security of the cloud, we as users are entirely responsible for the security in the cloud, including our infrastructure and configurations. This philosophy is powerful. It allows for direct, manual control over almost any resource, from scaling an EC2 instance via a simple API call to making direct changes to service permissions. It gives developers total freedom, operating on the principle that "you're responsible for your own infrastructure."

However, this freedom often clashes with the core promise of Infrastructure as Code (IaC).


The IaC Promise of a Consistent State

The entire goal of IaC is to ensure that our code is the single source of truth for our infrastructure. A tool like CloudFormation is designed to maintain a consistent state, allowing us to deploy, update, and manage resources in a repeatable and predictable way.
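To make "code as the single source of truth" concrete, here is a minimal CloudFormation template sketch. The resource and bucket names are hypothetical; in a real stack the bucket name must be globally unique:

```yaml
AWSTemplateFormatVersion: "2010-09-09"
Description: Minimal illustrative stack; resource names are hypothetical.
Resources:
  ArtifactBucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: example-artifact-bucket  # hypothetical; must be globally unique
```

Everything the stack owns is declared here, and CloudFormation's job is to make reality match this file on every deploy, update, and rollback.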

The problem arises when this promise is broken. A stack rollback is supposed to revert any changes made during a failed deployment, but in practice it often leaves behind what I call ghost resources. These are resources—like an orphaned Lake Formation permission or a DNS record—that were successfully created before the failure but were never cleaned up during the rollback. CloudFormation's state no longer tracks them, yet they are very much alive in the AWS console, so the next deployment collides with them. This forces us to intervene manually, directly contradicting the purpose of automation and revealing a real weakness in the service.
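The cleanup step is essentially a set difference: resources that exist in the account minus resources the rolled-back stack still tracks. A minimal sketch, with hypothetical resource IDs; in practice the two lists would come from `aws cloudformation list-stack-resources` and the individual service APIs:

```python
def find_ghost_resources(actually_exists, tracked_by_stack):
    """Resources alive in the account but no longer tracked by the
    rolled-back stack -- exactly the ones that trigger AlreadyExists
    on the next deployment attempt."""
    return sorted(set(actually_exists) - set(tracked_by_stack))


# Hypothetical physical IDs for illustration only.
in_account = ["orders-table", "api.example.com-dns", "lakeformation-grant-42"]
in_stack_state = ["orders-table"]  # the rollback dropped the other two

ghosts = find_ghost_resources(in_account, in_stack_state)
print(ghosts)  # the orphans that must be deleted by hand before retrying
```

Once identified, each ghost has to be deleted through its own service's console or API before the stack can be deployed again.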


The Overlapping Operations Problem

A key flaw in this system is its lack of governance over concurrent operations. There are no built-in mechanisms to block manual changes while an automated deployment is in progress. This allows for an overlapping state where our IaC is fighting against a manual change, leading to corruption and instability. While this flexibility can be seen as an advantage for some, it is a significant liability for teams aiming for a consistent and reproducible environment.
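The missing guardrail is essentially a deployment lock: while an automated run holds it, any out-of-band mutation is rejected. A minimal sketch of that idea in plain Python (the class and actor names are invented for illustration; this is not an AWS API):

```python
import threading

class GuardedState:
    """Sketch of the guardrail AWS lacks: reject any change that is not
    made by the holder of the active deployment lock, so manual and
    automated operations cannot overlap."""

    def __init__(self):
        self._lock = threading.Lock()
        self._holder = None
        self.resources = {}

    def begin_deployment(self, actor):
        if not self._lock.acquire(blocking=False):
            raise RuntimeError(f"deployment already in progress by {self._holder}")
        self._holder = actor

    def end_deployment(self, actor):
        if self._holder != actor:
            raise RuntimeError("only the lock holder may finish the deployment")
        self._holder = None
        self._lock.release()

    def mutate(self, actor, name, value):
        if self._holder != actor:
            raise RuntimeError(f"{actor} changed {name} outside the deployment")
        self.resources[name] = value


state = GuardedState()
state.begin_deployment("cloudformation")
state.mutate("cloudformation", "bucket-policy", "v2")   # allowed
try:
    state.mutate("console-user", "bucket-policy", "manual-edit")  # blocked
except RuntimeError as err:
    print(err)
state.end_deployment("cloudformation")
```

Terraform approximates this with state locking (for example via a DynamoDB lock table); CloudFormation offers nothing equivalent against direct console or API changes.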

This is where the philosophical difference between cloud providers becomes clear. While AWS prioritizes freedom, platforms like Azure and GCP often have a more restrictive, "guardian" philosophy, preventing manual changes that could compromise the integrity of an automated deployment. This can be a major advantage in a production setting, where reliability and consistency are non-negotiable.


The Reality of Broken State Files

It’s easy to blame CloudFormation for these issues, but the truth is, this problem isn't unique to one tool. Terraform, another popular IaC tool, manages infrastructure by keeping a state file. If someone manually changes a resource while an apply is in progress, or while resources are being destroyed, the state file can be corrupted. When Terraform's state breaks, it can be much harder to fix: a broken state file can lose track of your entire infrastructure, not just a single resource.
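To see why a broken state file is so damaging, consider what it holds: a mapping from every declared resource to the real thing in the account. A sketch of that failure mode, using a deliberately simplified structure loosely shaped like a Terraform state file (real state files carry many more fields; the resource names are hypothetical):

```python
import copy

# Illustrative shape only -- loosely modeled on a Terraform state file.
state = {
    "version": 4,
    "resources": [
        {"type": "aws_s3_bucket", "name": "artifacts"},
        {"type": "aws_route53_record", "name": "api"},
    ],
}

def tracked_addresses(st):
    """Resource addresses the tool believes it manages."""
    return {f'{r["type"]}.{r["name"]}' for r in st["resources"]}

# A corrupted or partial write that drops entries makes the tool
# forget resources that still exist in the account:
corrupted = copy.deepcopy(state)
corrupted["resources"] = corrupted["resources"][:1]

forgotten = tracked_addresses(state) - tracked_addresses(corrupted)
print(forgotten)  # on the next run these get planned for re-creation
```

Every address in `forgotten` still exists in the cloud but is invisible to the tool, so the next run tries to create it again: the same AlreadyExists failure, now multiplied across the whole file.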


Ultimately, we resolved the issue by manually cleaning the ghost resources from the AWS console. This frustrating experience served as a powerful reminder: while AWS provides immense power, it demands deep vigilance. The Shared Responsibility Model isn't just about security; it’s about acknowledging and managing the inherent tension between absolute flexibility and the predictable consistency that IaC promises.