TIL #1: Lake Formation: Your Data Lake’s Double Lock 🔒

Tue Jul 29 2025

Data lakes present a significant challenge in access governance. Managing granular permissions directly through IAM policies on S3 quickly becomes a nightmare at scale. This is where AWS Lake Formation comes in, but with a crucial subtlety: it is an additional permission layer on top of IAM, not a substitute for it.

Don't be mistaken: Lake Formation DOES NOT REPLACE IAM policies. Accessing a data resource in your Data Lake is always subject to a double check, like a double door. You must be authorized by IAM first, and then by Lake Formation.
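The double door can be pictured with a toy model. This is only a sketch: `can_read` and its two boolean inputs are hypothetical names used for illustration and are not part of any AWS API.

```python
def can_read(iam_allows: bool, lake_formation_allows: bool) -> bool:
    """Both doors must open: a refusal at either layer blocks the request."""
    return iam_allows and lake_formation_allows


# Enumerate the four combinations; only (True, True) grants access.
outcomes = {
    (iam, lf): can_read(iam, lf)
    for iam in (True, False)
    for lf in (True, False)
}

if __name__ == "__main__":
    for (iam, lf), allowed in outcomes.items():
        print(f"IAM={iam!s:5} LakeFormation={lf!s:5} -> access={allowed}")
```

A permissive IAM policy does not help if Lake Formation says no, and vice versa.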

This system implies a strategic decision:

  • You can let IAM be the primary restriction gate, with Lake Formation "wide open."
  • Or let Lake Formation manage access granularity while having broader IAM permissions.

The key is to decide which door you want to actively manage for granularity. Trying to manage both simultaneously for the same restrictions can lead to confusion and redundant policies.
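The two strategies can be sketched as request payloads for boto3's `lakeformation.grant_permissions`. This assumes that "wide open" means granting to the `IAM_ALLOWED_PRINCIPALS` group, which is one common reading rather than something stated above. The helper names, database, table, columns, and role ARN are all hypothetical.

```python
def wide_open_grant(database: str, table: str) -> dict:
    """Strategy 1: IAM is the real gate; Lake Formation defers to IAM."""
    return {
        # Assumption: "wide open" is modeled with the IAM_ALLOWED_PRINCIPALS group.
        "Principal": {"DataLakePrincipalIdentifier": "IAM_ALLOWED_PRINCIPALS"},
        "Resource": {"Table": {"DatabaseName": database, "Name": table}},
        "Permissions": ["ALL"],
    }


def fine_grained_grant(principal_arn: str, database: str, table: str,
                       columns: list) -> dict:
    """Strategy 2: broad IAM, with Lake Formation restricting to chosen columns."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": list(columns),
            }
        },
        "Permissions": ["SELECT"],
    }


def apply_grant(grant: dict) -> None:
    """Send the grant to AWS (needs boto3 and credentials; not run here)."""
    import boto3  # imported lazily so the payload builders work offline
    boto3.client("lakeformation").grant_permissions(**grant)


if __name__ == "__main__":
    print(wide_open_grant("sales", "orders"))
    print(fine_grained_grant(
        "arn:aws:iam::111122223333:role/analyst",  # hypothetical role
        "sales", "orders", ["order_id", "amount"],
    ))
```

Picking one of the two builders per dataset, rather than mixing them, keeps the restriction logic in a single place.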

Location Management: The Essential Prerequisite

For Lake Formation to secure your data, it must be aware of it. Any data you want to govern with Lake Formation that resides in an S3 bucket must have its S3 path registered in Lake Formation.

This registration isn't trivial:

  • It's performed via a dedicated registration role. The Lake Formation service itself assumes this role to access the S3 objects and the associated KMS keys. As a result, end users accessing data via Lake Formation don't need direct IAM privileges on S3 or KMS; Lake Formation uses the role's privileges on their behalf.
  • Registering a path includes all child paths. By registering a bucket name, you effectively register everything it contains.
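As a rough illustration of both points, the sketch below builds the arguments for boto3's `lakeformation.register_resource` and models the "parent path covers its children" rule with a simple prefix check. The bucket name, role ARN, and helper names are hypothetical, and the prefix check is a simplification of what the service does.

```python
def register_location_request(s3_arn: str, role_arn: str) -> dict:
    """Arguments for register_resource using a custom registration role."""
    return {
        "ResourceArn": s3_arn,
        "UseServiceLinkedRole": False,
        "RoleArn": role_arn,
    }


def is_covered(registered_arn: str, location_arn: str) -> bool:
    """True if location_arn is the registered path or one of its children."""
    registered = registered_arn.rstrip("/")
    location = location_arn.rstrip("/")
    # Require a "/" boundary so "my-lake/raw" does not cover "my-lake/rawdata".
    return location == registered or location.startswith(registered + "/")


def register(request: dict) -> None:
    """Send the registration to AWS (needs boto3 and credentials; not run here)."""
    import boto3  # imported lazily so the helpers above work offline
    boto3.client("lakeformation").register_resource(**request)


if __name__ == "__main__":
    req = register_location_request(
        "arn:aws:s3:::my-lake",                          # hypothetical bucket
        "arn:aws:iam::111122223333:role/lf-register",    # hypothetical role
    )
    print(req)
    print(is_covered("arn:aws:s3:::my-lake", "arn:aws:s3:::my-lake/raw/orders"))
```

Registering the bucket ARN alone is enough for `is_covered` to return true for every prefix underneath it.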

Beware of the DATA_LOCATION_ACCESS trap: Glue Catalog databases and tables each have a location path. If that location is not itself a registered path (an exact match, not merely a child of a registered parent), principals accessing the data via Lake Formation require an additional Lake Formation permission: DATA_LOCATION_ACCESS. This permission must be granted on every registered path they need. It's a critical nuance: without it, users may hold the right table permissions and still be unable to reach the underlying data.
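The rule as described here can be encoded in a small check, followed by the payload for the corresponding `grant_permissions` call. This is a sketch under that description, not a statement of how the service evaluates access. The helper names, paths, and role ARN are hypothetical.

```python
def to_s3_arn(s3_uri: str) -> str:
    """Convert 's3://bucket/prefix/' to 'arn:aws:s3:::bucket/prefix'."""
    if not s3_uri.startswith("s3://"):
        raise ValueError(f"not an S3 URI: {s3_uri}")
    return "arn:aws:s3:::" + s3_uri[len("s3://"):].rstrip("/")


def needs_data_location_access(table_location: str, registered_arns: list) -> bool:
    """Per the rule above: needed unless the location is an exact registered path."""
    exact = {arn.rstrip("/") for arn in registered_arns}
    return to_s3_arn(table_location) not in exact


def data_location_grant(principal_arn: str, registered_s3_arn: str) -> dict:
    """Arguments for lakeformation.grant_permissions on a data location."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {"DataLocation": {"ResourceArn": registered_s3_arn}},
        "Permissions": ["DATA_LOCATION_ACCESS"],
    }


if __name__ == "__main__":
    registered = ["arn:aws:s3:::my-lake"]          # hypothetical registered bucket
    location = "s3://my-lake/raw/orders/"          # hypothetical table location
    if needs_data_location_access(location, registered):
        print(data_location_grant(
            "arn:aws:iam::111122223333:role/analyst",  # hypothetical role
            registered[0],
        ))
```

The payload can be passed to `boto3.client("lakeformation").grant_permissions(**payload)` once credentials are available.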
